Florence2-Sharp

Supported tasks

TaskTypes enumerates every Florence-2 task this wrapper recognises. Each one produces a FlorenceResults; which fields are populated depends on the task. The table below is the cheat-sheet.

Quick reference

| TaskTypes value | What it does | Read from FlorenceResults |
|---|---|---|
| CAPTION | Short caption describing the whole image. | PureText |
| DETAILED_CAPTION | Longer caption with more detail. | PureText |
| MORE_DETAILED_CAPTION | Verbose, paragraph-length description. | PureText |
| OCR | Read all text in the image as plain text. | PureText |
| OCR_WITH_REGION | Read all text and return the quad-box around each region. | OCRBBox |
| OD | Object detection: bounding boxes with class labels. | BoundingBoxes |
| DENSE_REGION_CAPTION | Many bounding boxes, each captioned. | BoundingBoxes |
| REGION_PROPOSAL | Bounding boxes for salient regions, no labels. | BoundingBoxes |
| CAPTION_TO_PHRASE_GROUNDING | Given a phrase, highlight matching regions in the image. Pass the phrase as textInput. | BoundingBoxes |
| OPEN_VOCABULARY_DETECTION | Detect objects matching a text prompt. Pass the prompt as textInput. | BoundingBoxes |
| REGION_TO_SEGMENTATION | Given a bounding box, produce a segmentation polygon. Pass the box in textInput using <loc_…> tokens. | Polygons |
| REGION_TO_CATEGORY | Classify the object inside a given bounding box. Pass the box in textInput using <loc_…> tokens. | PureText |
| REGION_TO_DESCRIPTION | Describe the object inside a given bounding box. Pass the box in textInput using <loc_…> tokens. | PureText |
| REGION_TO_OCR | Read text from a specific bounding box. Pass the box in textInput using <loc_…> tokens. | PureText |
| REFERRING_EXPRESSION_SEGMENTATION | Segment the region matching a referring expression. Not currently working; see the note at the end of this page. | Polygons |
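
The task-to-field mapping in the quick reference can also be written as a lookup, which is handy when the task is chosen at runtime. The following is a minimal sketch, not part of the wrapper: the enum is a local stand-in that mirrors the names in the table (delete it if you reference the library's own TaskTypes), and TaskFields.FieldFor is a hypothetical helper name.

```csharp
using System;

// Local stand-in mirroring the TaskTypes names from the quick reference.
// Assumption: only the names matter here; numeric values are not used.
public enum TaskTypes
{
    CAPTION, DETAILED_CAPTION, MORE_DETAILED_CAPTION, OCR, OCR_WITH_REGION,
    OD, DENSE_REGION_CAPTION, REGION_PROPOSAL, CAPTION_TO_PHRASE_GROUNDING,
    OPEN_VOCABULARY_DETECTION, REGION_TO_SEGMENTATION, REGION_TO_CATEGORY,
    REGION_TO_DESCRIPTION, REGION_TO_OCR, REFERRING_EXPRESSION_SEGMENTATION
}

public static class TaskFields
{
    // Hypothetical helper: returns the name of the FlorenceResults field
    // that the quick-reference table says is populated for the given task.
    public static string FieldFor(TaskTypes task) => task switch
    {
        TaskTypes.CAPTION
            or TaskTypes.DETAILED_CAPTION
            or TaskTypes.MORE_DETAILED_CAPTION
            or TaskTypes.OCR
            or TaskTypes.REGION_TO_CATEGORY
            or TaskTypes.REGION_TO_DESCRIPTION
            or TaskTypes.REGION_TO_OCR => "PureText",

        TaskTypes.OCR_WITH_REGION => "OCRBBox",

        TaskTypes.OD
            or TaskTypes.DENSE_REGION_CAPTION
            or TaskTypes.REGION_PROPOSAL
            or TaskTypes.CAPTION_TO_PHRASE_GROUNDING
            or TaskTypes.OPEN_VOCABULARY_DETECTION => "BoundingBoxes",

        TaskTypes.REGION_TO_SEGMENTATION
            or TaskTypes.REFERRING_EXPRESSION_SEGMENTATION => "Polygons",

        _ => throw new ArgumentOutOfRangeException(nameof(task))
    };
}
```

For example, TaskFields.FieldFor(TaskTypes.OCR_WITH_REGION) returns "OCRBBox", so that is the field to read after running that task.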

Tasks that need a textInput

Most tasks ignore the third argument to Run. The following tasks require it:

  • CAPTION_TO_PHRASE_GROUNDING: textInput is the natural-language phrase to ground. Example: "the red car".
  • OPEN_VOCABULARY_DETECTION: textInput is the open-vocabulary class to detect. Example: "traffic light".
  • REGION_TO_* tasks: textInput encodes the region using <loc_xxx> tokens in Florence-2's 1000-bin coordinate space. See OCR for how to compute them from pixel coordinates.

For everything else, pass null or just omit the argument.
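
As an illustration of the 1000-bin coordinate space, here is a minimal sketch of turning a pixel-space box into a <loc_…> string. It assumes the convention used by the reference Florence-2 processor (bin = floor(pixel × 1000 / image size), clamped to 0..999) and the token order x1, y1, x2, y2. LocTokens, ToBin and FromPixelBox are hypothetical names, not wrapper API, and the exact string the wrapper expects should be checked against the guides.

```csharp
using System;

public static class LocTokens
{
    // Florence-2 quantises each axis into 1000 bins (indices 0..999).
    private const int Bins = 1000;

    // Maps one pixel coordinate to its bin index along an axis of the given size.
    public static int ToBin(float pixel, int imageSize)
    {
        if (imageSize <= 0) throw new ArgumentOutOfRangeException(nameof(imageSize));
        int bin = (int)Math.Floor((double)pixel * Bins / imageSize);
        // Clamp so that pixel == imageSize (or slightly out of range) stays valid.
        return Math.Clamp(bin, 0, Bins - 1);
    }

    // Assumed token order: x1, y1, x2, y2 (top-left, then bottom-right corner).
    public static string FromPixelBox(
        float x1, float y1, float x2, float y2, int imageWidth, int imageHeight)
    {
        return $"<loc_{ToBin(x1, imageWidth)}><loc_{ToBin(y1, imageHeight)}>" +
               $"<loc_{ToBin(x2, imageWidth)}><loc_{ToBin(y2, imageHeight)}>";
    }
}
```

For an 800×600 image, LocTokens.FromPixelBox(100, 50, 400, 300, 800, 600) yields "<loc_125><loc_83><loc_500><loc_500>", which could then be passed as textInput to one of the REGION_TO_* tasks.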

The result types

public class FlorenceResults
{
    public LabeledOCRBox[]        OCRBBox;       // OCR_WITH_REGION
    public string                 PureText;      // captions, plain OCR, region descriptions
    public LabeledBoundingBoxes[] BoundingBoxes; // OD, dense region caption, region proposal, grounding, open-vocabulary detection
    public LabeledPolygon[]       Polygons;      // segmentation tasks
}

public class LabeledBoundingBoxes
{
    public BoundingBox<float>[] BBoxes;
    public string               Label;
}

public class LabeledOCRBox
{
    public Coordinates<float>[] QuadBox;  // four corner points
    public string               Text;
}

public class LabeledPolygon
{
    public string                   Label;
    public List<Coordinates<float>> Polygon;
    public List<BoundingBox<float>> BBoxes;
}

All coordinates are in the pixel space of the original image; the wrapper de-normalises Florence-2's <loc_…> tokens for you.
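
You never need to do this conversion yourself, but a short sketch may make "de-normalising" concrete. It assumes the bin-centre convention of the reference Florence-2 post-processor (pixel = (bin + 0.5) × image size / 1000); the wrapper's own rounding may differ. LocDecoding, ParseBins and ToPixel are hypothetical names used only for illustration.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class LocDecoding
{
    // Florence-2 quantises each axis into 1000 bins (indices 0..999).
    private const int Bins = 1000;

    // Extracts the bin indices from a string such as "<loc_125><loc_83>".
    public static int[] ParseBins(string tokens)
    {
        return Regex.Matches(tokens, "<loc_([0-9]+)>")
                    .Cast<Match>()
                    .Select(m => int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture))
                    .ToArray();
    }

    // Maps a bin index back to a pixel coordinate at the centre of that bin.
    public static float ToPixel(int bin, int imageSize)
    {
        if (bin < 0 || bin >= Bins) throw new ArgumentOutOfRangeException(nameof(bin));
        return (bin + 0.5f) * imageSize / Bins;
    }
}
```

With an 800-pixel-wide image, LocDecoding.ToPixel(125, 800) gives roughly 100.4, so a round trip through the bins can move a coordinate by up to about one bin width.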

Picking the right task

  • "What's in this picture?" → start with CAPTION, escalate to DETAILED_CAPTION or MORE_DETAILED_CAPTION.
  • "Read this receipt / sign / screenshot." → OCR for plain text, OCR_WITH_REGION if you also want the quad-box around each text region.
  • "Where are the cars / people / cats?" → OPEN_VOCABULARY_DETECTION with a class prompt, or OD for generic detection.
  • "Highlight the part of the image that matches this sentence." → CAPTION_TO_PHRASE_GROUNDING.
  • "Describe the object inside this box." → REGION_TO_DESCRIPTION with a <loc_…> textInput.

For full code examples task by task, see the Guides.

A note on REFERRING_EXPRESSION_SEGMENTATION

REFERRING_EXPRESSION_SEGMENTATION is enumerated for completeness but does not produce valid output in the current release: the generated <loc_…> token stream is not well-formed. As a workaround, use CAPTION_TO_PHRASE_GROUNDING to get a bounding box for the expression, then pass that box to REGION_TO_SEGMENTATION to get a polygon.
