Supported tasks
TaskTypes enumerates every Florence-2 task this wrapper recognises. Each one produces a FlorenceResults; which fields are populated depends on the task. The table below is the cheat-sheet.
Quick reference
| TaskTypes value | What it does | Read from FlorenceResults |
|---|---|---|
| CAPTION | Short caption describing the whole image. | PureText |
| DETAILED_CAPTION | Longer caption with more detail. | PureText |
| MORE_DETAILED_CAPTION | Verbose, paragraph-length description. | PureText |
| OCR | Read all text in the image as plain text. | PureText |
| OCR_WITH_REGION | Read all text and return the quad-box around each region. | OCRBBox |
| OD | Object detection: bounding boxes with class labels. | BoundingBoxes |
| DENSE_REGION_CAPTION | Many bounding boxes, each captioned. | BoundingBoxes |
| REGION_PROPOSAL | Bounding boxes for salient regions, no labels. | BoundingBoxes |
| CAPTION_TO_PHRASE_GROUNDING | Given a phrase, highlight matching regions in the image. Pass the phrase as textInput. | BoundingBoxes |
| OPEN_VOCABULARY_DETECTION | Detect objects matching a text prompt. Pass the prompt as textInput. | BoundingBoxes |
| REGION_TO_SEGMENTATION | Given a bounding box, produce a segmentation polygon. Pass the box in textInput using <loc_…> tokens. | Polygons |
| REGION_TO_CATEGORY | Classify the object inside a given bounding box. | PureText |
| REGION_TO_DESCRIPTION | Describe the object inside a given bounding box. | PureText |
| REGION_TO_OCR | Read text from a specific bounding box. | PureText |
| REFERRING_EXPRESSION_SEGMENTATION | Segment the region matching a referring expression. Not currently working; see notes. | Polygons |
Tasks that need a textInput
Most tasks ignore the third argument to Run. The following tasks require it:
- CAPTION_TO_PHRASE_GROUNDING: textInput is the natural-language phrase to ground. Example: "the red car".
- OPEN_VOCABULARY_DETECTION: textInput is the open-vocabulary class to detect. Example: "traffic light".
- REGION_TO_* tasks: textInput encodes the region using <loc_xxx> tokens in Florence-2's 1000-bin coordinate space. See OCR for how to compute them from pixel coordinates.
For everything else, pass null or just omit the argument.
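To make the 1000-bin coordinate space concrete, here is a minimal sketch of turning a pixel-space box into a <loc_…> string. It assumes the usual Florence-2 convention (each axis is split into 1000 bins, and a box is written as x1, y1, x2, y2) and that the wrapper forwards textInput unchanged. LocTokens, ToBin and FromPixelBox are hypothetical names for illustration, not part of the wrapper.

```csharp
using System;

// Hypothetical helper, not part of the wrapper's API.
public static class LocTokens
{
    // Florence-2 quantises each axis into 1000 bins: <loc_0> .. <loc_999>.
    private const int Bins = 1000;

    // Maps one pixel coordinate to a bin index: floor(pixel * 1000 / size),
    // clamped so that a coordinate on the far edge still lands in bin 999.
    public static int ToBin(float pixel, int imageSize)
    {
        int bin = (int)Math.Floor(pixel * Bins / imageSize);
        return Math.Clamp(bin, 0, Bins - 1);
    }

    // Encodes a pixel-space box as <loc_x1><loc_y1><loc_x2><loc_y2>.
    // The x1, y1, x2, y2 ordering is an assumption based on Florence-2's prompt format.
    public static string FromPixelBox(
        float x1, float y1, float x2, float y2, int imageWidth, int imageHeight)
    {
        return $"<loc_{ToBin(x1, imageWidth)}><loc_{ToBin(y1, imageHeight)}>" +
               $"<loc_{ToBin(x2, imageWidth)}><loc_{ToBin(y2, imageHeight)}>";
    }
}
```

For a 640×480 image, the box (160, 120)-(480, 360) becomes <loc_250><loc_250><loc_750><loc_750>, which could then be passed as textInput to a REGION_TO_* task.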
The result types
public class FlorenceResults
{
    public LabeledOCRBox[] OCRBBox;               // OCR_WITH_REGION
    public string PureText;                       // captions, plain OCR, region descriptions
    public LabeledBoundingBoxes[] BoundingBoxes;  // OD, grounding, region proposal
    public LabeledPolygon[] Polygons;             // segmentation tasks
}

public class LabeledBoundingBoxes
{
    public BoundingBox<float>[] BBoxes;
    public string Label;
}

public class LabeledOCRBox
{
    public Coordinates<float>[] QuadBox;          // four corner points
    public string Text;
}

public class LabeledPolygon
{
    public string Label;
    public List<Coordinates<float>> Polygon;
    public List<BoundingBox<float>> BBoxes;
}
Coordinates are in pixel space of the original image. The wrapper de-normalises Florence-2's <loc_…> tokens for you.
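As an illustration of what de-normalising means, the sketch below maps a bin index back to a pixel coordinate using the bin-centre rule, (bin + 0.5) × size / 1000, which is how the reference Florence-2 post-processing decodes boxes. Whether this wrapper uses exactly the same rounding is an assumption, and LocDecoding and ToPixel are hypothetical names, not part of the wrapper.

```csharp
using System;

// Hypothetical helper, not part of the wrapper's API.
public static class LocDecoding
{
    // Florence-2 uses 1000 bins per axis: <loc_0> .. <loc_999>.
    private const int Bins = 1000;

    // Returns the pixel coordinate at the centre of the given bin.
    // Assumption: bin-centre decoding; the wrapper's rounding may differ.
    public static float ToPixel(int bin, int imageSize)
    {
        if (bin < 0 || bin >= Bins)
            throw new ArgumentOutOfRangeException(nameof(bin));
        return (bin + 0.5f) * imageSize / Bins;
    }
}
```

Because each bin covers imageSize / 1000 pixels, a coordinate that round-trips through <loc_…> tokens can differ from the original by up to half a bin, so returned values should be treated as approximate on large images.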
Picking the right task
- "What's in this picture?" → start with CAPTION, escalate to DETAILED_CAPTION or MORE_DETAILED_CAPTION.
- "Read this receipt / sign / screenshot." → OCR for plain text, OCR_WITH_REGION if you also want the quad-box around each text region.
- "Where are the cars / people / cats?" → OPEN_VOCABULARY_DETECTION with a class prompt, or OD for generic detection.
- "Highlight the part of the image that matches this sentence." → CAPTION_TO_PHRASE_GROUNDING.
- "Describe the object inside this box." → REGION_TO_DESCRIPTION with a <loc_…> textInput.
For full code examples task by task, see the Guides.
A note on REFERRING_EXPRESSION_SEGMENTATION
REFERRING_EXPRESSION_SEGMENTATION is enumerated for completeness but does not produce valid output in the current release: the generated <loc_…> token stream is not well-formed. As a workaround, use CAPTION_TO_PHRASE_GROUNDING to get a bounding box, then pass that box to REGION_TO_SEGMENTATION to get a polygon.