Supported tasks

TaskTypes enumerates every Florence-2 task this wrapper recognises. Each one produces a FlorenceResults; which fields are populated depends on the task. The table below is the cheat-sheet.

Quick reference

`TaskTypes` value	What it does	Read from `FlorenceResults`
`CAPTION`	Short caption describing the whole image.	`PureText`
`DETAILED_CAPTION`	Longer caption with more detail.	`PureText`
`MORE_DETAILED_CAPTION`	Verbose, paragraph-length description.	`PureText`
`OCR`	Read all text in the image as plain text.	`PureText`
`OCR_WITH_REGION`	Read all text and return the quad-box around each region.	`OCRBBox`
`OD`	Object detection: bounding boxes with class labels.	`BoundingBoxes`
`DENSE_REGION_CAPTION`	Many bounding boxes, each captioned.	`BoundingBoxes`
`REGION_PROPOSAL`	Bounding boxes for salient regions, no labels.	`BoundingBoxes`
`CAPTION_TO_PHRASE_GROUNDING`	Given a phrase, highlight matching regions in the image. Pass the phrase as `textInput`.	`BoundingBoxes`
`OPEN_VOCABULARY_DETECTION`	Detect objects matching a text prompt. Pass the prompt as `textInput`.	`BoundingBoxes`
`REGION_TO_SEGMENTATION`	Given a bounding box, produce a segmentation polygon. Pass the box in `textInput` using `<loc_…>` tokens.	`Polygons`
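`REGION_TO_CATEGORY`	Classify the object inside a given bounding box. Pass the box in `textInput` using `<loc_…>` tokens.	`PureText`
`REGION_TO_DESCRIPTION`	Describe the object inside a given bounding box. Pass the box in `textInput` using `<loc_…>` tokens.	`PureText`
`REGION_TO_OCR`	Read text from a specific bounding box. Pass the box in `textInput` using `<loc_…>` tokens.	`PureText`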
`REFERRING_EXPRESSION_SEGMENTATION`	Segment the region matching a referring expression. Not currently working; see the note at the end of this section.	`Polygons`
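
The table can also be read as a lookup from task to result field. The sketch below encodes it that way. The `TaskTypes` enum declared here is only a local mirror of the names in the table so that the snippet compiles on its own (in a real project you would use the wrapper's enum instead, and its member order may differ), and `TaskFields.ResultField` is a hypothetical helper, not part of the wrapper.

```csharp
using System;

// Local mirror of the task names from the table (assumption: the wrapper's
// real enum has the same member names; order and values may differ).
public enum TaskTypes
{
    CAPTION, DETAILED_CAPTION, MORE_DETAILED_CAPTION,
    OCR, OCR_WITH_REGION,
    OD, DENSE_REGION_CAPTION, REGION_PROPOSAL,
    CAPTION_TO_PHRASE_GROUNDING, OPEN_VOCABULARY_DETECTION,
    REGION_TO_SEGMENTATION, REGION_TO_CATEGORY,
    REGION_TO_DESCRIPTION, REGION_TO_OCR,
    REFERRING_EXPRESSION_SEGMENTATION
}

// Hypothetical helper: returns the name of the FlorenceResults field
// that the table lists for each task.
public static class TaskFields
{
    public static string ResultField(TaskTypes task)
    {
        switch (task)
        {
            case TaskTypes.OCR_WITH_REGION:
                return "OCRBBox";

            case TaskTypes.OD:
            case TaskTypes.DENSE_REGION_CAPTION:
            case TaskTypes.REGION_PROPOSAL:
            case TaskTypes.CAPTION_TO_PHRASE_GROUNDING:
            case TaskTypes.OPEN_VOCABULARY_DETECTION:
                return "BoundingBoxes";

            case TaskTypes.REGION_TO_SEGMENTATION:
            case TaskTypes.REFERRING_EXPRESSION_SEGMENTATION:
                return "Polygons";

            default:
                // Captions, plain OCR, and the text-producing REGION_TO_* tasks.
                return "PureText";
        }
    }
}
```

A caller could use such a lookup to decide which field to inspect after a run, or to assert in tests that the expected field is non-null.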

Tasks that need a `textInput`

Most tasks ignore the third argument to Run. The following tasks require it:

CAPTION_TO_PHRASE_GROUNDING — textInput is the natural-language phrase to ground. Example: "the red car".
OPEN_VOCABULARY_DETECTION — textInput is the open-vocabulary class to detect. Example: "traffic light".
REGION_TO_* tasks — textInput encodes the region using <loc_xxx> tokens in Florence-2's 1000-bin coordinate space. See OCR for how to compute them from pixel coordinates.

For everything else, pass null or just omit the argument.
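
As a rough illustration of the 1000-bin coordinate space, the sketch below converts a pixel-space box into a `<loc_…>` string. It assumes the convention used by the upstream Florence-2 processor: each axis is split into 1000 equal bins (0 to 999), a coordinate maps to `floor(pixel * 1000 / size)`, and the tokens appear in x1, y1, x2, y2 order. `LocTokens`, `ToBin`, and `FromPixelBox` are hypothetical names, not part of the wrapper, so check the result against the wrapper's own guide before relying on it.

```csharp
using System;

// Hypothetical helper, not part of the wrapper.
public static class LocTokens
{
    // Assumption: Florence-2 quantises each axis into 1000 bins (0..999).
    private const int Bins = 1000;

    // Map one pixel coordinate to a bin index: floor(pixel * 1000 / size),
    // clamped so that out-of-range values (including pixel == size) stay in 0..999.
    public static int ToBin(float pixel, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        int bin = (int)Math.Floor((double)pixel * Bins / size);
        return Math.Clamp(bin, 0, Bins - 1);
    }

    // Encode a pixel-space box as four tokens in x1, y1, x2, y2 order (assumed order).
    public static string FromPixelBox(
        float x1, float y1, float x2, float y2,
        int imageWidth, int imageHeight)
    {
        return $"<loc_{ToBin(x1, imageWidth)}><loc_{ToBin(y1, imageHeight)}>"
             + $"<loc_{ToBin(x2, imageWidth)}><loc_{ToBin(y2, imageHeight)}>";
    }
}
```

Under these assumptions, for a 640×480 image the box from (320, 120) to (640, 480) encodes as `<loc_500><loc_250><loc_999><loc_999>`; the right and bottom edges are clamped into the last bin.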

The result types

public class FlorenceResults
{
    public LabeledOCRBox[]        OCRBBox;       // OCR_WITH_REGION
    public string                 PureText;      // captions, plain OCR, region category / description / OCR
    public LabeledBoundingBoxes[] BoundingBoxes; // OD, dense region caption, region proposal, grounding, open-vocabulary detection
    public LabeledPolygon[]       Polygons;      // segmentation tasks
}

public class LabeledBoundingBoxes
{
    public BoundingBox<float>[] BBoxes;
    public string               Label;
}

public class LabeledOCRBox
{
    public Coordinates<float>[] QuadBox;  // four corner points
    public string               Text;
}

public class LabeledPolygon
{
    public string                   Label;
    public List<Coordinates<float>> Polygon;
    public List<BoundingBox<float>> BBoxes;
}

Coordinates are in the pixel space of the original image. The wrapper de-normalises Florence-2's <loc_…> tokens for you, so results need no further conversion.
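
To make "de-normalises" concrete, here is a minimal sketch of what such a conversion can look like. It is not the wrapper's actual code. It assumes 1000 bins per axis, tokens in x1, y1, x2, y2 order, and that a bin maps back to the pixel at its centre, `(bin + 0.5) * size / 1000`, as the upstream Florence-2 processor does; the wrapper may round differently. `LocDecoder` and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration, not the wrapper's implementation.
public static class LocDecoder
{
    // Assumption: 1000 bins per axis (0..999).
    private const int Bins = 1000;
    private static readonly Regex LocToken = new Regex(@"<loc_([0-9]+)>");

    // Extract the bin indices from a token string, in order of appearance.
    public static List<int> ParseBins(string tokens)
    {
        var bins = new List<int>();
        foreach (Match m in LocToken.Matches(tokens))
            bins.Add(int.Parse(m.Groups[1].Value));
        return bins;
    }

    // Map a bin index back to the pixel at the centre of that bin (assumed convention).
    public static float ToPixel(int bin, int size) => (bin + 0.5f) * size / Bins;

    // Decode exactly four tokens (x1, y1, x2, y2) into pixel coordinates.
    public static (float X1, float Y1, float X2, float Y2) ToPixelBox(
        string tokens, int imageWidth, int imageHeight)
    {
        List<int> b = ParseBins(tokens);
        if (b.Count != 4)
            throw new FormatException("Expected exactly four <loc_…> tokens.");
        return (ToPixel(b[0], imageWidth), ToPixel(b[1], imageHeight),
                ToPixel(b[2], imageWidth), ToPixel(b[3], imageHeight));
    }
}
```

Because each bin covers a small range of pixels, a round trip through the token space is only accurate to within about one bin width (0.64 px horizontally on a 640-pixel-wide image).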

Picking the right task

"What's in this picture?" → start with CAPTION, escalate to DETAILED_CAPTION or MORE_DETAILED_CAPTION.
"Read this receipt / sign / screenshot." → OCR for plain text, OCR_WITH_REGION if you also want the quad box around each text region.
"Where are the cars / people / cats?" → OPEN_VOCABULARY_DETECTION with a class prompt, or OD for generic detection.
"Highlight the part of the image that matches this sentence." → CAPTION_TO_PHRASE_GROUNDING.
"Describe the object inside this box." → REGION_TO_DESCRIPTION with a <loc_…> textInput.

For full, task-by-task code examples, see the Guides.

A note on `REFERRING_EXPRESSION_SEGMENTATION`

REFERRING_EXPRESSION_SEGMENTATION is enumerated for completeness but does not produce valid output in the current release: the generated <loc_…> token stream is not well-formed. As a workaround, use CAPTION_TO_PHRASE_GROUNDING to get a bounding box for the expression, then pass that box to REGION_TO_SEGMENTATION to get a polygon.