The Florence-2 model

Florence-2 is Microsoft Research's unified vision-language foundation model. A single model checkpoint handles captioning, OCR, object detection, dense region captioning, phrase grounding, and segmentation — tasks that traditionally required one specialised model each.

This page sketches what Florence-2 does, how it works in broad strokes, and which parts the C# wrapper, Florence2-Sharp, handles for you.

One model, many tasks

Florence-2 is a sequence-to-sequence transformer. The image is encoded into a sequence of visual tokens; the task is encoded as a short text prompt (<CAPTION>, <OCR>, <OD>, <CAPTION_TO_PHRASE_GROUNDING>, …); and the model generates text output that contains either:

Plain text (a caption or transcribed OCR), or
Special location tokens (<loc_500><loc_200>…) interleaved with text, which describe bounding boxes, polygons, or quad-boxes.
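
To make the location tokens concrete, here is a minimal sketch of turning a run of <loc_N> tokens into a pixel-space box. It assumes that each N is a bin index from 0 to 999 along the image width or height, that four consecutive tokens are x1, y1, x2, y2, and that a bin maps to its centre. The LocTokenSketch class and its method names are illustrative and not part of Florence2-Sharp; the library's own post-processor may differ in detail.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Illustrative helper, not part of Florence2-Sharp.
public static class LocTokenSketch
{
    // Assumption: coordinates are quantised into 1000 bins per axis.
    private const int Bins = 1000;

    private static readonly Regex LocToken = new Regex(@"<loc_([0-9]+)>");

    // Extracts the bin indices from text such as "car<loc_100><loc_200><loc_500><loc_800>".
    public static List<int> ExtractBins(string modelOutput)
    {
        var bins = new List<int>();
        foreach (Match m in LocToken.Matches(modelOutput))
        {
            bins.Add(int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture));
        }
        return bins;
    }

    // Maps one bin index to a pixel coordinate at the centre of that bin.
    public static double ToPixels(int bin, int imageSize)
    {
        return (bin + 0.5) * imageSize / Bins;
    }

    // Assumption: the first four indices are x1, y1, x2, y2.
    public static (double X1, double Y1, double X2, double Y2) ToBox(
        IReadOnlyList<int> bins, int imageWidth, int imageHeight)
    {
        if (bins.Count < 4)
            throw new ArgumentException("A box needs four location tokens.", nameof(bins));

        return (ToPixels(bins[0], imageWidth), ToPixels(bins[1], imageHeight),
                ToPixels(bins[2], imageWidth), ToPixels(bins[3], imageHeight));
    }
}
```

Calling LocTokenSketch.ExtractBins on the raw output and passing the result to ToBox with the original image size yields one box; polygons and quad-boxes would use longer runs of tokens in the same way.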

Florence2-Sharp handles both the prompt construction and the output parsing for you. You pass a TaskTypes value; the library translates it to the right prompt, runs the model, parses the output, and returns a typed FlorenceResults with the fields for that task populated.
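
The first of those steps, translating a task into its prompt, can be pictured as a simple lookup. The sketch below uses a stand-in enum (SketchTask) and only the four prompt tokens named earlier on this page. The real TaskTypes enum and the library's internal mapping are not shown here and may look different; appending the extra text input directly after the task token is also an assumption.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the library's TaskTypes enum; the member names are assumptions.
public enum SketchTask
{
    Caption,
    Ocr,
    ObjectDetection,
    CaptionToPhraseGrounding
}

// Illustrative helper, not part of Florence2-Sharp.
public static class PromptSketch
{
    private static readonly Dictionary<SketchTask, string> TaskTokens =
        new Dictionary<SketchTask, string>
        {
            { SketchTask.Caption, "<CAPTION>" },
            { SketchTask.Ocr, "<OCR>" },
            { SketchTask.ObjectDetection, "<OD>" },
            { SketchTask.CaptionToPhraseGrounding, "<CAPTION_TO_PHRASE_GROUNDING>" }
        };

    // Builds the text prompt: the task token, followed by any extra text input
    // (for example, the phrase to ground).
    public static string BuildPrompt(SketchTask task, string textInput = null)
    {
        if (!TaskTokens.TryGetValue(task, out var token))
            throw new ArgumentOutOfRangeException(nameof(task), task, "Unknown task.");

        return string.IsNullOrEmpty(textInput) ? token : token + textInput;
    }
}
```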

The pieces of the wrapper

                       ┌─────────────────────────────┐
                       │   Florence2Model            │
                       │                             │
   Image stream ─────► │  Image preprocessing        │
                       │  (CLIP image processor)     │
                       │                             │
   TaskTypes + text──► │  Prompt construction        │
                       │  Tokeniser                  │
                       │                             │
                       │  ONNX Runtime inference     │
                       │  (encoder + decoder)        │
                       │                             │
                       │  Logits processor / sampler │
                       │  Stopping criteria          │
                       │                             │
                       │  Post-processor             │ ──►  FlorenceResults
                       └─────────────────────────────┘

You only ever construct two things:

FlorenceModelDownloader — owns the location of the ONNX model files. Either downloads them from Hugging Face on first use, or wraps a directory you've already populated.
Florence2Model — holds the ONNX session and the tokeniser. Construct once per process and share across requests.

Everything else — image normalisation, tokenisation, beam search, location-token decoding — happens inside Florence2Model.Run.

What `Run` returns

Every task produces a FlorenceResults:

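// Result container returned by Run; which fields are set depends on the task.
public class FlorenceResults
{
    public LabeledOCRBox[]        OCRBBox;       // text regions, e.g. from OCR_WITH_REGION
    public string                 PureText;      // plain text, e.g. from CAPTION
    public LabeledBoundingBoxes[] BoundingBoxes; // labelled boxes, e.g. from OD
    public LabeledPolygon[]       Polygons;      // polygon outlines, e.g. from segmentation tasks
}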

Different tasks populate different fields — CAPTION fills PureText, OD fills BoundingBoxes, OCR_WITH_REGION fills OCRBBox, and so on. The complete mapping is in Supported tasks.
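
Because only some fields are set for a given task, calling code typically checks which ones carry data before reading them. The sketch below assumes that unused fields are left null, which this page does not state explicitly. It uses a stand-in ResultsSketch type with object[] arrays in place of the library's element types; ResultsInspector is likewise illustrative and not part of Florence2-Sharp.

```csharp
using System;
using System.Collections.Generic;

// Stand-in mirroring the shape of FlorenceResults; element types are simplified to object.
public class ResultsSketch
{
    public object[] OCRBBox;
    public string   PureText;
    public object[] BoundingBoxes;
    public object[] Polygons;
}

// Illustrative helper, not part of Florence2-Sharp.
public static class ResultsInspector
{
    // Returns the names of the fields that carry data, assuming unused fields are null.
    public static List<string> PopulatedFields(ResultsSketch r)
    {
        var names = new List<string>();
        if (r.PureText != null)      names.Add(nameof(r.PureText));
        if (r.OCRBBox != null)       names.Add(nameof(r.OCRBBox));
        if (r.BoundingBoxes != null) names.Add(nameof(r.BoundingBoxes));
        if (r.Polygons != null)      names.Add(nameof(r.Polygons));
        return names;
    }
}
```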

Florence-2-base, large, and large-ft

The Florence-2 checkpoints that Microsoft publishes on Hugging Face include:

Checkpoint	Parameters	Quality	Speed
Florence-2-base	~230M	Good	Fastest
Florence-2-large	~770M	Better	Slower
Florence-2-large-ft	~770M	Best (fine-tuned on downstream tasks)	Slower

FlorenceModelDownloader defaults to Florence-2-base. To use a larger checkpoint, download the ONNX files manually and point the downloader at the folder — see Managing the model cache.
