The Florence-2 model
Florence-2 is Microsoft Research's unified vision-language foundation model. A single model checkpoint handles captioning, OCR, object detection, dense region captioning, phrase grounding, and segmentation — tasks that traditionally required one specialised model each.
This page sketches what Florence-2 does, how it works in broad strokes, and which parts the C# wrapper actually handles for you.
One model, many tasks
Florence-2 is a sequence-to-sequence transformer. The image is encoded to a sequence of visual tokens; the task is encoded as a short text prompt (<CAPTION>, <OCR>, <OD>, <CAPTION_TO_PHRASE_GROUNDING>, …); and the model generates a text output that contains either:
- Plain text (a caption or transcribed OCR), or
- Special location tokens (<loc_500><loc_200>…) interleaved with text, which describe bounding boxes, polygons, or quad-boxes.
Florence2-Sharp handles all of that for you. You pass a TaskTypes value, the library translates it to the right prompt, runs the model, parses the output, and returns a typed FlorenceResults with the right fields populated for that task.
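To make the location-token format concrete, here is a minimal, self-contained sketch of how box output could be decoded into pixel coordinates. It is not the library's actual post-processor. The LocTokenSketch class and its Decode method are hypothetical names, and the sketch assumes that each axis is quantised into 1,000 bins (loc_0 to loc_999) and that a box is written as x1, y1, x2, y2. Polygons and quad-boxes are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Florence2-Sharp.
public static class LocTokenSketch
{
    // Assumption: coordinates are quantised into 1000 bins per axis.
    private const int Bins = 1000;

    // Optional label text followed by exactly four location tokens.
    private static readonly Regex BoxPattern = new Regex(
        @"([^<]*)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>");

    // Maps a bin index to the pixel coordinate at the centre of that bin.
    private static float ToPixels(string bin, int size) =>
        (int.Parse(bin) + 0.5f) / Bins * size;

    public static List<(string Label, float X1, float Y1, float X2, float Y2)> Decode(
        string output, int imageWidth, int imageHeight)
    {
        var boxes = new List<(string Label, float X1, float Y1, float X2, float Y2)>();
        string lastLabel = "";

        foreach (Match m in BoxPattern.Matches(output))
        {
            // A label may be followed by several boxes; reuse it when the text part is empty.
            string label = m.Groups[1].Value.Trim();
            if (label.Length == 0) label = lastLabel;
            lastLabel = label;

            boxes.Add((
                label,
                ToPixels(m.Groups[2].Value, imageWidth),
                ToPixels(m.Groups[3].Value, imageHeight),
                ToPixels(m.Groups[4].Value, imageWidth),
                ToPixels(m.Groups[5].Value, imageHeight)));
        }

        return boxes;
    }
}
```

Under these assumptions, loc_500 on a 1000-pixel-wide image lands at roughly pixel 500.5, the centre of its bin. In normal use you never need code like this, because the wrapper returns already-decoded coordinates.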
The pieces of the wrapper
                      ┌────────────────────────────┐
                      │ Florence2Model             │
                      │                            │
Image stream ───────► │ Image preprocessing        │
                      │ (CLIP image processor)     │
                      │                            │
TaskTypes + text ───► │ Prompt construction        │
                      │ Tokeniser                  │
                      │                            │
                      │ ONNX Runtime inference     │
                      │ (encoder + decoder)        │
                      │                            │
                      │ Logits processor / sampler │
                      │ Stopping criteria          │
                      │                            │
                      │ Post-processor             │ ──► FlorenceResults
                      └────────────────────────────┘
You only ever construct two things:
- FlorenceModelDownloader: owns the location of the ONNX model files. It either downloads them from Hugging Face on first use or wraps a directory you've already populated.
- Florence2Model: holds the ONNX session and the tokeniser. Construct it once per process and share it across requests.
Everything else — image normalisation, tokenisation, beam search, location-token decoding — happens inside Florence2Model.Run.
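As an illustration of the image-normalisation step, the sketch below converts interleaved RGB bytes into the planar float layout that vision models typically expect. It is not the wrapper's own code. NormaliseSketch and ToChw are hypothetical names, the mean and standard-deviation constants are assumed to be the usual ImageNet values, and resizing to the model's input resolution is omitted.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Florence2-Sharp.
public static class NormaliseSketch
{
    // Assumption: ImageNet channel statistics; the real preprocessor config may differ.
    private static readonly float[] Mean = { 0.485f, 0.456f, 0.406f };
    private static readonly float[] Std = { 0.229f, 0.224f, 0.225f };

    // Converts interleaved RGB bytes (HWC) into a normalised planar float array (CHW).
    public static float[] ToChw(byte[] rgb, int width, int height)
    {
        int plane = width * height;
        if (rgb.Length != plane * 3)
            throw new ArgumentException("Expected width * height * 3 bytes.", nameof(rgb));

        var chw = new float[3 * plane];
        for (int i = 0; i < plane; i++)
        {
            for (int c = 0; c < 3; c++)
            {
                // Rescale to [0, 1], then subtract the mean and divide by the standard deviation.
                chw[c * plane + i] = (rgb[i * 3 + c] / 255f - Mean[c]) / Std[c];
            }
        }
        return chw;
    }
}
```

The planar layout matters because ONNX vision models usually take a tensor shaped [batch, channels, height, width], whereas most image decoders produce interleaved pixels.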
What Run returns
Every task produces a FlorenceResults:
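public class FlorenceResults
{
    // Text regions with their transcriptions (filled by OCR_WITH_REGION).
    public LabeledOCRBox[] OCRBBox;
    // Plain-text output such as a caption (filled by CAPTION).
    public string PureText;
    // Labelled bounding boxes (filled by OD).
    public LabeledBoundingBoxes[] BoundingBoxes;
    // Labelled polygons, for tasks that return region outlines.
    public LabeledPolygon[] Polygons;
}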
Different tasks populate different fields — CAPTION fills PureText, OD fills BoundingBoxes, OCR_WITH_REGION fills OCRBBox, and so on. The complete mapping is in Supported tasks.
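Because only some fields are filled for a given task, calling code usually checks which one is present before reading it. The sketch below shows one way to do that. It assumes that unpopulated fields are left null, which this page does not confirm. The three element types are empty local stand-ins so the example compiles on its own, FlorenceResults is redeclared to mirror the class above, and ResultsSketch.Describe is a hypothetical helper.

```csharp
using System;

// Local stand-ins so the sketch compiles by itself; the real types come from Florence2-Sharp.
public class LabeledOCRBox { }
public class LabeledBoundingBoxes { }
public class LabeledPolygon { }

// Mirrors the class shown above.
public class FlorenceResults
{
    public LabeledOCRBox[] OCRBBox;
    public string PureText;
    public LabeledBoundingBoxes[] BoundingBoxes;
    public LabeledPolygon[] Polygons;
}

// Hypothetical helper for illustration only.
public static class ResultsSketch
{
    // Assumption: fields that a task does not populate are null.
    public static string Describe(FlorenceResults r)
    {
        if (r.PureText != null) return $"text: {r.PureText}";
        if (r.BoundingBoxes != null) return $"{r.BoundingBoxes.Length} bounding-box group(s)";
        if (r.OCRBBox != null) return $"{r.OCRBBox.Length} OCR region(s)";
        if (r.Polygons != null) return $"{r.Polygons.Length} polygon group(s)";
        return "empty result";
    }
}
```

If you already know which task you ran, you can skip the checks and read the matching field directly.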
Florence-2-base, large, and large-ft
Microsoft publishes Florence-2 on Hugging Face in two sizes, base and large, plus fine-tuned (-ft) variants. The three checkpoints covered here are:
| Checkpoint | Parameters | Quality | Speed |
|---|---|---|---|
| Florence-2-base | ~230M | Good | Fastest |
| Florence-2-large | ~770M | Better | Slower |
| Florence-2-large-ft | ~770M | Best (fine-tuned on downstream tasks) | Slower |
FlorenceModelDownloader defaults to Florence-2-base. To use a larger checkpoint, download the ONNX files manually and point the downloader at the folder — see Managing the model cache.
Further reading
- Florence-2 paper — "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks".
- Florence-2 on Hugging Face — model card and sample Python code.
- ONNX Runtime — the inference engine this library uses.