Florence2-Sharp

Captioning

Florence-2 generates captions of three lengths from the same image. The three tasks share an API but differ in verbosity; pick the one that matches the downstream consumer.

The three caption tasks

TaskTypes value       | Typical length         | Use for
CAPTION               | One short sentence     | Alt text, thumbnail tooltips, terse summaries
DETAILED_CAPTION      | Two or three sentences | Search snippets, list views, RAG context
MORE_DETAILED_CAPTION | A paragraph            | Accessibility narration, indexing for full-text search

All three return their result in FlorenceResults.PureText.

Example: detailed caption

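using System;
using System.IO;
using Florence2;

// Fetch the model files into ./models.
var modelSource = new FlorenceModelDownloader("./models");
await modelSource.DownloadModelsAsync();

// Load the model from the downloaded files.
var model = new Florence2Model(modelSource);

// Run the detailed-caption task on a single image stream.
using var image = File.OpenRead("street.jpg");
var results = model.Run(TaskTypes.DETAILED_CAPTION, image);

// Caption tasks return their text in PureText.
Console.WriteLine(results.PureText);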
// → "A narrow European street lined with stone buildings at sunset. A red bicycle
//    leans against the wall on the left, and two figures walk away from the camera."

Comparing the three lengths

A quick A/B/C on the same image:

// Reuses the `model` instance created in the previous example.
using var image = File.OpenRead("street.jpg");

foreach (var task in new[]
{
    TaskTypes.CAPTION,
    TaskTypes.DETAILED_CAPTION,
    TaskTypes.MORE_DETAILED_CAPTION,
})
{
    image.Position = 0;     // rewind for each run
    Console.WriteLine($"--- {task} ---");
    Console.WriteLine(model.Run(task, image).PureText);
    Console.WriteLine();
}

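The stream must be rewound between calls because Florence2-Sharp reads the image fully each time, which leaves the stream positioned at its end. Without the rewind, the second and third runs would start reading at the end of the stream.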
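Setting Position only works on seekable streams such as the FileStream returned by File.OpenRead. As a minimal sketch, assuming the image may also arrive as a non-seekable stream (for example a network response body), the helper below buffers it into memory first. ToRewindable is a hypothetical name for illustration and is not part of Florence2-Sharp; it uses only System.IO.

```csharp
using System;
using System.IO;

// Hypothetical helper: returns a seekable stream positioned at 0.
// Seekable inputs are rewound in place; non-seekable inputs are copied
// into a MemoryStream so they can be rewound between runs.
static Stream ToRewindable(Stream source)
{
    if (source.CanSeek)
    {
        source.Position = 0;
        return source;
    }

    var buffer = new MemoryStream();
    source.CopyTo(buffer);
    buffer.Position = 0;
    return buffer;
}

// Demonstration with an in-memory stream that has already been partly read.
var data = new MemoryStream(new byte[] { 1, 2, 3 });
data.ReadByte();                      // Position is now 1
using var ready = ToRewindable(data);
Console.WriteLine(ready.Position);    // 0
```

The returned stream can then be passed to each run and rewound with Position = 0 exactly as in the loop above. Buffering keeps the whole image in memory, which is usually acceptable for single images but worth reconsidering for very large files.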

Picking a length

  • Short images, terse contexts: CAPTION. Cheaper to compute, smaller to store, easy to translate.
  • Search and RAG: DETAILED_CAPTION. Captures enough specifics to be useful as a retrieval signal without producing noise.
  • Accessibility, archive cataloguing: MORE_DETAILED_CAPTION. The model commits to specific attributes (colour, count, posture), which is useful for human consumers but sometimes too speculative for downstream NLP.
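If the choice has to be made in code, one option is to derive it from how much text the consumer can accept. The sketch below is an illustration only: PickCaptionTask and its sentence-budget thresholds are hypothetical, taken from the typical lengths listed earlier on this page, and it assumes TaskTypes comes from the Florence2 namespace as in the examples above.

```csharp
using System;
using Florence2;

// Hypothetical helper: map a rough sentence budget to a caption task.
// Thresholds follow the "typical length" of each task, not a library rule.
static TaskTypes PickCaptionTask(int sentenceBudget) => sentenceBudget switch
{
    <= 1 => TaskTypes.CAPTION,               // one short sentence
    <= 3 => TaskTypes.DETAILED_CAPTION,      // two or three sentences
    _    => TaskTypes.MORE_DETAILED_CAPTION, // a paragraph
};

Console.WriteLine(PickCaptionTask(1));   // CAPTION
Console.WriteLine(PickCaptionTask(3));   // DETAILED_CAPTION
Console.WriteLine(PickCaptionTask(10));  // MORE_DETAILED_CAPTION
```

A budget is only a proxy. For accessibility or cataloguing you may still prefer to choose the task explicitly by use case.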

A note on hallucinations

Florence-2 is a generative model. The longer caption tasks are more prone to confident-sounding hallucinations: colours, counts, and named entities the image doesn't actually show. If your downstream code treats these captions as a structured signal (e.g. "is there a person in this image?"), validate with the OD (object detection) or OPEN_VOCABULARY_DETECTION task instead of trusting the prose.
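As a rough sketch of that validation step, the check below answers a yes/no question from a list of detected labels rather than from caption text. How the labels are read out of a detection result is not covered on this page, so the example starts from a plain list of strings; IsDetected and the sample labels are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: true when the wanted label appears among the labels
// produced by a detection run. Comparison ignores case and outer whitespace.
static bool IsDetected(IEnumerable<string> detectedLabels, string wanted) =>
    detectedLabels.Any(label =>
        string.Equals(label.Trim(), wanted, StringComparison.OrdinalIgnoreCase));

// Sample labels standing in for the output of an OD run (made up for illustration).
var labels = new[] { "bicycle", "Person", "building" };

Console.WriteLine(IsDetected(labels, "person")); // True
Console.WriteLine(IsDetected(labels, "dog"));    // False
```

Exact matching is deliberately strict. Synonyms such as "man" versus "person" would need a mapping of your own.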

Where next?

  • OCR: read text from an image, plain or with regions.
  • Object detection: get bounding boxes and labels for objects in the image.

© 2026 Florence2-Sharp. All rights reserved.