Captioning
Florence-2 generates three lengths of caption from the same image. They share an API but differ in verbosity — pick the one that matches the downstream consumer.
The three caption tasks
| TaskTypes value | Typical length | Use for |
|---|---|---|
| CAPTION | One short sentence. | Alt text, thumbnail tooltips, terse summaries. |
| DETAILED_CAPTION | Two or three sentences. | Search snippets, list views, RAG context. |
| MORE_DETAILED_CAPTION | A paragraph. | Accessibility narration, indexing for full-text search. |
All three return their result in FlorenceResults.PureText.
Example: detailed caption
using System;
using System.IO;
using Florence2;
var modelSource = new FlorenceModelDownloader("./models");
await modelSource.DownloadModelsAsync();
var model = new Florence2Model(modelSource);
using var image = File.OpenRead("street.jpg");
var results = model.Run(TaskTypes.DETAILED_CAPTION, image);
Console.WriteLine(results.PureText);
// → "A narrow European street lined with stone buildings at sunset. A red bicycle
// leans against the wall on the left, and two figures walk away from the camera."
Comparing the three lengths
A quick A/B/C on the same image:
using var image = File.OpenRead("street.jpg");
foreach (var task in new[]
{
    TaskTypes.CAPTION,
    TaskTypes.DETAILED_CAPTION,
    TaskTypes.MORE_DETAILED_CAPTION,
})
{
    image.Position = 0; // rewind for each run
    Console.WriteLine($"--- {task} ---");
    Console.WriteLine(model.Run(task, image).PureText);
    Console.WriteLine();
}
The stream needs to be rewound between calls because Florence-2-Sharp reads the image fully each time.
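Setting Position only works on seekable streams such as the FileStream used above. If the image arrives from a network response or another forward-only source, one option is to buffer it in memory first. The following is a minimal sketch that uses only the .NET base class library; the ImageBuffering class and ToSeekable method are hypothetical names, not part of Florence-2-Sharp.

```csharp
using System;
using System.IO;

public static class ImageBuffering
{
    // Hypothetical helper: copies any readable stream into memory so the
    // caller can rewind it before each captioning run.
    public static MemoryStream ToSeekable(Stream source)
    {
        var buffer = new MemoryStream();
        source.CopyTo(buffer);   // reads from the source's current position to its end
        buffer.Position = 0;     // ready for the first read
        return buffer;
    }
}
```

With this in place, `using var image = ImageBuffering.ToSeekable(responseStream);` could replace the File.OpenRead call, and the loop above would stay unchanged. The trade-off is that the whole image is held in memory for as long as the buffer lives.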
Picking a length
- Simple images, terse contexts: CAPTION. Cheaper to compute, smaller to store, easy to translate.
- Search and RAG: DETAILED_CAPTION. Captures enough specifics to be useful as a retrieval signal without producing noise.
- Accessibility, archive cataloguing: MORE_DETAILED_CAPTION. The model commits to specific attributes (colour, count, posture), which is useful for human consumers but sometimes too speculative for downstream NLP.
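If several parts of an application request captions, the guidance above can be kept in one place instead of being repeated at each call site. This sketch is only illustrative: the CaptionConsumer enum and CaptionPolicy class are hypothetical and not part of Florence-2-Sharp, and the method returns the task name as a string so that the block compiles without the library.

```csharp
using System;

// Hypothetical consumer categories, chosen to mirror the list above.
public enum CaptionConsumer
{
    AltText,
    SearchSnippet,
    RagContext,
    AccessibilityNarration,
    ArchiveCatalogue,
}

public static class CaptionPolicy
{
    // Returns the name of the TaskTypes member suggested for each consumer.
    public static string TaskNameFor(CaptionConsumer consumer) => consumer switch
    {
        CaptionConsumer.AltText => "CAPTION",
        CaptionConsumer.SearchSnippet or CaptionConsumer.RagContext => "DETAILED_CAPTION",
        CaptionConsumer.AccessibilityNarration or CaptionConsumer.ArchiveCatalogue => "MORE_DETAILED_CAPTION",
        _ => throw new ArgumentOutOfRangeException(nameof(consumer)),
    };
}
```

Assuming TaskTypes is an ordinary enum, as its use in the examples above suggests, the returned name could be converted with `Enum.Parse<TaskTypes>(CaptionPolicy.TaskNameFor(consumer))` before calling model.Run.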
A note on hallucinations
Florence-2 is a generative model. The longer caption tasks are more prone to confident-sounding hallucinations: colours, counts, and named entities the image doesn't actually show. If your downstream code uses these captions as a structured signal (e.g. "is there a person in this image?"), validate with OD or OPEN_VOCABULARY_DETECTION instead of trusting the prose.
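One way to apply that cross-check is to compare the terms a caption mentions against the labels a detection run reported. The sketch below is deliberately library-agnostic: it takes the caption text and a plain list of label strings, and it leaves open how those labels are read from the OD result, since that shape is not described here. CaptionCheck and UnconfirmedTerms are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CaptionCheck
{
    // Returns the terms of interest that the caption mentions as whole words
    // but that do not appear among the detector's labels.
    public static IReadOnlyList<string> UnconfirmedTerms(
        string caption,
        IEnumerable<string> detectedLabels,
        IEnumerable<string> termsOfInterest)
    {
        var detected = new HashSet<string>(detectedLabels, StringComparer.OrdinalIgnoreCase);
        return termsOfInterest
            .Where(term => Regex.IsMatch(caption, $@"\b{Regex.Escape(term)}\b", RegexOptions.IgnoreCase))
            .Where(term => !detected.Contains(term))
            .ToList();
    }
}
```

For the caption "A red bicycle leans against the wall, and a person walks away.", detected labels of only "bicycle", and terms of interest "person", "bicycle", and "dog", the method returns just "person". The matching is literal, so plurals and synonyms (a caption that says "figures" rather than "person") would need extra handling before relying on it.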