Captioning

Florence-2 generates three lengths of caption from the same image. They share an API but differ in verbosity — pick the one that matches the downstream consumer.

The three caption tasks

`TaskTypes` value	Typical length	Use for
`CAPTION`	One short sentence.	Alt text, thumbnail tooltips, terse summaries.
`DETAILED_CAPTION`	Two or three sentences.	Search snippets, list views, RAG context.
`MORE_DETAILED_CAPTION`	A paragraph.	Accessibility narration, indexing for full-text search.

All three return their result in `FlorenceResults.PureText`.

Example: detailed caption

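using System;
using System.IO;
using Florence2;

// Fetch the model files into ./models.
var modelSource = new FlorenceModelDownloader("./models");
await modelSource.DownloadModelsAsync();

// Load the model from the downloaded files.
var model = new Florence2Model(modelSource);

// Run the detailed-caption task on a single image.
using var image = File.OpenRead("street.jpg");
var results = model.Run(TaskTypes.DETAILED_CAPTION, image);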

Console.WriteLine(results.PureText);
// → "A narrow European street lined with stone buildings at sunset. A red bicycle
//    leans against the wall on the left, and two figures walk away from the camera."

Comparing the three lengths

A quick A/B/C on the same image:

// Reuses `model` from the previous example.
using var image = File.OpenRead("street.jpg");

foreach (var task in new[]
{
    TaskTypes.CAPTION,
    TaskTypes.DETAILED_CAPTION,
    TaskTypes.MORE_DETAILED_CAPTION,
})
{
    image.Position = 0;     // rewind for each run
    Console.WriteLine($"--- {task} ---");
    Console.WriteLine(model.Run(task, image).PureText);
    Console.WriteLine();
}

The stream must be rewound between calls because Florence-2-Sharp reads the image stream to the end on every run; without the rewind, the next call would start at end-of-stream.
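Setting `Position` only works on seekable streams. For a source that cannot seek (an HTTP response body, for example), one option is to copy it into a `MemoryStream` once and rewind that copy between runs. The sketch below uses only `System.IO`; `StreamBuffering` and `ToSeekable` are hypothetical helper names for illustration, not part of Florence-2-Sharp.

```csharp
using System.IO;

public static class StreamBuffering
{
    // Hypothetical helper (not a library API): returns a seekable stream
    // positioned at 0. A seekable input is rewound and returned as-is;
    // anything else is copied fully into memory so it can be rewound later.
    public static Stream ToSeekable(Stream source)
    {
        if (source.CanSeek)
        {
            source.Position = 0;
            return source;
        }

        var buffer = new MemoryStream();
        source.CopyTo(buffer);   // reads the source to its end
        buffer.Position = 0;     // rewind the in-memory copy
        return buffer;
    }
}
```

The returned stream can stand in for the `File.OpenRead` result in the loop above, and `image.Position = 0` then works regardless of where the bytes came from. The cost is that the whole image is held in memory for the duration.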

Picking a length

Small images, terse contexts: `CAPTION`. Cheaper to compute, smaller to store, easy to translate.
Search and RAG: `DETAILED_CAPTION`. Captures enough specifics to serve as a retrieval signal without adding noise.
Accessibility, archive cataloguing: `MORE_DETAILED_CAPTION`. The model commits to specific attributes (colour, count, posture), which helps human readers but can be too speculative for downstream NLP.

A note on hallucinations

Florence-2 is a generative model, and the longer caption tasks are more prone to confident-sounding hallucinations: colours, counts, and named entities that the image doesn't actually show. If downstream code treats these captions as a structured signal (e.g. "is there a person in this image?"), validate with `OD` or `OPEN_VOCABULARY_DETECTION` instead of trusting the prose.
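One way to apply that advice is to flag caption terms that the detector did not confirm. This page does not cover how labels are read from a detection result, so the sketch below takes them as a plain list of strings; `CaptionChecks`, `UnsupportedTerms`, and the `watchedTerms` parameter are hypothetical names for illustration. Matching is deliberately naive (case-insensitive whole words, no stemming or synonyms), so treat it as a starting point.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CaptionChecks
{
    // Hypothetical helper: returns the watched terms that appear in the
    // caption but have no matching detector label.
    // Note: "people" will not match a "person" label; no stemming is done.
    public static IReadOnlyList<string> UnsupportedTerms(
        string caption,
        IEnumerable<string> detectedLabels,
        IEnumerable<string> watchedTerms)
    {
        var separators = new[] { ' ', ',', '.', ';', ':', '!', '?', '"', '\n', '\r', '\t' };

        // Words of the caption, compared case-insensitively.
        var captionWords = new HashSet<string>(
            caption.Split(separators, StringSplitOptions.RemoveEmptyEntries),
            StringComparer.OrdinalIgnoreCase);

        // Labels reported by the detection task, supplied by the caller.
        var labels = new HashSet<string>(detectedLabels, StringComparer.OrdinalIgnoreCase);

        return watchedTerms
            .Where(term => captionWords.Contains(term) && !labels.Contains(term))
            .ToList();
    }
}
```

For the caption "A red bicycle leans against the wall." with detected labels `wall` and watched terms `bicycle` and `person`, the result contains only `bicycle`: the caption mentions it, but the detector did not report it. `person` is not flagged because the caption never claims one.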