Quick Start

Five steps from dotnet add to a generated image caption: download the model, build a Florence2Model, point it at an image, choose a task, and read the result.

If you haven't installed the package yet, see Installation:

dotnet add package Florence2

1. Download the model

FlorenceModelDownloader fetches the Florence-2-base ONNX checkpoints from Hugging Face on first use and caches them on disk. Subsequent runs reuse the cache.

using Florence2;

var modelSource = new FlorenceModelDownloader("./models");
await modelSource.DownloadModelsAsync();

You can also point the downloader at a pre-populated directory containing Florence-2-base ONNX models — see Managing the model cache.

2. Build the model

var model = new Florence2Model(modelSource);

Florence2Model is thread-safe for reads — share a single instance across requests. Allocate one per process, not one per call.

3. Load an image

The library accepts any seekable Stream containing a common raster format (PNG, JPEG, BMP, …). For an on-disk file:

using var image = File.OpenRead("car.jpg");

For an in-memory image, wrap your byte[] in a MemoryStream.

4. Pick a task and run inference

Run takes a TaskTypes value, the image stream, and — for grounding tasks only — a phrase. It returns a typed FlorenceResults you can read fields off directly.

var results = model.Run(TaskTypes.DETAILED_CAPTION, image);

Console.WriteLine(results.PureText);
// → "A red sedan parked on a cobblestone street under late-afternoon light."

For tasks that produce structured output — OCR, object detection, grounding — read the appropriate field instead of PureText. See Supported tasks for the full mapping.

5. Try a few tasks side by side

app/Program.cs

using Florence2;
using System.Text.Json;

var modelSource = new FlorenceModelDownloader("./models");
await modelSource.DownloadModelsAsync();

var model = new Florence2Model(modelSource);

foreach (var task in new[]
{
    TaskTypes.CAPTION,
    TaskTypes.DETAILED_CAPTION,
    TaskTypes.OCR,
    TaskTypes.OD,
})
{
    using var image = File.OpenRead("car.jpg");
    var results = model.Run(task, image);

    Console.WriteLine($"--- {task} ---");
    Console.WriteLine(JsonSerializer.Serialize(results, new JsonSerializerOptions { WriteIndented = true }));
}

Run that against any of your own images and Florence-2 will produce a caption, transcribed text, and bounding boxes — all from the same model.

Where next?

Core Concepts

The Florence-2 model itself, the supported task family, and how prompts map to outputs.

Guides

Captioning, OCR, object detection, and phrase grounding — task by task with full code.

Advanced

Managing the model cache, GPU execution providers, and performance tuning.