Performance

Florence-2 is a sequence-to-sequence transformer. Latency is dominated by the decoder's autoregressive token loop — the longer the output, the slower the call. This page covers the levers that matter most.

Pick the right execution provider

Florence2.dll depends on ONNX Runtime, which selects an execution provider at session creation. The default is CPU; for production workloads, GPU is usually worth the switch.

NuGet package	Provider	Where it runs
`Microsoft.ML.OnnxRuntime` (default)	CPU	Anywhere.
`Microsoft.ML.OnnxRuntime.Gpu`	CUDA	NVIDIA GPUs on Windows / Linux.
`Microsoft.ML.OnnxRuntime.DirectML`	DirectML	Any DirectX 12 GPU on Windows.
`Microsoft.ML.OnnxRuntime.CoreML`	CoreML	Apple Silicon.

Install the GPU package alongside Florence2:

dotnet add package Florence2
dotnet add package Microsoft.ML.OnnxRuntime.Gpu

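Once the GPU package is referenced, no code changes are needed in Florence2-Sharp. If latency does not improve, check that the GPU provider actually loaded: ONNX Runtime runs on the CPU provider whenever no other provider is registered for the session.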

Typical latency for DETAILED_CAPTION against a 768×768 image (rough order of magnitude):

CPU (modern Xeon / Ryzen) — 1–3 seconds
CUDA (mid-range NVIDIA) — 100–300 ms
DirectML (recent integrated GPU) — 200–500 ms

Treat those as ballpark figures — concrete numbers depend on hardware, batch size, and which task you're running.

Re-use `Florence2Model`

The constructor loads the ONNX session and tokeniser — both expensive (hundreds of milliseconds, sometimes seconds with GPU init). Construct once per process and share across requests.

app/DependencyInjection.cs

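services.AddSingleton<Florence2Model>(sp =>
{
    // Fetch the model files into ./models (if needed) before building the session.
    var downloader = new FlorenceModelDownloader("./models");
    // The singleton factory runs once per process, so this blocking wait is paid a single time.
    downloader.DownloadModelsAsync().GetAwaiter().GetResult();
    return new Florence2Model(downloader);
});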

For concurrent inference, the model is safe to share across threads. Concurrent calls on a single session share that session's intra-op thread pool, so on CPU they compete for the same cores instead of multiplying throughput, but they complete correctly without crashing.

For real parallelism on multi-GPU boxes or multi-core CPU servers, allocate multiple sessions and route requests to them via a worker pool.
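One way to sketch the worker pool is a small fixed-size pool that lends each instance to one caller at a time. `InstancePool<T>` and `RunAsync` below are hypothetical names, not part of Florence2-Sharp; the sketch assumes each pooled `Florence2Model` was constructed separately and therefore owns its own session.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: a fixed-size pool that lends out one instance per caller.
public sealed class InstancePool<T> : IDisposable where T : class
{
    private readonly ConcurrentQueue<T> _idle = new ConcurrentQueue<T>();
    private readonly SemaphoreSlim _available;

    public InstancePool(IEnumerable<T> instances)
    {
        foreach (T instance in instances) _idle.Enqueue(instance);
        if (_idle.IsEmpty)
            throw new ArgumentException("Pool needs at least one instance.", nameof(instances));
        _available = new SemaphoreSlim(_idle.Count);
    }

    // Waits for an idle instance, runs the work on a thread-pool thread,
    // then hands the instance back to the pool.
    public async Task<TResult> RunAsync<TResult>(Func<T, TResult> work, CancellationToken ct = default)
    {
        await _available.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            // The semaphore count never exceeds the number of idle instances,
            // so a failed dequeue would indicate a bug in the pool itself.
            if (!_idle.TryDequeue(out T instance))
                throw new InvalidOperationException("No idle instance despite semaphore grant.");
            try
            {
                return await Task.Run(() => work(instance), ct).ConfigureAwait(false);
            }
            finally
            {
                // Return the instance before releasing the semaphore slot.
                _idle.Enqueue(instance);
            }
        }
        finally
        {
            _available.Release();
        }
    }

    // Disposes only the semaphore; pooled instances stay owned by the caller.
    public void Dispose() => _available.Dispose();
}
```

With a pool of two models, a call such as `pool.RunAsync(m => m.Run(task, stream))` would run at most two inferences at once and queue the rest. Pool size is a tuning knob: one instance per GPU, or a handful per CPU socket, is a reasonable starting point to measure from.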

Pick the smallest task that answers your question

The autoregressive decoder runs once per output token. Shorter outputs → faster calls.

Task	Typical output length	Relative latency
`CAPTION`	~10–20 tokens	1×
`DETAILED_CAPTION`	~30–50 tokens	~2–3×
`MORE_DETAILED_CAPTION`	~100–200 tokens	~5–10×
`OD`	depends on object count	2–5×
`OCR`	depends on text density	2–10×

If you don't need the long-form caption, don't ask for it.

Image sizing

Florence-2's image preprocessor resizes everything to 768×768 — the model's native input size. Sending in a 4K image doesn't improve quality and adds JPEG decode + resize cost.

For high-throughput pipelines, pre-resize images to 768 px on the long edge before handing them to the library. The wrapper accepts any common raster format via Stream — pre-encode once, pass the stream.
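The target dimensions for that pre-resize are simple arithmetic. The sketch below computes them while preserving aspect ratio; `ImageSizing` and `FitLongEdge` are hypothetical names, and the actual pixel resize is left to whichever imaging library the pipeline already uses.

```csharp
using System;

// Hypothetical helper: computes output dimensions for a long-edge resize.
public static class ImageSizing
{
    public const int TargetLongEdge = 768;

    // Scales (width, height) so the longer side equals longEdge, keeping the
    // aspect ratio. Images already within the limit are returned unchanged.
    public static (int Width, int Height) FitLongEdge(int width, int height, int longEdge = TargetLongEdge)
    {
        if (width <= 0 || height <= 0 || longEdge <= 0)
            throw new ArgumentException("Dimensions must be positive.");

        int longest = Math.Max(width, height);
        if (longest <= longEdge) return (width, height);

        double scale = (double)longEdge / longest;
        // Clamp to 1 so extreme aspect ratios never produce a zero-sized side.
        int w = Math.Max(1, (int)Math.Round(width * scale));
        int h = Math.Max(1, (int)Math.Round(height * scale));
        return (w, h);
    }
}
```

For a 3840×2160 frame this yields 768×432, roughly 4% of the original pixel count, which is where the decode and resize savings come from.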

Reuse the input stream

Florence2Model.Run reads the entire image stream for each call. If you call it multiple times for the same image (a common pattern when running several tasks side by side), you can:

Rewind the stream between calls (stream.Position = 0), or
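Keep the image bytes in memory and wrap them in a fresh MemoryStream for each call. The wrapper does not copy the array, so it costs almost nothing.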

byte[] imageBytes = File.ReadAllBytes("photo.jpg");

foreach (var task in tasks)
{
    using var ms = new MemoryStream(imageBytes, writable: false);
    var result = model.Run(task, ms);
    // ...
}
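Rewinding only works on seekable streams; a network or decompression stream typically is not. A minimal sketch of a guard for that case follows. `StreamReuse.ToRewindable` is a hypothetical helper, not part of the library.

```csharp
using System.IO;

// Hypothetical helper: makes any input stream safe to read more than once.
public static class StreamReuse
{
    // Seekable streams are rewound and returned as-is.
    // Non-seekable streams are copied into memory once and the copy is returned.
    public static Stream ToRewindable(Stream input)
    {
        if (input.CanSeek)
        {
            input.Position = 0;
            return input;
        }

        var buffer = new MemoryStream();
        input.CopyTo(buffer);
        buffer.Position = 0;
        return buffer;
    }
}
```

Call it once before the task loop, then set `Position = 0` on the returned stream between calls. The caller remains responsible for disposing both the original stream and the returned one when they differ.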

Common pitfalls

ONNX Runtime versions matter

ONNX Runtime moves fast, and each Microsoft.ML.OnnxRuntime.Gpu release is built against specific CUDA and cuDNN versions. Pin the package version to match the CUDA toolkit and driver installed on the host; a mismatch fails at session creation.

Warm the model on startup

The first call after new Florence2Model(...) pays for JIT compilation, kernel selection, and GPU memory allocation. For latency-sensitive services, run a throwaway inference on a placeholder image during startup so the first real request doesn't pay the warm-up cost.

Measuring before tuning

Always measure before guessing. A quick harness:

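// Untimed warm-up call so first-call costs do not skew the average.
using (var warm = File.OpenRead("photo.jpg"))
    model.Run(TaskTypes.CAPTION, warm);

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 10; i++)
{
    using var img = File.OpenRead("photo.jpg");
    model.Run(TaskTypes.CAPTION, img);
}
sw.Stop();
Console.WriteLine($"Avg: {sw.ElapsedMilliseconds / 10.0:F0} ms/call");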

Different tasks, image sizes, and providers all need their own numbers — there's no single "Florence2 is X ms" answer.
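An average hides outliers, which matters for latency-sensitive services. The sketch below records per-call timings and reports percentiles instead; `LatencyBench`, `Measure`, and `Percentile` are hypothetical names, and the inference call is passed in as a delegate so the helper makes no assumption about the Florence2 API.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: per-call latency samples plus nearest-rank percentiles.
public static class LatencyBench
{
    // Runs action `warmup` times untimed, then `iterations` times timed.
    // Returns per-call latencies in milliseconds, sorted ascending.
    public static double[] Measure(Action action, int iterations = 20, int warmup = 2)
    {
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        for (int i = 0; i < warmup; i++) action();

        var samples = new double[iterations];
        for (int i = 0; i < iterations; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalMilliseconds;
        }

        Array.Sort(samples);
        return samples;
    }

    // Nearest-rank percentile over an ascending-sorted array; p is in (0, 100].
    public static double Percentile(double[] sorted, double p)
    {
        if (sorted.Length == 0) throw new ArgumentException("No samples.", nameof(sorted));
        int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }
}
```

Wrap the `model.Run(...)` call from the harness above in the delegate, then compare `Percentile(samples, 50)` with `Percentile(samples, 95)`: a wide gap between the two usually points at contention or garbage collection, not at the model itself.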