OCR

Florence-2 reads text from images in two modes — bulk plain-text extraction and per-region OCR with quad-box coordinates. Pick the one that matches how you'll consume the result downstream.

Plain OCR

TaskTypes.OCR returns every recognised word as a single concatenated string in FlorenceResults.PureText. There is no per-region structure — useful for documents or screenshots when you just want the text content.

using var image = File.OpenRead("receipt.jpg");
var results = model.Run(TaskTypes.OCR, image);

Console.WriteLine(results.PureText);
// → "GREENGROCER 12 OAK LANE total: 14.32"

OCR with regions

TaskTypes.OCR_WITH_REGION returns the same text plus a quad-box (four corner points) around each recognised region — useful for highlighting, click-targeting, or downstream layout-aware parsing.

using var image = File.OpenRead("receipt.jpg");
var results = model.Run(TaskTypes.OCR_WITH_REGION, image);

foreach (var box in results.OCRBBox)
{
    Console.WriteLine($"\"{box.Text}\"");
    foreach (var corner in box.QuadBox)
        Console.WriteLine($"    ({corner.x:F0}, {corner.y:F0})");
}

OCRBBox is an array of LabeledOCRBox. Each element has:

Text — the recognised text in that region.
QuadBox — four Coordinates<float> points, clockwise from the top-left, in pixel coordinates of the original image.

The quad-box is a quadrilateral, not an axis-aligned rectangle — it follows rotated or skewed text.

OCR a specific region only

If you already know where the text is (from a manual annotation, an object detection step, or a UI hit-test), use TaskTypes.REGION_TO_OCR to read text from just that area. Pass the region in textInput using <loc_…> tokens.

// Region encoded as <loc_x1><loc_y1><loc_x2><loc_y2> in 0..999 space
string region = "<loc_120><loc_400><loc_780><loc_460>";

using var image = File.OpenRead("receipt.jpg");
var results = model.Run(TaskTypes.REGION_TO_OCR, image, textInput: region);

Console.WriteLine(results.PureText);

Florence-2 quantises image coordinates into a 1000-bin grid (0–999), so you need to map pixel coordinates by the image dimensions before formatting the <loc_…> tokens. The library's BoxQuantizer exposes the quantisation; emit the tokens directly with string formatting once you have the bin indices.

Tips

Pre-rotate scanned documents. Florence-2 handles small skew well, but heavily rotated text (90° or more) confuses it. Detect orientation first if you can.
Resolution matters. Florence-2-base preprocesses images to 768×768. Tiny text becomes unreadable after the resize. For dense documents, consider tiling and running OCR per tile.
Numbers and codes can drift. Generative OCR sometimes "corrects" unfamiliar product codes or numbers towards more common patterns. For structured fields (SKUs, invoice IDs), validate with a regex or a check-digit pass.

OCR

Plain OCR

OCR with regions

OCR a specific region only

Tips

Where next?

Object detection

Phrase grounding