Florence2-Sharp

OCR

Florence-2 reads text from images in three modes: bulk plain-text extraction, per-region OCR with quad-box coordinates, and OCR restricted to a region you supply. Pick the one that matches how you'll consume the result downstream.

Plain OCR

TaskTypes.OCR returns every recognised word as a single concatenated string in FlorenceResults.PureText. There is no per-region structure, so it suits documents or screenshots where you only need the text content.

using var image = File.OpenRead("receipt.jpg");
var results = model.Run(TaskTypes.OCR, image);

Console.WriteLine(results.PureText);
// → "GREENGROCER 12 OAK LANE total: 14.32"

OCR with regions

TaskTypes.OCR_WITH_REGION returns the same text plus a quad-box (four corner points) around each recognised region — useful for highlighting, click-targeting, or downstream layout-aware parsing.

using var image = File.OpenRead("receipt.jpg");
var results = model.Run(TaskTypes.OCR_WITH_REGION, image);

foreach (var box in results.OCRBBox)
{
    Console.WriteLine($"\"{box.Text}\"");
    foreach (var corner in box.QuadBox)
        Console.WriteLine($"    ({corner.x:F0}, {corner.y:F0})");
}

OCRBBox is an array of LabeledOCRBox. Each element has:

  • Text — the recognised text in that region.
  • QuadBox — four Coordinates<float> points, clockwise from the top-left, in pixel coordinates of the original image.

The quad-box is a quadrilateral, not an axis-aligned rectangle — it follows rotated or skewed text.
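
Some consumers (UI highlight rectangles, image cropping) only accept axis-aligned rectangles. The following is a minimal sketch of collapsing a quad into its enclosing rectangle. BoundingRect is a hypothetical helper, not part of the library, and it works on plain (x, y) tuples rather than the library's own types.

```csharp
using System;
using System.Linq;

// Four corners of a slightly rotated text region, clockwise from the top-left.
var quad = new (float x, float y)[] { (102f, 40f), (298f, 52f), (296f, 88f), (100f, 76f) };

var rect = BoundingRect(quad);
Console.WriteLine($"{rect.Left}, {rect.Top}, {rect.Width} x {rect.Height}");

// Hypothetical helper: smallest axis-aligned rectangle that contains all four corners.
(float Left, float Top, float Width, float Height) BoundingRect((float x, float y)[] corners)
{
    if (corners.Length != 4)
        throw new ArgumentException("Expected four corners.", nameof(corners));

    float left = corners.Min(p => p.x);
    float top = corners.Min(p => p.y);
    float right = corners.Max(p => p.x);
    float bottom = corners.Max(p => p.y);

    return (left, top, right - left, bottom - top);
}
```

Assuming Coordinates<float> exposes x and y as the loop above suggests, the corners could be projected with something like box.QuadBox.Select(c => (c.x, c.y)).ToArray(). The enclosing rectangle is always at least as large as the quad, so for heavily skewed text it will include some background.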

OCR a specific region only

If you already know where the text is (from a manual annotation, an object detection step, or a UI hit-test), use TaskTypes.REGION_TO_OCR to read text from just that area. Pass the region in textInput using <loc_…> tokens.

// Region encoded as <loc_x1><loc_y1><loc_x2><loc_y2> in 0..999 space
string region = "<loc_120><loc_400><loc_780><loc_460>";

using var image = File.OpenRead("receipt.jpg");
var results = model.Run(TaskTypes.REGION_TO_OCR, image, textInput: region);

Console.WriteLine(results.PureText);

Florence-2 quantises image coordinates into a 1000-bin grid (0–999) on each axis, so scale pixel coordinates by the image width and height before formatting the <loc_…> tokens. The library's BoxQuantizer exposes the quantisation; once you have the bin indices, emit the tokens with string formatting.
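
As a rough illustration of the pixel-to-token mapping, here is a minimal sketch that quantises by hand. ToBin and ToLocTokens are hypothetical helpers, and the floor-based rounding is an assumption. Prefer the library's BoxQuantizer where you can, since its rounding may differ at bin edges.

```csharp
using System;

const int Bins = 1000; // location tokens run from <loc_0> to <loc_999>

// Example: a 1600x1200 image with a region from (192, 480) to (1248, 552).
string region = ToLocTokens(192, 480, 1248, 552, imageWidth: 1600, imageHeight: 1200);
Console.WriteLine(region);

// Hypothetical helper: floor-quantise one pixel coordinate into a bin index.
// Assumption: floor rounding, clamped so a coordinate on the far edge stays in range.
int ToBin(float pixel, int imageSize)
{
    int bin = (int)Math.Floor(pixel * Bins / imageSize);
    return Math.Clamp(bin, 0, Bins - 1);
}

// Hypothetical helper: format a pixel rectangle as <loc_x1><loc_y1><loc_x2><loc_y2>.
string ToLocTokens(float x1, float y1, float x2, float y2, int imageWidth, int imageHeight) =>
    $"<loc_{ToBin(x1, imageWidth)}><loc_{ToBin(y1, imageHeight)}>" +
    $"<loc_{ToBin(x2, imageWidth)}><loc_{ToBin(y2, imageHeight)}>";
```

With these numbers the result is the same string used in the snippet above, "<loc_120><loc_400><loc_780><loc_460>". The x coordinates are scaled by the image width and the y coordinates by the height, so the same pixel rectangle produces different tokens on images of different sizes.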

Tips

  • Pre-rotate scanned documents. Florence-2 handles small skew well, but heavily rotated text (90° or more) confuses it. Detect orientation first if you can.
  • Resolution matters. Florence-2-base preprocesses images to 768×768. Tiny text becomes unreadable after the resize. For dense documents, consider tiling and running OCR per tile.
  • Numbers and codes can drift. Generative OCR sometimes "corrects" unfamiliar product codes or numbers towards more common patterns. For structured fields (SKUs, invoice IDs), validate with a regex or a check-digit pass.
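
The validation tip can be sketched as follows. The invoice-ID format ("INV-" plus six digits) is a made-up example, and IsValidEan13 is a hypothetical helper that applies the standard EAN-13 check-digit rule; substitute whatever formats your own documents use.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical invoice-ID format: "INV-" followed by exactly six ASCII digits.
var invoicePattern = new Regex(@"^INV-[0-9]{6}$");

Console.WriteLine(invoicePattern.IsMatch("INV-004217"));  // True
Console.WriteLine(invoicePattern.IsMatch("INV-0042I7"));  // False: letter I read in place of 1
Console.WriteLine(IsValidEan13("4006381333931"));         // True
Console.WriteLine(IsValidEan13("4006381333981"));         // False: one digit misread

// EAN-13 check digit: weight the first 12 digits 1,3,1,3,... and compare
// (10 - sum mod 10) mod 10 against the 13th digit.
bool IsValidEan13(string code)
{
    if (code.Length != 13 || !code.All(c => c >= '0' && c <= '9'))
        return false;

    int sum = 0;
    for (int i = 0; i < 12; i++)
    {
        int digit = code[i] - '0';
        sum += (i % 2 == 0) ? digit : digit * 3;
    }

    int check = (10 - sum % 10) % 10;
    return check == code[12] - '0';
}
```

A failed check only tells you the OCR output is suspect, not what the correct value is, so route failures to a retry (for example on a tighter crop) or to manual review.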

Where next?

Object detection

Detect what's in the image, not just the text.

Phrase grounding

Locate a region matching a natural-language description.

© 2026 Florence2-Sharp. All rights reserved.