Florence2-Sharp

Phrase grounding

Phrase grounding answers the question "where in this image is the thing I just described?" Florence-2 supports it via TaskTypes.CAPTION_TO_PHRASE_GROUNDING — pass a phrase as textInput and the model returns bounding boxes for the regions that match.

This is closely related to open-vocabulary detection. The difference is one of intent:

  • Open-vocabulary detection treats the prompt as a class name — "find every motorcycle."
  • Phrase grounding treats the prompt as a referring expression — "the red bicycle leaning on the wall."

In practice both tasks accept similar prompts. Grounding tends to work better when the description singles out a specific instance by its attributes ("the largest building"); detection works better for generic categories ("every building").

Example

using System;
using System.IO;
using Florence2;

var modelSource = new FlorenceModelDownloader("./models");
await modelSource.DownloadModelsAsync();

var model = new Florence2Model(modelSource);

using var image = File.OpenRead("street.jpg");

var results = model.Run(
    TaskTypes.CAPTION_TO_PHRASE_GROUNDING,
    image,
    textInput: "the red bicycle leaning against the wall");

foreach (var entry in results.BoundingBoxes)
{
    foreach (var box in entry.BBoxes)
    {
        Console.WriteLine($"\"{entry.Label}\" → ({box.xmin:F0}, {box.ymin:F0}) → ({box.xmax:F0}, {box.ymax:F0})");
    }
}

The result is returned in FlorenceResults.BoundingBoxes, the same shape that object detection uses: each entry carries a Label and its BBoxes. The Label typically echoes the input phrase, possibly in re-tokenised form.

Combining grounding with other tasks

Grounding pairs naturally with region-level follow-up tasks. A common pipeline:

  1. Ground the phrase → bounding box.
  2. Encode the bounding box as <loc_…> tokens via BoxQuantizer.
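  3. Run REGION_TO_OCR, REGION_TO_DESCRIPTION, or REGION_TO_SEGMENTATION on the grounded region.

// 1. Ground
using var image = File.OpenRead("invoice.jpg");
var grounding = model.Run(
    TaskTypes.CAPTION_TO_PHRASE_GROUNDING,
    image,
    textInput: "the total amount");

// The phrase may match nothing, so guard before indexing.
if (grounding.BoundingBoxes.Length == 0)
    throw new InvalidOperationException("No region matched the phrase.");

var box = grounding.BoundingBoxes[0].BBoxes[0];

// 2. Encode the box as <loc_x1><loc_y1><loc_x2><loc_y2> in 0..999 space.
//    Florence-2 normalises coordinates by image size, then quantises to 1000 bins.
//    imageWidth and imageHeight are placeholders for the source image's pixel
//    dimensions; obtain them from whichever image library you use.
int w = imageWidth, h = imageHeight;
string locTokens =
    $"<loc_{Math.Clamp((int)(box.xmin * 1000 / w), 0, 999)}>" +
    $"<loc_{Math.Clamp((int)(box.ymin * 1000 / h), 0, 999)}>" +
    $"<loc_{Math.Clamp((int)(box.xmax * 1000 / w), 0, 999)}>" +
    $"<loc_{Math.Clamp((int)(box.ymax * 1000 / h), 0, 999)}>";

// 3. OCR just that region. Rewind the stream so the second Run call
//    reads the image from the start.
image.Position = 0;
var ocr = model.Run(TaskTypes.REGION_TO_OCR, image, textInput: locTokens);
Console.WriteLine(ocr.PureText);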

This is the standard "find then read" pattern for structured document extraction.
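
The box-to-token step in the pipeline above can be isolated into a small helper, which makes the quantisation easy to check by hand. The sketch below assumes floor-and-clamp quantisation into 1000 bins per axis, which is believed to match the reference Florence-2 processor. LocTokens and Encode are hypothetical names used for illustration, not part of Florence2-Sharp, and the box is passed as plain floats so the snippet runs without the library.

```csharp
using System;

// Demo: a 640x480 image with a box from (64, 48) to (320, 240).
Console.WriteLine(LocTokens.Encode(64f, 48f, 320f, 240f, imageWidth: 640, imageHeight: 480));
// prints <loc_100><loc_100><loc_500><loc_500>

// A box touching the right and bottom edges is clamped into the last bin.
Console.WriteLine(LocTokens.Encode(0f, 0f, 640f, 480f, imageWidth: 640, imageHeight: 480));
// prints <loc_0><loc_0><loc_999><loc_999>

// Hypothetical helper, not part of Florence2-Sharp.
static class LocTokens
{
    const int Bins = 1000;

    // Map one pixel coordinate to a bin index in 0..999 (floor, then clamp).
    static int Quantize(float value, int size) =>
        Math.Clamp((int)Math.Floor(value * Bins / size), 0, Bins - 1);

    // Build the four-token region prompt in xmin, ymin, xmax, ymax order.
    public static string Encode(float xmin, float ymin, float xmax, float ymax,
                                int imageWidth, int imageHeight) =>
        $"<loc_{Quantize(xmin, imageWidth)}><loc_{Quantize(ymin, imageHeight)}>" +
        $"<loc_{Quantize(xmax, imageWidth)}><loc_{Quantize(ymax, imageHeight)}>";
}
```

Clamping matters at the image border: a coordinate equal to the image width would otherwise map to bin 1000, which is outside the 0..999 token range.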

Tips

  • Be specific. "The red bicycle" usually works; "bicycle" sometimes returns multiple boxes that aren't useful for grounding. Add discriminating attributes.
  • Watch for empty results. Florence-2 can return zero boxes if the phrase doesn't match anything in the image. Check BoundingBoxes.Length before indexing.
  • Phrases vs. sentences. Short noun phrases work better than full sentences. "A man wearing a hat" beats "There is a man in the image who is wearing a hat."
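
The empty-result check from the tips above can be wrapped in a small guard so that callers never index into an empty array. This is a minimal sketch: Box and LabeledBoxes are stand-in types that only mirror the Label/BBoxes shape used on this page, and GroundingGuards.TryGetFirstBox is a hypothetical helper, not part of Florence2-Sharp.

```csharp
using System;

// No matches: the phrase was not found in the image.
var empty = Array.Empty<LabeledBoxes>();
Console.WriteLine(GroundingGuards.TryGetFirstBox(empty, out _)); // prints False

// One match: safe to use the box.
var hit = new[] { new LabeledBoxes("the total amount", new[] { new Box(10, 20, 110, 60) }) };
if (GroundingGuards.TryGetFirstBox(hit, out var box))
    Console.WriteLine($"{box.xmin}, {box.ymin}, {box.xmax}, {box.ymax}");

// Stand-in types mirroring the result shape (hypothetical, for illustration only).
readonly record struct Box(float xmin, float ymin, float xmax, float ymax);
record LabeledBoxes(string Label, Box[] BBoxes);

static class GroundingGuards
{
    // Returns the first box of the first entry that has any boxes,
    // or false when the results are null, empty, or contain no boxes.
    public static bool TryGetFirstBox(LabeledBoxes[] results, out Box box)
    {
        foreach (var entry in results ?? Array.Empty<LabeledBoxes>())
        {
            if (entry.BBoxes is { Length: > 0 })
            {
                box = entry.BBoxes[0];
                return true;
            }
        }
        box = default;
        return false;
    }
}
```

Returning a bool with an out parameter follows the usual .NET TryGet convention and keeps the "no match" case an ordinary branch rather than an exception.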

When REFERRING_EXPRESSION_SEGMENTATION would be ideal — but isn't

For polygon-level grounding (e.g. "the silhouette of the dog, pixel-accurate") REFERRING_EXPRESSION_SEGMENTATION is the natural choice. It is enumerated in TaskTypes, but the model currently produces malformed location tokens for that task, so although the wrapper exposes it, the output is unreliable.

As a workaround, ground first with CAPTION_TO_PHRASE_GROUNDING, then call REGION_TO_SEGMENTATION on the resulting bounding box.

© 2026 Florence2-Sharp. All rights reserved.