Phrase grounding

Phrase grounding answers the question "where in this image is the thing I just described?" Florence-2 supports it via TaskTypes.CAPTION_TO_PHRASE_GROUNDING — pass a phrase as textInput and the model returns bounding boxes for the regions that match.

This is closely related to open-vocabulary detection. The difference is one of intent:

Open-vocabulary detection treats the prompt as a class name — "find every motorcycle."
Phrase grounding treats the prompt as a referring expression — "the red bicycle leaning on the wall."

In practice both tasks accept similar prompts. Grounding tends to work better when the description picks out one specific instance by its attributes ("the largest building"); detection works better for generic categories ("every building").

Example

using System;
using System.IO;
using Florence2;

var modelSource = new FlorenceModelDownloader("./models");
await modelSource.DownloadModelsAsync();

var model = new Florence2Model(modelSource);

using var image = File.OpenRead("street.jpg");

var results = model.Run(
    TaskTypes.CAPTION_TO_PHRASE_GROUNDING,
    image,
    textInput: "the red bicycle leaning against the wall");

foreach (var entry in results.BoundingBoxes)
{
    foreach (var box in entry.BBoxes)
    {
        Console.WriteLine($"\"{entry.Label}\" → ({box.xmin:F0}, {box.ymin:F0}) → ({box.xmax:F0}, {box.ymax:F0})");
    }
}

The result lives in FlorenceResults.BoundingBoxes — same shape as object detection. The Label typically echoes (a tokenisation of) the input phrase.

Combining grounding with other tasks

Grounding pairs naturally with region-level follow-up tasks. A common pipeline:

1. Ground the phrase → bounding box.
2. Encode the bounding box as <loc_…> tokens via BoxQuantizer.
3. Run REGION_TO_OCR, REGION_TO_DESCRIPTION, or REGION_TO_SEGMENTATION on the grounded region.

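// 1. Ground
using var image = File.OpenRead("invoice.jpg");
var grounding = model.Run(
    TaskTypes.CAPTION_TO_PHRASE_GROUNDING,
    image,
    textInput: "the total amount");

// Grounding can come back empty, so check before indexing.
if (grounding.BoundingBoxes.Length == 0)
{
    Console.WriteLine("No region matched the phrase.");
    return;
}

var box = grounding.BoundingBoxes[0].BBoxes[0];

// 2. Encode the box as <loc_x1><loc_y1><loc_x2><loc_y2> in 0..999 space.
//    Florence-2 normalises coordinates by image size before quantising to 1000 bins.
//    imageWidth and imageHeight are the pixel dimensions of invoice.jpg. They are
//    not defined in this snippet; read them with whatever image library you use.
int w = imageWidth, h = imageHeight;
string locTokens =
    $"<loc_{(int)(box.xmin * 999 / w)}>" +
    $"<loc_{(int)(box.ymin * 999 / h)}>" +
    $"<loc_{(int)(box.xmax * 999 / w)}>" +
    $"<loc_{(int)(box.ymax * 999 / h)}>";

// 3. OCR just that region.
//    Rewind the stream first: the grounding call has already read it to the end.
image.Position = 0;
var ocr = model.Run(TaskTypes.REGION_TO_OCR, image, textInput: locTokens);
Console.WriteLine(ocr.PureText);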

This is the standard "find then read" pattern for structured document extraction.
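
The encoding step is plain arithmetic, so it can be pulled into a small helper and checked on its own. The sketch below is not part of the Florence2 package: LocTokens, Quantize, and FromBox are hypothetical names. It assumes floor-and-clamp binning (pixel * 1000 / size, floored, then clamped to 0..999), which is how the reference Florence-2 processor is understood to quantise. The inline expression in the pipeline scales by 999 instead, so the two can differ by one bin. If the wrapper's BoxQuantizer is available to you, prefer it over either.

```csharp
using System;

// Hypothetical helper for illustration; not part of the Florence2 package.
public static class LocTokens
{
    // Florence-2 location vocabulary runs from <loc_0> to <loc_999>.
    private const int Bins = 1000;

    // Maps one pixel coordinate to a bin index.
    // Assumption: floor(pixel * 1000 / size), clamped to 0..999.
    public static int Quantize(double pixel, int sizeInPixels)
    {
        if (sizeInPixels <= 0)
            throw new ArgumentOutOfRangeException(nameof(sizeInPixels));

        int bin = (int)Math.Floor(pixel * Bins / sizeInPixels);
        return Math.Clamp(bin, 0, Bins - 1);
    }

    // Builds the four-token region prompt in x1, y1, x2, y2 order.
    public static string FromBox(
        double xmin, double ymin, double xmax, double ymax,
        int width, int height) =>
        $"<loc_{Quantize(xmin, width)}><loc_{Quantize(ymin, height)}>" +
        $"<loc_{Quantize(xmax, width)}><loc_{Quantize(ymax, height)}>";
}
```

For a box (100, 50, 400, 300) in an 800×600 image, LocTokens.FromBox(100, 50, 400, 300, 800, 600) yields <loc_125><loc_83><loc_500><loc_500>. The clamp matters at the far edge: a coordinate equal to the image width would otherwise map to bin 1000, which has no token.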

Tips

Be specific. "The red bicycle" usually works; "bicycle" sometimes returns multiple boxes that aren't useful for grounding. Add discriminating attributes.
Watch for empty results. Florence-2 can return zero boxes if the phrase doesn't match anything in the image. Check BoundingBoxes.Length before indexing.
Phrases vs. sentences. Short noun phrases work better than full sentences. "A man wearing a hat" beats "There is a man in the image who is wearing a hat."
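
The first two tips can be folded into one guard that handles both the empty case and the several-boxes case. The sketch below is an illustration only: Box is a stand-in record (requires C# 10) that assumes nothing beyond the four coordinate fields shown in the example, and BoxPicker.TryPickLargest is a hypothetical helper. Picking the largest box is just one possible tie-break; a tighter phrase is usually the better fix.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the wrapper's box type; only the four coordinates are assumed.
public readonly record struct Box(float xmin, float ymin, float xmax, float ymax)
{
    // Degenerate boxes (xmax < xmin or ymax < ymin) count as zero area.
    public float Area => Math.Max(0f, xmax - xmin) * Math.Max(0f, ymax - ymin);
}

// Hypothetical helper for illustration; not part of the Florence2 package.
public static class BoxPicker
{
    // Returns false when there is nothing to pick, so callers never index an empty list.
    public static bool TryPickLargest(IReadOnlyList<Box> boxes, out Box best)
    {
        best = default;
        if (boxes == null || boxes.Count == 0)
            return false;

        best = boxes[0];
        for (int i = 1; i < boxes.Count; i++)
        {
            if (boxes[i].Area > best.Area)
                best = boxes[i];
        }
        return true;
    }
}
```

A caller would copy the grounded boxes into a list of Box values, call TryPickLargest, and skip the follow-up region task when it returns false.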

When `REFERRING_EXPRESSION_SEGMENTATION` would be ideal — but isn't

For polygon-level grounding (e.g. "the silhouette of the dog, pixel-accurate") REFERRING_EXPRESSION_SEGMENTATION is the natural choice. It's enumerated in TaskTypes but the model currently produces malformed location tokens for that task — the wrapper exposes the field, but the output is unreliable.

As a workaround, ground first with CAPTION_TO_PHRASE_GROUNDING, then call REGION_TO_SEGMENTATION on the resulting bounding box.
