Phrase grounding
Phrase grounding answers the question "where in this image is the thing I just described?" Florence-2 supports it via TaskTypes.CAPTION_TO_PHRASE_GROUNDING — pass a phrase as textInput and the model returns bounding boxes for the regions that match.
This is closely related to open-vocabulary detection. The difference is one of intent:
- Open-vocabulary detection treats the prompt as a class name — "find every motorcycle."
- Phrase grounding treats the prompt as a referring expression — "the red bicycle leaning on the wall."
In practice both tasks accept similar prompts. Grounding tends to work better when the description singles out one instance by its attributes ("the largest building"); detection works better for generic categories ("every building").
Example
using System;
using System.IO;
using Florence2;

// Download the model files, then load the model.
var modelSource = new FlorenceModelDownloader("./models");
await modelSource.DownloadModelsAsync();
var model = new Florence2Model(modelSource);

using var image = File.OpenRead("street.jpg");
var results = model.Run(
    TaskTypes.CAPTION_TO_PHRASE_GROUNDING,
    image,
    textInput: "the red bicycle leaning against the wall");

// Each entry pairs a label with the boxes grounded for it.
foreach (var entry in results.BoundingBoxes)
{
    foreach (var box in entry.BBoxes)
    {
        Console.WriteLine($"\"{entry.Label}\" → ({box.xmin:F0}, {box.ymin:F0}) → ({box.xmax:F0}, {box.ymax:F0})");
    }
}
The result lives in FlorenceResults.BoundingBoxes — same shape as object detection. The Label typically echoes (a tokenisation of) the input phrase.
Combining grounding with other tasks
Grounding pairs naturally with region-level follow-up tasks. A common pipeline:
- Ground the phrase → bounding box.
- Encode the bounding box as <loc_…> tokens via BoxQuantizer.
- Run REGION_TO_OCR, REGION_TO_DESCRIPTION, or REGION_TO_SEGMENTATION on the grounded region.
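// Reuses `model` from the example above.
// 1. Ground
using var image = File.OpenRead("invoice.jpg");
var grounding = model.Run(
    TaskTypes.CAPTION_TO_PHRASE_GROUNDING,
    image,
    textInput: "the total amount");

// Grounding can return zero boxes, so check before indexing.
if (grounding.BoundingBoxes.Length == 0)
{
    Console.WriteLine("No region matched the phrase.");
    return;
}
var box = grounding.BoundingBoxes[0].BBoxes[0];

// 2. Encode the box as <loc_xxx><loc_yyy><loc_xxx><loc_yyy> in 0..999 space.
// Florence-2 normalises coordinates by image size before quantising to 1000 bins.
// imageWidth and imageHeight are placeholders for the pixel dimensions of invoice.jpg;
// read them with whatever imaging library you already use.
int w = imageWidth, h = imageHeight;
string locTokens =
    $"<loc_{(int)(box.xmin * 999 / w)}>" +
    $"<loc_{(int)(box.ymin * 999 / h)}>" +
    $"<loc_{(int)(box.xmax * 999 / w)}>" +
    $"<loc_{(int)(box.ymax * 999 / h)}>";

// 3. OCR just that region. Rewind the stream first, because the grounding pass already read it.
image.Position = 0;
var ocr = model.Run(TaskTypes.REGION_TO_OCR, image, textInput: locTokens);
Console.WriteLine(ocr.PureText);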
This is the standard "find then read" pattern for structured document extraction.
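The encoding step can be pulled into a small helper. The sketch below is illustrative only: Quantize and ToLocTokens are hypothetical names introduced here, not the package's BoxQuantizer. It assumes the intended rule is to floor each coordinate into one of 1000 equal-width bins and clamp to 0..999, so it can differ by one bin from the inline `* 999 / w` arithmetic. Check it against BoxQuantizer before relying on it.

```csharp
using System;

// Hypothetical helper (not part of the Florence2 package).
// Maps a pixel coordinate to one of `bins` equal-width bins: floor, then clamp.
static int Quantize(float value, int size, int bins = 1000)
{
    int bin = (int)Math.Floor(value * bins / (double)size);
    return Math.Clamp(bin, 0, bins - 1);
}

// Builds the four location tokens in xmin, ymin, xmax, ymax order.
static string ToLocTokens(float xmin, float ymin, float xmax, float ymax, int width, int height) =>
    $"<loc_{Quantize(xmin, width)}><loc_{Quantize(ymin, height)}>" +
    $"<loc_{Quantize(xmax, width)}><loc_{Quantize(ymax, height)}>";

// A box that touches the right and bottom edges of a 640x480 image.
Console.WriteLine(ToLocTokens(120f, 40f, 640f, 480f, width: 640, height: 480));
// prints <loc_187><loc_83><loc_999><loc_999>
```

The clamp matters for boxes that touch the right or bottom edge: without it, a coordinate equal to the image size would land in a non-existent bin 1000.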
Tips
- Be specific. "The red bicycle" usually works; "bicycle" sometimes returns multiple boxes that aren't useful for grounding. Add discriminating attributes.
- Watch for empty results. Florence-2 can return zero boxes if the phrase doesn't match anything in the image. Check BoundingBoxes.Length before indexing.
- Phrases vs. sentences. Short noun phrases work better than full sentences. "A man wearing a hat" beats "There is a man in the image who is wearing a hat."
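The empty-result check can be wrapped in a small guard so that indexing never throws. This is a minimal sketch: TryFirst is a hypothetical helper, not part of the Florence2 package, and the arrays below are stand-in data. It assumes the result collections are arrays or lists, so they can be passed as IReadOnlyList<T>.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the Florence2 package): returns false
// instead of throwing when a result collection is null or empty.
static bool TryFirst<T>(IReadOnlyList<T> items, out T first)
{
    if (items is null || items.Count == 0)
    {
        first = default!;
        return false;
    }
    first = items[0];
    return true;
}

// Stand-in data; real code would pass grounding.BoundingBoxes, then entry.BBoxes.
var noMatches = Array.Empty<float[]>();
var oneMatch = new[] { new[] { 12f, 30f, 200f, 180f } };

Console.WriteLine(TryFirst(noMatches, out _));        // False
Console.WriteLine(TryFirst(oneMatch, out var box));   // True
Console.WriteLine(string.Join(", ", box));            // 12, 30, 200, 180
```

Applying the guard twice, once for the label entries and once for the boxes inside the first entry, covers both ways a grounding result can come back empty.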
When REFERRING_EXPRESSION_SEGMENTATION would be ideal — but isn't
For polygon-level grounding (e.g. "the silhouette of the dog, pixel-accurate") REFERRING_EXPRESSION_SEGMENTATION is the natural choice. It is enumerated in TaskTypes, but the model currently produces malformed location tokens for that task. The wrapper exposes the field, but its output is unreliable.
As a workaround, ground first with CAPTION_TO_PHRASE_GROUNDING, then call REGION_TO_SEGMENTATION on the resulting bounding box.