NVIDIA · Computer Vision · Detection

LocateAnything

Most AI says where an object is by spelling out its location one number at a time. LocateAnything draws the whole box in a single stroke.

June 6, 2026 · 8 min read

the core idea

One number at a time, or all at once.

Here is the old way and the new way, side by side. Press play and watch how differently they finish the very same box.

[Interactive demo: parallel_box_decoding]

Old way: one at a time, 4 steps (x1, then y1, then x2, then y2)

LocateAnything: all at once, 1 step (x1, y1, x2, and y2 together)

Same box. One is patient typing, the other is a single snapshot.

three modes

Dial how careful it is.

You can choose how much it double-checks itself. Switch between the modes and watch the boxes behave differently.

[Interactive demo: inference_mode]

Hybrid: snap first, then carefully redraw only the shaky boxes. This is the default mode.

your turn

Ask it to find something.

Pick a question, or type your own, and watch it locate the answer on this scene.

[Interactive demo: locate]

speed and scale

Why it matters at scale.

Drawing boxes in one stroke is not just elegant, it is much faster, and the model learned from a striking amount of data.

Speed: about 10x faster

LocateAnything: 12.7 boxes per second

A comparable model: 1.1 boxes per second

What it learned from

12M images

138M questions it trained on

785M boxes it learned to draw

how it works

A little more, if you are curious.

Plain-language, no math. Here is the idea behind LocateAnything and why it is both faster and more precise.

The problem with one number at a time

A box is four numbers: left, top, right, bottom. Most vision-language models spell them out one token at a time, and each number is predicted almost on its own. That fights the geometry, because the four numbers describe a single rectangle and really belong together. It is also slow, because each number has to wait until the previous one has been written.

The idea: Parallel Box Decoding

LocateAnything treats a whole box (or a point) as a single atomic unit and produces all four corners in one step. Because the corners are decided together, the model keeps what the authors call intra-box geometric coherence: the rectangle stays sensible. You get two wins at once: far more parallelism (so it is much faster) and better high-IoU localization (the tight, precise boxes improve the most).
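For readers who want the metric made concrete, here is a minimal sketch of IoU (intersection over union), the standard overlap score behind the phrase "high-IoU". The function name and the (x1, y1, x2, y2) tuple layout are choices made for this illustration and are not taken from the LocateAnything code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not touch.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


truth = (100, 100, 300, 300)
print(iou(truth, (100, 100, 300, 300)))  # identical boxes
print(iou(truth, (120, 100, 320, 300)))  # same size, shifted 20 units right
```

Identical boxes score 1.0, while the shifted box scores about 0.82. A strict threshold such as 0.9 would reject the shifted box, which is why small errors in a single number hurt most on the tight, high-overlap cases.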

Learning two ways at once

During training the model practices both styles in parallel. One stream is the careful, one-at-a-time (next-token) style that preserves step-by-step reasoning; the other is the all-at-once (block) style that teaches it to emit a whole box together. They share what they see in the image but are otherwise kept apart, so the fast style cannot quietly lean on the slow one. An attention mask sets the rules: inside a single box the four corners may look at each other, while between boxes a box may look back at finished boxes but not at ones still to come.
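The masking rule in the last sentence can be sketched in a few lines. This is a toy illustration and not the authors' implementation: it covers only the box tokens, leaves out the image and text tokens and the separation between the two training streams, and both the function name and the box_ids encoding are assumptions made here.

```python
def box_attention_mask(box_ids):
    """Return mask[i][j] == True when token i may attend to token j.

    box_ids[k] is the index of the box that token k belongs to, in
    output order (an encoding assumed for this sketch).
    """
    n = len(box_ids)
    return [
        # Same box: visible in both directions.
        # Earlier box: visible. Later box: hidden.
        [box_ids[j] <= box_ids[i] for j in range(n)]
        for i in range(n)
    ]


# Two boxes with four corner tokens each.
mask = box_attention_mask([0, 0, 0, 0, 1, 1, 1, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The first four rows print as `xxxx....` (the first box sees only itself) and the last four as `xxxxxxxx` (the second box sees itself and the finished first box).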

Three modes, and how Hybrid decides

The same model offers three speeds. Slow draws every box number by number (highest quality). Fast snaps every box in one step (quickest). Hybrid snaps by default and only slows down for a box that looks shaky. A box counts as shaky when two things happen together: the model is not confident about a corner, and the likely positions are spread far apart, the tell-tale sign of a crowded, ambiguous area where a quick guess would probably be wrong.

For the technically curious: Hybrid falls back when top-1 confidence is below 0.7 and the top-5 spread is above 80.
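The fallback rule can be written as a small predicate. How the "top-5 spread" is measured is not spelled out here, so this sketch assumes it is the distance between the lowest and highest of the five most likely positions on the model's coordinate grid. The function name and the dict input format are also hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.7  # top-1 probability below this counts as "not confident"
SPREAD_THRESHOLD = 80       # top-5 spread above this counts as "spread far apart"


def is_shaky(position_probs):
    """Decide whether one corner should fall back to slow decoding.

    position_probs maps a candidate position to its probability.
    """
    top5 = sorted(position_probs.items(), key=lambda kv: kv[1], reverse=True)[:5]
    top1_confidence = top5[0][1]
    positions = [position for position, _ in top5]
    spread = max(positions) - min(positions)  # assumed definition of "spread"
    # Both conditions must hold, as stated in the text above.
    return top1_confidence < CONFIDENCE_THRESHOLD and spread > SPREAD_THRESHOLD


print(is_shaky({500: 0.90, 501: 0.06, 499: 0.04}))  # confident: keep the fast box
print(is_shaky({500: 0.40, 501: 0.35, 502: 0.25}))  # unsure but tightly clustered
print(is_shaky({100: 0.40, 300: 0.35, 700: 0.25}))  # unsure and spread out
```

Only the third case meets both conditions, so only it would be redrawn in slow mode under these assumptions.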

What it learned from

LocateAnything-Data is huge: about 12M images, 138M natural-language queries, and 785M annotated boxes across six domains: everyday object detection, GUI screenshots, referring expressions ("the red car"), text and OCR, document layout, and single points. Crucially, it includes negative samples (queries with no matching object), which teach the model to answer "nothing here" instead of inventing a box.

Why it matters

Treating boxes as whole units pays off. A 3B-parameter LocateAnything beats much larger 32B specialists at GUI grounding. It decodes around 12.7 boxes per second versus roughly 1.1 for a comparable model, about ten times faster, while also improving the tight, high-overlap boxes. In short, it pushes the speed-accuracy frontier instead of trading one for the other.

Read the full paper on arXiv

run it yourself

How to actually run it.

Everything below is taken straight from the official NVIDIA repo, model card, and paper. Nothing here is invented. The API is a thin worker around the model.

Install

Clone the Eagle repo, move into the Embodied folder, and install in editable mode. The key pin is transformers==4.57.1 (accelerate, deepspeed, timm, and peft come along automatically).

bash

git clone https://github.com/NVlabs/Eagle.git eagle
cd eagle/Embodied
pip install -e .

Quick start

Load the model once into a worker, then call one task method per question. Each method returns a dict whose answer field is the raw string with box tokens.

python

import torch
from PIL import Image
from locateanything_worker import LocateAnythingWorker

# Loads nvidia/LocateAnything-3B in BF16 on CUDA
worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("example.jpg").convert("RGB")

# Object detection: pass a list of categories
print(worker.detect(img, ["person", "car", "bicycle"])["answer"])

# Phrase grounding: a free-form description
print(worker.ground_multi(img, "people wearing red shirts")["answer"])

# GUI grounding, returned as a point
print(worker.ground_gui(img, "the search button", output_type="point")["answer"])

# Pointing
print(worker.point(img, "the traffic light")["answer"])

# Decoding mode: "fast" (MTP), "slow" (NTP), or "hybrid" (default)
print(worker.detect(img, ["person"], generation_mode="fast")["answer"])

Every task method forwards generation_mode: fast (MTP) for maximum throughput, slow (NTP) for maximum stability, or hybrid (the default) which snaps fast and rechecks only the shaky boxes.
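To see the trade-off on your own images, a small timing loop is enough. This sketch relies only on the detect call and generation_mode values shown above; the helper name compare_modes and the shape of its return value are made up for illustration, and wall-clock timing of a single call is a rough guide, not a benchmark.

```python
import time


def compare_modes(worker, img, categories, modes=("fast", "hybrid", "slow")):
    """Run the same detection in each decoding mode and time it."""
    results = {}
    for mode in modes:
        start = time.perf_counter()
        output = worker.detect(img, categories, generation_mode=mode)
        results[mode] = {
            "seconds": time.perf_counter() - start,
            "answer": output["answer"],
        }
    return results


# Usage, assuming `worker` and `img` were created as in the quick start:
# for mode, result in compare_modes(worker, img, ["person"]).items():
#     print(mode, round(result["seconds"], 2), result["answer"])
```

Run each mode more than once before comparing numbers, since the first call typically includes one-time warm-up costs.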

What comes back

The model speaks in special tokens. Coordinates are integers in [0, 1000], so they are resolution-independent: divide by 1000, then multiply by the image width or height to get pixels.

output

Bounding box:  <ref>label</ref><box><x1><y1><x2><y2></box>
Point:         <box><x><y></box>
No object:     <box>none</box>

# coordinates are integers in [0, 1000]

Turn tokens into pixels

A few lines of regex map the [0, 1000] integers onto your image. The worker also ships parse_boxes and parse_points, so you rarely write this yourself.

python

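import re

def parse_boxes(answer, image_width, image_height):
    """Convert <box> tokens in a raw answer string into pixel-space boxes."""
    boxes = []
    # Four integer coordinates in [0, 1000], each wrapped in angle brackets.
    pattern = r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
    for m in re.finditer(pattern, answer):
        x1, y1, x2, y2 = (int(g) for g in m.groups())
        # Rescale from the [0, 1000] grid to pixels.
        boxes.append({
            "x1": x1 / 1000 * image_width,
            "y1": y1 / 1000 * image_height,
            "x2": x2 / 1000 * image_width,
            "y2": y2 / 1000 * image_height,
        })
    return boxes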
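The same approach extends to the other two output shapes listed earlier, points and the "no object" answer, and to keeping the label next to each box. The sketch below is illustrative only: the function names are hypothetical (they are not the worker's own parse_boxes and parse_points), and it assumes a label applies only to the box that immediately follows its `<ref>` tag.

```python
import re

# A label, if present, sits in <ref>...</ref> directly before the box.
LABELED_BOX_RE = re.compile(
    r"(?:<ref>([^<]*)</ref>)?<box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
)
# A point has exactly two coordinates, so four-number boxes do not match.
POINT_RE = re.compile(r"<box><(\d+)><(\d+)></box>")


def parse_labeled_boxes_sketch(answer, image_width, image_height):
    """Return pixel-space boxes with their labels (None when unlabeled)."""
    boxes = []
    for m in LABELED_BOX_RE.finditer(answer):
        x1, y1, x2, y2 = (int(g) for g in m.groups()[1:])
        boxes.append({
            "label": m.group(1),
            "x1": x1 / 1000 * image_width,
            "y1": y1 / 1000 * image_height,
            "x2": x2 / 1000 * image_width,
            "y2": y2 / 1000 * image_height,
        })
    return boxes


def parse_points_sketch(answer, image_width, image_height):
    """Return pixel-space points; '<box>none</box>' yields an empty list."""
    points = []
    for m in POINT_RE.finditer(answer):
        x, y = (int(g) for g in m.groups())
        points.append({
            "x": x / 1000 * image_width,
            "y": y / 1000 * image_height,
        })
    return points


print(parse_labeled_boxes_sketch(
    "<ref>car</ref><box><100><200><300><400></box>", 1000, 500))
print(parse_points_sketch("<box><500><250></box>", 640, 480))
print(parse_points_sketch("<box>none</box>", 640, 480))
```

An empty list is a deliberate way to represent "nothing here": callers can loop over the result without a special case for negative answers.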

Where it runs

A 3B model, so it is not demanding. Here is exactly what the repo and model card state about hardware. Numbers that are not published are marked as such, not guessed.

Component	What the repo says
GPU (recommended)	H100 (Hopper). The paper reports 12.7 boxes per second on a single H100.
GPU (supported)	A100, L40 and similar. Use --attn_implementation sdpa, which handles sequences up to about 4K tokens.
Long-context attention	MagiAttention, for 16K to 32K+ token sequences. It supports Hopper and Blackwell GPUs only.
Precision	BF16. The worker loads the model with dtype=torch.bfloat16 by default.
Core dependency	transformers==4.57.1, plus accelerate, deepspeed, timm, and peft (auto-installed).
Model	nvidia/LocateAnything-3B: a Qwen2.5 decoder and a Moon-ViT vision encoder, about 3B parameters.
License	Code is Apache 2.0. Model weights are under the NVIDIA License (see LICENSE_MODEL).

The repo does not publish an exact VRAM figure. As a 3B model in BF16 it fits comfortably on a single modern data-center GPU; the real memory cost depends on image resolution and sequence length.
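A back-of-the-envelope figure for the weights alone follows from the precision: BF16 stores each parameter in 2 bytes. The sketch below is arithmetic, not a measurement. It treats "about 3B" as exactly 3 billion parameters and leaves out activations, the attention cache, and framework overhead, all of which grow with image resolution and sequence length.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory needed to hold the weights alone, in GiB (2 bytes each for BF16)."""
    return num_params * bytes_per_param / 2**30


print(f"{weight_memory_gib(3e9):.1f} GiB")  # prints "5.6 GiB"
```

Treat this as a floor on memory use: actual usage at inference time will be higher.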

Go to the source

HF model · Live demo · GitHub · Paper

the takeaway

Stop spelling out boxes. Start snapping them.

That is the whole idea. Treat a box as one thing, draw it in one move, and only slow down when you actually need to.