LocateAnything
Most AI says where an object is by spelling out its location one number at a time. LocateAnything draws the whole box in a single stroke.
the core idea
One number at a time, or all at once.
Here is the old way and the new way, side by side. Press play and watch how differently they finish the very same box.
three modes
Dial how careful it is.
You can choose how much it double-checks itself. Switch between the modes and watch the boxes behave differently.
your turn
Ask it to find something.
Pick a question, or type your own, and watch it locate the answer on this scene.
speed and scale
Why it matters at scale.
Drawing boxes in one stroke is not just elegant; it is much faster. The model also learned from a striking amount of data.
12.7 boxes per second
about 12M images
138M questions it trained on
785M boxes it learned to draw
how it works
A little more, if you are curious.
Plain-language, no math. Here is the idea behind LocateAnything and why it is both faster and more precise.
The problem with one number at a time
A box is four numbers: left, top, right, bottom. Most vision-language models spell them out one token at a time, and each number is predicted almost on its own. That fights the geometry: the four numbers describe a single rectangle and belong together. It is also slow, because each number has to wait until the previous one has been written.
The idea: Parallel Box Decoding
LocateAnything treats a whole box (or a point) as a single atomic unit and produces all of its coordinates in one step. Because the coordinates are decided together, the model keeps what the authors call intra-box geometric coherence: the rectangle stays sensible. You get two wins at once: far more parallelism, so decoding is much faster, and better high-IoU localization, so the tight, precise boxes improve the most.
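To make the parallelism concrete, here is a toy step count. It is only a sketch: it assumes one decoding step per emitted unit and ignores labels and delimiter tokens, and the function names are illustrative, not part of LocateAnything.

```python
COORDS_PER_BOX = 4  # left, top, right, bottom


def sequential_steps(num_boxes: int) -> int:
    """Decoding steps when every coordinate is written as its own token."""
    return num_boxes * COORDS_PER_BOX


def parallel_steps(num_boxes: int) -> int:
    """Decoding steps when each box is emitted as one atomic unit."""
    return num_boxes


for n in (1, 10, 50):
    print(n, sequential_steps(n), parallel_steps(n))
```

This counts steps only. Real throughput also depends on the hardware and on everything else the model has to write.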
Learning two ways at once
During training the model practices both styles in parallel. One stream is the careful, one-at-a-time (next-token) style that preserves step-by-step reasoning; the other is the all-at-once (block) style that teaches it to emit a whole box together. They share what they see in the image but are otherwise kept apart, so the fast style cannot quietly lean on the slow one. An attention mask sets the rules: inside a single box the four coordinates may look at each other, while between boxes a box may look back at finished boxes but not at ones still to come.
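The masking rule for box tokens can be sketched in a few lines. This is an illustration of the rule as described, not the authors' implementation: it covers only the box tokens, leaves out image and prompt tokens and the separation between the two training streams, and the name box_attention_mask is made up for the example.

```python
def box_attention_mask(box_ids: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    box_ids[k] is the index of the box that token k belongs to, in
    output order (0 for the first box, 1 for the second, and so on).
    Tokens see their own box fully and every earlier box, never a later one.
    """
    return [[kj <= ki for kj in box_ids] for ki in box_ids]


# Two boxes, four coordinate tokens each.
ids = [0, 0, 0, 0, 1, 1, 1, 1]
mask = box_attention_mask(ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The first four rows print `xxxx....` (the first box sees only itself) and the last four print `xxxxxxxx` (the second box sees itself and the finished first box).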
Three modes, and how Hybrid decides
The same model offers three speeds. Slow draws every box number by number (highest quality). Fast snaps every box in one step (quickest). Hybrid snaps by default and only slows down for a box that looks shaky. A box counts as shaky when two things happen together: the model is not confident about a coordinate, and the likely positions are spread far apart, the tell-tale sign of a crowded, ambiguous area where a quick guess would probably be wrong.
For the technically curious: Hybrid falls back when top-1 confidence is below 0.7 and the top-5 spread is above 80.
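As a rough sketch, the fallback test might look like the function below. The two thresholds come from the text. Everything else is an assumption for illustration: the function name, the input shape, and above all the definition of spread, taken here as the distance between the smallest and largest of the five most likely positions. The sample values are made up.

```python
CONF_THRESHOLD = 0.7   # top-1 confidence cutoff from the text
SPREAD_THRESHOLD = 80  # top-5 spread cutoff from the text


def needs_slow_fallback(top5: list[tuple[int, float]]) -> bool:
    """Return True when one coordinate looks shaky.

    top5 holds (position, probability) pairs for the five most likely
    positions. Spread is ASSUMED to be max position minus min position.
    """
    top1_conf = max(prob for _, prob in top5)
    positions = [pos for pos, _ in top5]
    spread = max(positions) - min(positions)
    # Both conditions must hold together, as the text describes.
    return top1_conf < CONF_THRESHOLD and spread > SPREAD_THRESHOLD


# Made-up example distributions, not model output.
confident = [(412, 0.93), (413, 0.03), (411, 0.02), (414, 0.01), (410, 0.01)]
ambiguous = [(412, 0.35), (640, 0.30), (415, 0.15), (638, 0.12), (200, 0.08)]
print(needs_slow_fallback(confident))  # False
print(needs_slow_fallback(ambiguous))  # True
```

Requiring both conditions means a coordinate that is uncertain but tightly clustered still takes the fast path, since a small error there costs little.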
What it learned from
LocateAnything-Data is huge: about 12M images, 138M natural-language queries, and 785M annotated boxes. It spans six domains: everyday object detection, GUI screenshots, referring expressions ("the red car"), text and OCR, document layout, and single points. Crucially, it includes negative samples (queries with no matching object), which teach the model to answer "nothing here" instead of inventing a box.
Why it matters
Treating boxes as whole units pays off. A 3B-parameter LocateAnything beats much larger 32B specialists at GUI grounding, and it decodes around 12.7 boxes per second versus roughly 1.1 for a comparable model, more than ten times faster, while also improving the tight, high-overlap boxes. In short, it pushes the speed-accuracy frontier instead of trading one for the other.
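The two throughput figures translate into wall-clock time like this. The sketch assumes a constant decoding rate, which real workloads will not hold exactly, and the helper name seconds_for is illustrative.

```python
FAST_BOXES_PER_SEC = 12.7     # LocateAnything, as quoted above
BASELINE_BOXES_PER_SEC = 1.1  # comparable model, as quoted above


def seconds_for(num_boxes: int, boxes_per_sec: float) -> float:
    """Time to decode num_boxes at a constant throughput."""
    return num_boxes / boxes_per_sec


speedup = FAST_BOXES_PER_SEC / BASELINE_BOXES_PER_SEC
print(f"speedup: {speedup:.1f}x")  # speedup: 11.5x
print(f"100 boxes: {seconds_for(100, FAST_BOXES_PER_SEC):.1f} s "
      f"vs {seconds_for(100, BASELINE_BOXES_PER_SEC):.1f} s")  # 7.9 s vs 90.9 s
```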
run it yourself
How to actually run it.
Everything below is taken straight from the official NVIDIA repo, model card, and paper. Nothing here is invented. The API is a thin worker around the model.
Clone the Eagle repo, move into the Embodied folder, and install in editable mode. The key pin is transformers==4.57.1 (accelerate, deepspeed, timm, and peft come along automatically).
git clone https://github.com/NVlabs/Eagle.git eagle
cd eagle/Embodied
pip install -e .
Load the model once into a worker, then call one task method per question. Each method returns a dict whose answer field is the raw string with box tokens.
import torch
from PIL import Image
from locateanything_worker import LocateAnythingWorker
# Loads nvidia/LocateAnything-3B in BF16 on CUDA
worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("example.jpg").convert("RGB")
# Object detection: pass a list of categories
print(worker.detect(img, ["person", "car", "bicycle"])["answer"])
# Phrase grounding: a free-form description
print(worker.ground_multi(img, "people wearing red shirts")["answer"])
# GUI grounding, returned as a point
print(worker.ground_gui(img, "the search button", output_type="point")["answer"])
# Pointing
print(worker.point(img, "the traffic light")["answer"])
# Decoding mode: "fast" (MTP), "slow" (NTP), or "hybrid" (default)
print(worker.detect(img, ["person"], generation_mode="fast")["answer"])
Every task method forwards generation_mode: fast (MTP) for maximum throughput, slow (NTP) for maximum stability, or hybrid (the default), which snaps fast and rechecks only the shaky boxes.
The model speaks in special tokens. Coordinates are integers in [0, 1000], so they are resolution-independent: divide by 1000, then multiply by the image width or height to get pixels.
Bounding box: <ref>label</ref><box><x1><y1><x2><y2></box>
Point: <box><x><y></box>
No object: <box>none</box>
# coordinates are integers in [0, 1000]
A few lines of regex map the [0, 1000] integers onto your image. The worker also ships parse_boxes and parse_points, so you rarely write this yourself.
import re
def parse_boxes(answer, image_width, image_height):
    """Convert <box> tokens in a raw answer string to pixel coordinates."""
    boxes = []
    # Four integer coordinates in [0, 1000], in x1, y1, x2, y2 order.
    pattern = r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
    for m in re.finditer(pattern, answer):
        x1, y1, x2, y2 = (int(g) for g in m.groups())
        # Scale x by the image width and y by the image height.
        boxes.append({
            "x1": x1 / 1000 * image_width,
            "y1": y1 / 1000 * image_height,
            "x2": x2 / 1000 * image_width,
            "y2": y2 / 1000 * image_height,
        })
    return boxes
It is a 3B model, so it is not demanding. Here is exactly what the repo and model card state about hardware. Numbers that are not published are marked as such, not guessed.
| Component | What the repo says |
|---|---|
| GPU (recommended) | H100 (Hopper). The paper reports 12.7 boxes per second on a single H100. |
| GPU (supported) | A100, L40 and similar. Use --attn_implementation sdpa, which handles sequences up to about 4K tokens. |
| Long-context attention | MagiAttention, for 16K to 32K+ token sequences. It supports Hopper and Blackwell GPUs only. |
| Precision | BF16. The worker loads the model with dtype=torch.bfloat16 by default. |
| Core dependency | transformers==4.57.1, plus accelerate, deepspeed, timm, and peft (auto-installed). |
| Model | nvidia/LocateAnything-3B: a Qwen2.5 decoder and a Moon-ViT vision encoder, about 3B parameters. |
| License | Code is Apache 2.0. Model weights are under the NVIDIA License (see LICENSE_MODEL). |
The repo does not publish an exact VRAM figure. As a 3B model in BF16 it fits comfortably on a single modern data-center GPU; the real memory cost depends on image resolution and sequence length.
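For a rough lower bound, the weights alone can be sized with simple arithmetic. This is an estimate, not a published VRAM figure: it assumes 2 bytes per parameter for BF16 and the table's "about 3B parameters", and it leaves out activations, the KV cache, and framework overhead. The function name is illustrative.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; BF16 stores 2 bytes per parameter."""
    return num_params * bytes_per_param / 2**30


# "About 3B parameters" is the table's figure; the exact count may differ.
print(f"{weight_memory_gib(3e9):.1f} GiB")  # 5.6 GiB
```

Actual usage will be higher and grows with image resolution and sequence length.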
the takeaway
Stop spelling out boxes. Start snapping them.
That is the whole idea. Treat a box as one thing, draw it in one move, and only slow down when you actually need to.