Foundation Models for Open-Vocabulary Perception

Duration: 55 min · Level: Intermediate · Module: 4. Perception & Spatial Intelligence · Focus: foundation-models, detection, segmentation, open-vocabulary

Ask a nurse to "bring me the blood pressure cuff" and she finds it instantly, even on a cart she has never seen. Until recently, a robot could not: its detector recognized only the classes it had been trained on, and a cuff outside that list was invisible. Foundation models broke this open. Together, models like Grounded DINO, OWL-ViT, and SAM 2 enable open-vocabulary perception: "find the orange pill bottle" works with no training on pill bottles. This lesson shows you how to assemble these models into a perception pipeline that turns a spoken object name into a 3D grasp target.

The shift from closed to open vocabulary

The old paradigm was closed-vocabulary. You picked your object classes in advance, collected labeled examples of each, and trained a detector that could recognize those classes and nothing else. Adding a new object meant a new dataset and a new training run. For a humanoid expected to operate in the messy, unbounded world of a hospital or home — where the set of objects it might be asked to fetch is effectively infinite — this does not scale.

Open-vocabulary models invert the relationship between language and vision. Instead of a fixed list of labels, they take a free-text description and find whatever matches it. The enabling trick across all of them is image-text alignment: during pretraining, the model learns a shared space where the picture of a pill bottle and the words "pill bottle" land near each other. Recognition then becomes a matter of comparing your text query to image regions — which works for objects never explicitly labeled during training. This is why "find the orange pill bottle" succeeds without a single pill-bottle training example.
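To make the shared-space idea concrete, here is a minimal sketch of the comparison step alone. The 3-dimensional vectors are made-up stand-ins for real encoder outputs, and `rank_regions` is a hypothetical helper, not part of any of the models discussed here. It only shows that once text and image regions live in the same space, recognition reduces to a similarity ranking.

```python
import numpy as np

def rank_regions(text_emb: np.ndarray, region_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one text embedding and each region embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    return r @ t

# Toy embeddings (invented values); a real model produces hundreds of dimensions.
text_query = np.array([0.9, 0.1, 0.0])   # stands in for "orange pill bottle"
regions = np.array([
    [0.8, 0.2, 0.1],   # region 0: resembles the query
    [0.0, 0.1, 0.9],   # region 1: unrelated object
    [0.1, 0.9, 0.2],   # region 2: unrelated object
])

scores = rank_regions(text_query, regions)
best = int(np.argmax(scores))   # index of the region closest to the text
```

Nothing in this ranking refers to a label list, which is why a phrase the model was never explicitly trained on can still pick out the right region.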

The three models and what each does best

You will mix and match three families; knowing each one's job prevents you from using the wrong tool.

Grounded DINO (published as Grounding DINO, Liu et al., 2023) is the detector. It marries DINO, a DETR-style transformer detector, with a BERT text encoder, building on the grounded pre-training of Li et al., so it accepts a natural-language phrase and returns bounding boxes for any object that phrase describes, with strong zero-shot performance. When you need to locate an object you can only describe in words, this is the front of your pipeline.

SAM 2 (Ravi et al., Meta, 2024) is the segmenter. Segment Anything Model 2 produces pixel-perfect masks for images and video in real time, given a prompt as light as a single point click or a bounding box. It does not know object names; it knows boundaries. Hand it the box from Grounded DINO and it carves out the exact silhouette of the target.

OWL-ViT (Google, 2022) is an alternative open-vocabulary detector built on CLIP-style image-text alignment. It serves the same role as Grounded DINO — finding novel objects from text — and is worth knowing as a substitute or comparison point when you are evaluating detectors.

The honest recommendation: for a manipulation pipeline, pair Grounded DINO for detection with SAM 2 for segmentation. Detection alone gives you a coarse box; manipulation needs the precise mask so the gripper contacts the object and not its neighbors. OWL-ViT is a reasonable detector swap if you want to A/B it, but the DINO-plus-SAM combination is the proven workhorse.
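A small synthetic example shows why the box alone is not enough. The depth patch below is invented for illustration: a thin object lies diagonally across a bounding box, so most pixels inside the box belong to whatever is behind or beside it.

```python
import numpy as np

# Synthetic 6x6 depth patch in meters (made-up values): a thin object at
# 0.50 m lying diagonally across a background at 1.20 m.
depth = np.full((6, 6), 1.20)
mask = np.eye(6, dtype=bool)     # object pixels: the main diagonal
depth[mask] = 0.50

box_depth = float(np.median(depth))          # every pixel inside the box
mask_depth = float(np.median(depth[mask]))   # object pixels only
```

Here `box_depth` comes out at the background distance and `mask_depth` at the object distance. A gripper sent to the box estimate would reach past the object, which is the failure the mask prevents.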

Composing the pipeline: from words to a grasp

Individually these models detect or segment. Composed, they turn language into action. The pipeline is a clean handoff:

Grounded DINO detects the named object and emits a bounding box.
SAM 2 segments within that box, producing a precise mask.
The depth image (from your RGB-D camera, Lesson 4.1) reads off the 3D position of the masked pixels.
The robot plans a grasp to that 3D location.
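The four steps above can be sketched as a chain of injected stages. This is a wiring sketch under stated assumptions, not the API of any model: `GraspPipeline` and its field names are hypothetical, and the lambdas are stubs that let you exercise the handoff without loading any weights. In practice each callable would wrap the real detector, segmenter, depth lookup, and planner.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Box = Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max) in pixels
Point3D = Tuple[float, float, float]   # (x, y, z) in the camera frame, meters

@dataclass
class GraspPipeline:
    """Chains the four stages; each one is injected so it can be swapped."""
    detect: Callable[[object, str], Optional[Box]]    # text phrase -> box
    segment: Callable[[object, Box], object]          # box -> mask
    lift_to_3d: Callable[[object, object], Point3D]   # mask + depth -> 3D point
    plan_grasp: Callable[[Point3D], str]              # 3D point -> grasp command

    def run(self, rgb, depth, phrase: str) -> Optional[str]:
        box = self.detect(rgb, phrase)
        if box is None:          # nothing matched the phrase: do not grasp
            return None
        mask = self.segment(rgb, box)
        target = self.lift_to_3d(mask, depth)
        return self.plan_grasp(target)

# Stub stages with invented outputs, standing in for the real models.
pipeline = GraspPipeline(
    detect=lambda rgb, phrase: (10, 20, 50, 80) if phrase == "blood pressure cuff" else None,
    segment=lambda rgb, box: {"box": box},
    lift_to_3d=lambda mask, depth: (0.1, -0.05, 0.6),
    plan_grasp=lambda p: f"grasp at {p}",
)

result = pipeline.run(rgb=None, depth=None, phrase="blood pressure cuff")
missing = pipeline.run(rgb=None, depth=None, phrase="unicorn")
```

Injecting the stages keeps the detector swappable, so trying OWL-ViT in place of Grounded DINO means changing one argument rather than the pipeline.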

This is the bridge from Lesson 4.1's raw sensing through to manipulation. Notice how each prior lesson feeds in: the RGB-D camera supplies both the color image the foundation models reason over and the depth that lifts a 2D mask into a 3D target. Perception is not one model but a stack, and the foundation models sit at the semantic top of it.
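Lifting a mask into 3D can be sketched with the standard pinhole back-projection. The sketch assumes the depth image is already registered to the color image and that you have the camera intrinsics `fx`, `fy`, `cx`, `cy` from calibration; the function name and the choice of a per-axis median are illustrative assumptions, not something prescribed by the models.

```python
import numpy as np

def mask_to_point(mask, depth_m, fx, fy, cx, cy):
    """Back-project masked depth pixels (pinhole model) and return the median 3D point."""
    valid = mask & np.isfinite(depth_m) & (depth_m > 0)   # skip holes in the depth image
    if not valid.any():
        return None
    v, u = np.nonzero(valid)        # pixel rows (v) and columns (u)
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.median(np.stack([x, y, z], axis=1), axis=0)

# Toy scene (invented values): a 3x3 object at 0.8 m centered on the optical axis.
depth = np.full((5, 5), 0.8)
depth[1, 1] = 0.0                   # one missing depth reading inside the object
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True

point = mask_to_point(mask, depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
```

The median is used instead of the mean so that a few stray pixels, for example a mask edge that leaks onto the background, do not drag the target away from the object.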

The concrete payoff: tell G1 "bring me the blood pressure cuff" and it identifies the cuff in a cluttered medical cart, segments it precisely, and grasps it without disturbing the surrounding equipment — because the mask told it exactly where the cuff ends and the rest of the cart begins.

Will it run fast enough on the robot?

A pipeline that takes a second per frame is useless for manipulation, so latency is a first-class design concern. On an NVIDIA Jetson AGX Orin (the "Orin AGX", the on-board compute class for a humanoid), Grounded DINO runs around 15 FPS and SAM 2 around 30 FPS. The two stages run in sequence, so their per-frame times add: about 67 ms plus 33 ms caps the pipeline near 10 Hz, which is consistent with the quoted combined rate of roughly 8 to 10 Hz once some overhead is included. That is sufficient for manipulation at human hand speed: the hand does not move so fast that 8-10 perception updates per second leave it blind between frames.
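To redo this estimate for other stage rates, the arithmetic can be written as a short helper. It assumes the stages run strictly one after another on each frame; `combined_rate_hz` is a hypothetical name, and the 25 ms overhead is an invented value chosen only to show how the lower end of the range can arise.

```python
def combined_rate_hz(stage_fps, overhead_s=0.0):
    """Rate of stages run back to back: per-frame latencies add, rates do not."""
    latency_s = sum(1.0 / fps for fps in stage_fps) + overhead_s
    return 1.0 / latency_s

ideal = combined_rate_hz([15, 30])                            # 1/15 s + 1/30 s per frame
with_overhead = combined_rate_hz([15, 30], overhead_s=0.025)  # plus 25 ms (assumed)
```

With the lesson's figures, `ideal` is 10 Hz and `with_overhead` is 8 Hz, bracketing the quoted range.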

This number is also a budget you must defend. Adding more models to the chain, or running at higher resolution, eats into it. When you design G1's perception loop, treat the 8-10 Hz figure as the headroom you are spending, and measure the real rate on your hardware rather than assuming the published numbers — they were measured on a specific configuration that may not match yours.
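Measuring the real rate needs little more than a timer around one full pass of the loop. The harness below is a minimal sketch: `measure_rate_hz` is a hypothetical helper, and `fake_pipeline_step` is a stand-in that simply sleeps, where you would call your actual detect, segment, and lift sequence.

```python
import time

def measure_rate_hz(step, n_frames=50, warmup=5):
    """Call step() repeatedly and return the mean sustained rate in Hz."""
    for _ in range(warmup):          # let caches and clocks settle first
        step()
    start = time.perf_counter()
    for _ in range(n_frames):
        step()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed

def fake_pipeline_step():
    """Placeholder for one full perception pass (hypothetical)."""
    time.sleep(0.01)

rate = measure_rate_hz(fake_pipeline_step, n_frames=20, warmup=2)
```

If your stages run on a GPU, make sure `step` does not return until the results are actually available on the host; otherwise asynchronous execution can make the loop look faster than it is.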

Putting it into practice

Build the language-to-grasp pipeline and verify it on a cluttered scene.

Stand up detection. Run Grounded DINO on an RGB image and prompt it with a free-text object name — try something the model was never explicitly trained on, like "orange pill bottle," to confirm open-vocabulary behavior.
Add precise segmentation. Feed the resulting bounding box into SAM 2 and generate the pixel-perfect mask. Inspect that the mask hugs the object and excludes neighbors.
Lift to 3D. Align the mask with the depth image from your RGB-D camera and compute the object's 3D position from the masked depth pixels.
Hand off to grasping. Pass that 3D location to a grasp planner as the target (connecting forward to Module 6).
Test on clutter. Arrange a cluttered tray — a mock medical cart — and command the pipeline to retrieve one named item. Confirm it isolates the target without selecting adjacent objects.
Measure the real rate. Profile the combined pipeline on your actual compute (Orin AGX if you have it) and compare against the 8-10 Hz expectation. If you fall short, decide what to cut — resolution, an extra model, or detection frequency.
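For the clutter test, "isolates the target" can be turned into a number if you hand-label the target once. The metric below is one possible choice, not a standard from the SAM 2 or Grounded DINO papers: `leak_fraction` is a hypothetical helper that reports how much of the predicted mask lies outside the labeled object.

```python
import numpy as np

def leak_fraction(pred_mask, target_mask):
    """Fraction of predicted-mask pixels that fall outside the hand-labeled target."""
    pred = pred_mask.astype(bool)
    if pred.sum() == 0:
        return 1.0                   # assumption: an empty prediction counts as a failure
    outside = pred & ~target_mask.astype(bool)
    return float(outside.sum() / pred.sum())

# Toy 8x8 scene (invented): the prediction is shifted one column to the right.
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 3:7] = True

leak = leak_fraction(pred, target)
```

A leak near zero means the mask stays on the named item; a large value means the gripper target is partly sitting on a neighbor, and the acceptable threshold is something to set for your own gripper and objects.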

Key takeaways

Foundation models replace closed-vocabulary detectors (trained on a fixed list of classes) with open-vocabulary perception: "find the orange pill bottle" works with zero pill-bottle training, via learned image-text alignment.
Grounded DINO detects objects from natural-language phrases (bounding boxes); SAM 2 segments them into pixel-perfect masks in real time; OWL-ViT is an alternative CLIP-based detector.
The recommended pipeline composes Grounded DINO -> SAM 2 -> depth -> grasp, turning a spoken object name into a 3D manipulation target.
Depth from the RGB-D sensor (Lesson 4.1) lifts the 2D mask into 3D, letting the robot grasp a named item from clutter without disturbing surrounding equipment.
The combined pipeline runs at roughly 8-10 Hz on Orin AGX (Grounded DINO ~15 FPS, SAM 2 ~30 FPS) — fast enough for human-speed manipulation, but a latency budget you must measure and defend on your own hardware.

References

Grounded Language-Image Pre-Training. Li et al. (2022). CVPR 2022.
SAM 2: Segment Anything in Images and Videos. Ravi et al. (2024). Meta AI Research.

