NVIDIA Jetson AGX Orin: The Humanoid Brain

Duration: 50 min · Level: Advanced · Module: 9. Edge AI & On-Board Intelligence · Focus: Orin, compute, NVIDIA, hardware

A humanoid that thinks in the cloud stops thinking the moment its Wi-Fi drops — and a robot that freezes mid-step because a packet was late is not a robot you can put near a person. G1 has to carry its own brain. This lesson is about the single most consequential piece of that brain: the on-board compute module, and why the NVIDIA Jetson AGX Orin has become the default answer for builders who need real foundation-model inference on a battery. By the end you will be able to read the Orin spec sheet the way an engineer reads it — as a budget you spend, not a number you brag about.

Why on-board, and why Orin

The case for on-board compute is latency, reliability, and privacy all at once. A locomotion controller correcting a stumble cannot tolerate a round-trip to a datacenter; a robot in a hospital cannot stream video of patients off-site; and a machine that loses connectivity must keep its balance, not crash. So the question is not whether to compute on-board but what will fit inside the compute and power envelope of a mobile humanoid.

That envelope is brutal. Everything you run shares a battery with the actuators, and every watt of compute is a watt not spent walking. The reason the AGX Orin (2022) dominates is that it is, in practice, the only currently available platform that can run vision-language-action (VLA) inference inside a robot that carries its own battery. Its headline numbers are 275 TOPS of INT8 performance and 135 TOPS at FP16, with 64GB of LPDDR5 memory and NVMe storage, inside a thermal envelope that peaks around 100W. The combination, not any single figure, is what makes it the practical choice.

Reading the spec sheet like an engineer

Keep these on a card, because you will size your whole stack against them.

Compute: a 12-core Arm Cortex-A78AE CPU paired with a 2048-core Ampere GPU carrying 64 Tensor Cores. The CPU handles the operating system, ROS nodes, and your fast reactive control loop; the GPU and Tensor Cores carry the neural-network inference.
Memory: 64GB of unified LPDDR5. "Unified" matters — CPU and GPU share it, so a large model does not have to be copied across a bus.
Memory bandwidth: 204 GB/s. This is the number most newcomers underweight. For large-model inference, bandwidth, not TOPS, is usually the bottleneck. The GPU can multiply faster than it can fetch weights, so it stalls waiting for memory. The practical consequence is one of the most important insights in this module: quantization helps as much by cutting memory traffic as by saving space. Smaller weights mean fewer bytes to move per inference step, and that reduction in traffic often buys more real speed than the arithmetic savings alone.
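
The bandwidth argument can be turned into a quick back-of-envelope estimate. The sketch below assumes the simplest possible model of token generation: every generated token reads every weight once, and nothing else (KV cache, activations) competes for memory. Under that assumption, the ceiling on tokens per second is just bandwidth divided by the size of the weights. The function names are illustrative, not part of any library.

```python
# Rough upper bound on decode speed when weight fetches dominate.
# Assumption: each generated token reads every weight exactly once,
# and no other traffic competes for memory bandwidth.

ORIN_BANDWIDTH_GBPS = 204.0  # memory bandwidth quoted in this lesson


def weight_bytes_gb(params_billion: float, bits_per_weight: int) -> float:
    """Size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def bandwidth_ceiling_tokens_per_s(params_billion: float,
                                   bits_per_weight: int,
                                   bandwidth_gbps: float = ORIN_BANDWIDTH_GBPS) -> float:
    """Tokens/second if memory bandwidth were the only limit."""
    return bandwidth_gbps / weight_bytes_gb(params_billion, bits_per_weight)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_bytes_gb(7, bits)
        ceiling = bandwidth_ceiling_tokens_per_s(7, bits)
        print(f"7B at {bits:>2}-bit: {size:5.1f} GB of weights, "
              f"ceiling about {ceiling:5.1f} tokens/s")
```

Halving the bits per weight doubles the ceiling, which is the point of the paragraph above. Treat the result as an optimistic bound only: measured throughput sits well below it once dequantization, cache traffic, and kernel overheads are counted.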

A useful note on the TOPS figure itself: the 275 TOPS rating assumes INT8 with sparsity. Dense networks see roughly half. Treat 275 as a ceiling under favorable conditions, not a guarantee for your workload.

Power modes: the tradeoff you cannot escape

Orin lets you pick a power budget, and that choice directly bounds your throughput. The configurable range runs from 15W in maximum-efficiency mode to 60W in maximum-performance mode, with the thermal design targeting 100W peak. This is the lever you will pull constantly: more watts buy more inference, but they drain the battery and generate heat that must go somewhere.
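
To see what the power-mode lever costs in battery terms, a simple energy calculation is enough. The sketch below assumes constant power draw from a single battery and ignores conversion losses. The battery size and actuator draw are hypothetical placeholders, not G1 specifications; only the 15W and 60W endpoints come from the text above.

```python
def runtime_hours(battery_wh: float, actuator_w: float, compute_w: float) -> float:
    """Hours of operation if both loads draw constant power from one battery."""
    return battery_wh / (actuator_w + compute_w)


def compute_share(actuator_w: float, compute_w: float) -> float:
    """Fraction of the total power draw that goes to compute."""
    return compute_w / (actuator_w + compute_w)


if __name__ == "__main__":
    BATTERY_WH = 800.0  # hypothetical pack size, not a G1 specification
    ACTUATOR_W = 300.0  # hypothetical average locomotion draw
    for mode_w in (15.0, 60.0):  # the two ends of the range quoted above
        hours = runtime_hours(BATTERY_WH, ACTUATOR_W, mode_w)
        share = compute_share(ACTUATOR_W, mode_w)
        print(f"{mode_w:4.0f} W mode: {hours:.2f} h runtime, "
              f"compute is {share:.0%} of total draw")
```

Substitute measured values for the placeholders before drawing conclusions; the shape of the tradeoff depends entirely on how large the actuator draw is relative to compute.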

Heat is the hidden constraint. Sustained inference on a single module can drive it into thermal throttling, where the chip silently downclocks to protect itself and your control loop mysteriously slows. The industry's answer is informative: the Figure 02 humanoid uses two Orin NX modules at roughly 10W each, one dedicated to the perception pipeline and one to motor control, rather than a single high-power module. Splitting the workload avoids the thermal wall and gives each function isolation, so a heavy perception frame cannot starve the motor loop. The lesson for G1 is to think about compute as a distributed budget across modules, not a single number to maximize.
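
Because throttling is silent, it helps to log temperatures alongside your control-loop timing. The sketch below reads the standard Linux sysfs thermal interface, where each `thermal_zone*` directory exposes a `type` name and a `temp` value in millidegrees Celsius. The zone names on a given Jetson release, and the warning threshold used here, are assumptions you should check on your own module.

```python
from pathlib import Path
from typing import Dict, List


def read_thermal_zones(base: str = "/sys/class/thermal") -> Dict[str, float]:
    """Return {"<zone dir>:<zone type>": degrees Celsius} for each readable zone."""
    readings: Dict[str, float] = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone missing, unreadable, or not reporting a number
        readings[f"{zone.name}:{zone_type}"] = millidegrees / 1000.0
    return readings


def zones_at_or_above(readings: Dict[str, float], limit_c: float) -> List[str]:
    """Names of zones whose temperature has reached the given limit."""
    return sorted(name for name, temp in readings.items() if temp >= limit_c)


if __name__ == "__main__":
    WARN_AT_C = 85.0  # hypothetical threshold; choose one below your real trip point
    temps = read_thermal_zones()
    for name, temp in temps.items():
        print(f"{name}: {temp:.1f} C")
    hot = zones_at_or_above(temps, WARN_AT_C)
    if hot:
        print("Approaching thermal limits:", ", ".join(hot))
```

Polling this once a second and recording it next to loop latency makes a throttling event visible as a correlated rise in temperature and drop in rate, instead of a mystery slowdown.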

What actually runs: realistic throughput

Specs are abstract until you attach models to them. On an AGX Orin you can expect, in round terms:

LLaMA-7B at INT4: around 5 tokens/second — usable for slow language reasoning, not for snappy dialogue.
OpenVLA 7B, quantized: about 1.5 Hz — meaning roughly one action decision every two-thirds of a second. This is the headline problem the next lesson exists to solve.
DepthAnything: 30 FPS — comfortably real-time depth perception.
Grounded DINO: 15 FPS — open-vocabulary object detection at interactive rates.

Notice the split: perception models (DepthAnything, Grounded DINO) run fast, while the big VLA policy crawls at 1.5 Hz. That gap defines your architecture. You will pair a slow, deliberate VLA "thinking" loop with a fast reactive controller on the CPU — and you will compress the VLA hard to close the gap.

That split maps onto the Orin's resources directly: the GPU carries the slow neural workloads, and the CPU carries the fast reactive loop.
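
A small simulation makes the two-rate structure concrete. The sketch below steps a fast control loop in fixed ticks and lets a slower policy refresh its goal whenever its period has elapsed; between refreshes the fast loop keeps acting on the most recent goal. The 100 Hz controller rate is a hypothetical value chosen for illustration; only the 1.5 Hz policy rate comes from this lesson. It is a timing model, not a controller implementation.

```python
from typing import Dict


def simulate_dual_rate(duration_s: float, fast_hz: float, policy_hz: float) -> Dict[str, int]:
    """Count fast-loop ticks, policy refreshes, and the longest run on one goal."""
    total_ticks = int(round(duration_s * fast_hz))
    policy_period = 1.0 / policy_hz
    next_policy_time = 0.0
    policy_updates = 0
    ticks_on_current_goal = 0
    longest_run = 0
    for tick in range(total_ticks):
        now = tick / fast_hz
        if now >= next_policy_time:
            # A fresh goal arrives from the slow "thinking" loop.
            policy_updates += 1
            next_policy_time += policy_period
            ticks_on_current_goal = 0
        # The fast loop runs every tick, using whatever goal it last received.
        ticks_on_current_goal += 1
        longest_run = max(longest_run, ticks_on_current_goal)
    return {
        "fast_ticks": total_ticks,
        "policy_updates": policy_updates,
        "longest_run_on_one_goal": longest_run,
    }


if __name__ == "__main__":
    stats = simulate_dual_rate(duration_s=2.0, fast_hz=100.0, policy_hz=1.5)
    print(stats)
```

With these rates the fast loop executes dozens of corrections for every decision the policy makes, which is why the reactive controller, not the VLA, has to own balance and safety.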

The software that makes it real: Isaac ROS

Raw silicon is not enough; you need accelerated software to reach these numbers. NVIDIA Isaac ROS provides hardware-accelerated ROS 2 packages tuned for Orin — SLAM, perception, and neural-network inference implemented as CUDA-accelerated nodes that run roughly 2–3× faster than their CPU equivalents. The practical guidance is direct: do not hand-write CUDA for problems Isaac ROS already solves. Use its accelerated nodes for the standard perception and SLAM pipeline, and spend your scarce engineering effort on the parts that genuinely differentiate G1 — the VLA integration and the safety architecture.

Putting it into practice

Build a one-page compute budget for G1 before you write any inference code.

List every continuous workload the robot runs at once: VLA policy, depth, object detection, SLAM, and the low-level controller. Note the target rate for each (e.g., VLA ≥ 4 Hz, depth ~30 FPS).
Assign each to a resource: GPU/Tensor Cores for neural inference, CPU for the reactive loop and ROS orchestration.
Estimate the memory footprint. A 7B model at FP16 needs ~14GB; check that your concurrent models fit under 64GB with headroom for buffers.
Pick a power mode and ask honestly whether your workloads fit at 60W, or whether — like Figure 02 — you should split across two modules to dodge thermal throttling.
Identify your bottleneck. If a model is slow, decide whether it is TOPS-bound or bandwidth-bound (204 GB/s); if bandwidth-bound, quantization is your next move — which is exactly Lesson 9.2.
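
The steps above can be captured as a small table in code, so the budget gets rechecked every time a model changes. The sketch below estimates weight memory as parameters times bits divided by eight, which reproduces the ~14GB figure for a 7B model at FP16, and compares the total against 64GB minus a headroom fraction. The 20% headroom and the per-workload footprints other than the VLA are hypothetical placeholders to be replaced with measurements.

```python
ORIN_MEMORY_GB = 64.0  # unified memory quoted in this lesson


def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weight footprint in GB (weights only, no activations or buffers)."""
    return params_billion * bits_per_weight / 8.0


def check_budget(workloads, capacity_gb: float = ORIN_MEMORY_GB,
                 headroom_fraction: float = 0.2):
    """Return (total_gb, limit_gb, fits) for a list of workload dicts."""
    total = sum(w["memory_gb"] for w in workloads)
    limit = capacity_gb * (1.0 - headroom_fraction)
    return total, limit, total <= limit


if __name__ == "__main__":
    workloads = [
        {"name": "VLA policy (7B, FP16)", "resource": "GPU", "target": ">= 4 Hz",
         "memory_gb": model_memory_gb(7, 16)},
        # Footprints below are placeholders, not measurements.
        {"name": "Depth", "resource": "GPU", "target": "~30 FPS", "memory_gb": 1.0},
        {"name": "Object detection", "resource": "GPU", "target": "~15 FPS", "memory_gb": 2.0},
        {"name": "SLAM", "resource": "GPU", "target": "set by you", "memory_gb": 2.0},
        {"name": "Low-level controller", "resource": "CPU", "target": "set by you", "memory_gb": 0.5},
    ]
    for w in workloads:
        print(f"{w['name']:<24} {w['resource']:<4} {w['target']:<12} {w['memory_gb']:5.1f} GB")
    total, limit, fits = check_budget(workloads)
    print(f"Total {total:.1f} GB against a limit of {limit:.1f} GB: "
          f"{'fits' if fits else 'over budget'}")
```

Keeping the budget as data rather than a slide means a change from FP16 to INT4, or the addition of a second model, shows up immediately as a changed total.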

Key takeaways

On-board compute is non-negotiable for a humanoid: latency, reliability, and privacy all forbid depending on the cloud.
The AGX Orin (2022) is the de facto standard because its combination of 275 TOPS INT8, 64GB LPDDR5, and ~100W peak makes it the only practical platform for VLA inference on a battery.
Memory bandwidth (204 GB/s), not TOPS, usually limits large-model inference; quantization wins largely by shrinking the bytes you move.
Power mode (15W–60W) directly trades battery and heat against throughput; sustained load risks thermal throttling, so consider splitting work across modules as Figure 02 does.
Realistic throughput is uneven: perception runs at 15–30 FPS while a 7B VLA crawls at ~1.5 Hz — closing that gap is the job of model compression.
Use Isaac ROS's CUDA-accelerated nodes (2–3× faster than CPU) for standard perception and SLAM, and reserve custom effort for G1's differentiating logic.

Next: 9.2 Model Compression for Edge Deployment →

Part of Module 9: Edge AI & On-Board Intelligence.
