Reinforcement Learning for Locomotion

Duration: 70 min · Level: Advanced · Module: 3. Bipedal Locomotion & Whole-Body Control · Focus: RL, locomotion, simulation, training

There was a moment, around 2019, when the field quietly changed direction. For two decades, better walking had meant better models and better solvers. Then learned policies, trained in simulation and deployed to real hardware, started beating hand-engineered model-based controllers on the metrics that matter most: robustness, terrain adaptation, energy efficiency, and top speed. This lesson is about that shift and how to ride it. We will cover why simulation training works at all, the handful of techniques (domain randomization, curriculum learning, reference-motion initialization) that make it work, and how to read a learning curve so you know whether your policy is actually improving.

Why train a controller instead of designing one

A model-based controller is only as good as its model. Friction is uncertain, masses shift when the robot carries something, terrain is never quite what the map says. Reinforcement learning sidesteps the modeling burden: instead of deriving the control law, you specify what good behavior is worth — a reward — and let the policy discover the control law through millions of simulated trials.

The headline result is ETH Zurich's ANYmal (2022): an RL locomotion policy trained entirely in simulation and deployed zero-shot — no real-world fine-tuning — onto the physical robot, where it traversed rubble, mud, and stairs while outperforming MPC on every measured metric. "Zero-shot" is the part to dwell on. A policy that has never touched reality, transferred directly, beat a controller hand-built for that reality. That is only possible because of how the simulation was set up.

The three techniques that make sim-to-real work

A naive policy trained in one perfect simulator will overfit to that simulator's exact physics and fall over on the real robot. Three techniques close the gap.

Domain randomization. During training, randomize the things you cannot know precisely — mass, friction, terrain, sensor noise, actuator delays. The policy can no longer rely on any single value, so it learns a strategy robust across the whole distribution. The real robot is then just one more sample from a distribution the policy already handles. This is the single most important reason ANYmal-style zero-shot transfer works.
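As a minimal sketch of the idea, the snippet below draws a fresh set of physics parameters at the start of every episode. The parameter names, the ranges, and the `sample_physics` helper are illustrative assumptions, not values from any published training setup; in a real pipeline the sampled values would be written into the simulator before each reset.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    payload_mass_kg: float    # extra mass attached to the base
    friction_coeff: float     # ground friction coefficient
    action_delay_steps: int   # actuator latency, in control steps
    imu_noise_std: float      # std-dev of noise added to IMU readings


# Hypothetical ranges, chosen only for illustration.
RANGES = {
    "payload_mass_kg": (-1.0, 3.0),
    "friction_coeff": (0.3, 1.25),
    "action_delay_steps": (0, 3),
    "imu_noise_std": (0.0, 0.05),
}


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Draw one set of physics parameters for a new episode."""
    return PhysicsParams(
        payload_mass_kg=rng.uniform(*RANGES["payload_mass_kg"]),
        friction_coeff=rng.uniform(*RANGES["friction_coeff"]),
        action_delay_steps=rng.randint(*RANGES["action_delay_steps"]),
        imu_noise_std=rng.uniform(*RANGES["imu_noise_std"]),
    )


# Every episode sees a different "world"; the policy cannot rely on one value.
rng = random.Random(0)
episodes = [sample_physics(rng) for _ in range(1000)]
```

Uniform sampling is the simplest choice here. How wide to make each range is a tuning decision: too narrow and the real robot may fall outside the distribution, too wide and the policy can become overly conservative.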

Curriculum learning. Throw hard terrain at a fresh policy and it never gets a useful reward, never learns, and collapses. The fix is to start on flat terrain and gradually increase difficulty as the policy improves — gentle slopes, then steps, then rubble. The curriculum keeps the reward signal reachable at every stage so learning never stalls.
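One simple way to implement such a curriculum, sketched here under assumed thresholds, is to promote a robot to harder terrain when it travels far enough in an episode and demote it when it barely moves. The function name, the level count, and the distance thresholds are hypothetical.

```python
def update_terrain_level(level, distance_m, max_level=9,
                         promote_at_m=8.0, demote_below_m=2.0):
    """Return the terrain difficulty level to use for the next episode.

    level       current difficulty (0 = flat ground)
    distance_m  distance the robot covered in the episode just finished
    """
    if distance_m >= promote_at_m:
        return min(level + 1, max_level)   # doing well: harder terrain
    if distance_m < demote_below_m:
        return max(level - 1, 0)           # struggling: easier terrain
    return level                           # otherwise stay put


# A fresh policy on flat ground that walks 10 m moves up one level.
next_level = update_terrain_level(level=0, distance_m=10.0)
```

Because difficulty tracks performance, each robot tends to sit at the hardest level where it still earns reward, which is the "reachable signal" property described above.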

Reference-motion initialization. A humanoid has 40+ action dimensions, which creates a brutal exploration problem: random flailing almost never stumbles onto a walking gait. Seeding training with reference motion-capture data — human walking, say — gives the policy a sane starting region to explore around and dramatically accelerates learning. This is the core mechanism behind expressive, natural-looking humanoid behavior (Cheng et al., Expressive Whole-Body Control for Humanoid Robots, 2024): imitate human motion to bootstrap, then let RL refine it into something physically realizable on the robot.
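The lesson does not pin down exactly how the reference data enters training, so the sketch below shows two common options as assumptions: starting each episode from a randomly chosen frame of the clip, and adding a reward term for staying close to the reference pose. The toy clip, the function names, and the `scale` value are all made up for illustration.

```python
import math
import random

# Toy stand-in for a motion-capture clip: 4 frames x 3 joint angles (rad).
# Real clips have far more frames and joints; these numbers are invented.
REFERENCE_CLIP = [
    [0.00, 0.10, -0.20],
    [0.15, 0.05, -0.10],
    [0.30, 0.00, 0.00],
    [0.15, 0.05, -0.10],
]


def sample_reference_start(clip, rng):
    """Pick a random frame of the clip as the episode's initial pose."""
    frame = rng.randrange(len(clip))
    return frame, list(clip[frame])


def tracking_reward(joint_angles, reference_angles, scale=5.0):
    """Reward in (0, 1]: equals 1 when the pose matches the reference."""
    sq_error = sum((q - r) ** 2 for q, r in zip(joint_angles, reference_angles))
    return math.exp(-scale * sq_error)


rng = random.Random(0)
frame, start_pose = sample_reference_start(REFERENCE_CLIP, rng)
reward_at_start = tracking_reward(start_pose, REFERENCE_CLIP[frame])
```

Starting from mid-stride poses means the policy experiences every phase of the gait early in training, instead of having to discover the later phases by surviving the earlier ones.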

Reward design is where the behavior actually lives

The reward function is your real interface to the policy's behavior, and it is more delicate than it looks. A workable locomotion reward looks like:

forward velocity + an alive bonus − energy consumption − penalties for exceeding joint torque limits.

Each term earns its place. Forward velocity is the task. The alive bonus rewards staying up, which prevents the degenerate strategy of diving forward for one big velocity spike and then crashing. The energy and torque penalties shape the gait toward something efficient and hardware-safe rather than a violent, motor-frying scramble. The hard truth from practice: reward shaping matters enormously for behavior quality. Two reward functions that look equivalent on paper can produce a graceful walk or a twitching mess. Most of your engineering time on an RL controller is spent here, not on the algorithm.
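The four terms can be sketched as a single function. This is an illustration under assumptions, not a tuned reward: the weights are hypothetical, energy is approximated as the sum of absolute joint power, and the torque penalty only counts the amount by which each joint exceeds its limit.

```python
def locomotion_reward(forward_velocity, joint_torques, joint_velocities,
                      torque_limits, fallen,
                      w_alive=0.5, w_energy=0.001, w_torque=0.01):
    """Forward velocity + alive bonus - energy - torque-limit penalty."""
    # Alive bonus: paid on every step in which the robot is still upright.
    alive = 0.0 if fallen else w_alive

    # Energy proxy: sum of |torque * joint velocity| (mechanical power).
    energy = sum(abs(t * v) for t, v in zip(joint_torques, joint_velocities))

    # Torque penalty: only the excess above each joint's limit counts.
    over_limit = sum(max(0.0, abs(t) - lim)
                     for t, lim in zip(joint_torques, torque_limits))

    return forward_velocity + alive - w_energy * energy - w_torque * over_limit


# Two-joint toy example: the second joint exceeds its 15 N·m limit by 5 N·m.
r = locomotion_reward(
    forward_velocity=1.0,
    joint_torques=[10.0, -20.0],
    joint_velocities=[1.0, 2.0],
    torque_limits=[15.0, 15.0],
    fallen=False,
)
```

Keeping each term as a named local makes it easy to log them separately, which helps when a change in one weight unexpectedly alters the gait.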

The payoff when it is done well is real and measured: Unitree's H1 set a speed record for full-size humanoids at 3.3 m/s (2024) using a policy trained in Isaac Gym, with no human-designed gait pattern at all. The fastest humanoid of its day did not have its gait designed; it had its gait rewarded.

A glimpse past locomotion

The deepest version of this idea reframes control as a sequence-modeling problem. Humanoid Locomotion as Next Token Prediction (Radosavovic et al., 2024) trains a transformer to predict the next state-action token, treating walking the way a language model treats text. You do not need this for a first locomotion policy, but it signals where the field is heading: the same architectures driving foundation models (Module 5) are beginning to absorb low-level control too.

Putting it into practice

This is the lab: in Isaac Lab (on Isaac Sim), stand up a basic locomotion RL training loop for a 12-DOF biped and read its learning curve.

Define the observation space. Give the policy the joint angles, joint velocities, base orientation (from the IMU), and the command (desired forward velocity). This is what the policy "sees" each step.
Define the action space. Use target joint angles — twelve numbers, one per DOF — fed to a low-level PD controller on each joint. Predicting targets rather than raw torques makes learning far easier.
Define the reward. Implement forward velocity − energy, a deliberately minimal subset of the shaping discussed above (no alive bonus or torque-limit penalty yet). Keep it minimal for the first run so you can see the effect of each term you add later.
Parallelize. Launch many simulated robots at once (Isaac Lab's whole purpose) so the policy sees thousands of environment steps per wall-clock second.
Train for 1000 iterations. Run the loop and log episode reward and episode length each iteration.
Analyze the learning curve. Plot reward versus iteration. A healthy curve climbs then plateaus; rising episode length means the robot is staying up longer (the alive behavior is forming). If reward spikes then collapses, suspect a reward-shaping bug or terrain that is too hard too soon — apply the curriculum. Write one sentence on what your curve tells you about whether the policy learned to walk or merely to not fall.
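The steps above can be sketched in plain Python, independent of any simulator API. Everything here is an assumption made for illustration: the quaternion orientation (which gives a 29-dimensional observation), the PD gains, and the `diagnose_curve` heuristic with its thresholds. None of these functions belong to Isaac Lab.

```python
def build_observation(joint_angles, joint_velocities, base_orientation,
                      command_velocity):
    """Concatenate what the policy sees each step into one flat list."""
    return (list(joint_angles) + list(joint_velocities)
            + list(base_orientation) + [command_velocity])


def pd_torques(target_angles, joint_angles, joint_velocities, kp=40.0, kd=1.0):
    """Low-level PD controller: turn target joint angles into torques."""
    return [kp * (qt - q) - kd * qd
            for qt, q, qd in zip(target_angles, joint_angles, joint_velocities)]


def moving_average(values, window):
    """Trailing moving average; early entries average over fewer samples."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def diagnose_curve(rewards, window=50, collapse_fraction=0.5):
    """Rough label for a reward-versus-iteration curve."""
    smoothed = moving_average(rewards, window)
    peak, final = max(smoothed), smoothed[-1]
    if peak > 0 and final < collapse_fraction * peak:
        return "spiked then collapsed"
    if final > smoothed[0]:
        return "improved"
    return "no clear improvement"


# 12 joint angles + 12 joint velocities + quaternion + command = 29 numbers.
obs = build_observation([0.0] * 12, [0.0] * 12, [1.0, 0.0, 0.0, 0.0], 0.5)

# Synthetic 1000-iteration curves standing in for logged episode reward.
healthy = [min(i / 500, 1.0) for i in range(1000)]
collapsed = [min(i / 200, 1.0) for i in range(500)] + [0.1] * 500
labels = (diagnose_curve(healthy), diagnose_curve(collapsed))
```

Smoothing before judging matters because per-iteration reward is noisy. The same check can be run on logged episode length to separate "walks further" from "merely stays up longer".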

Key takeaways

Since ~2019, RL locomotion policies have surpassed MPC on robustness, terrain adaptation, efficiency, and top speed — exemplified by ETH's ANYmal traversing rubble and stairs via zero-shot sim-to-real.
Domain randomization (randomize mass, friction, terrain) is the key to crossing the sim-to-real gap: it forces a policy robust to reality's uncertainty.
Curriculum learning keeps reward reachable by ramping difficulty; reference motion-capture initialization tames the 40+-dimensional exploration problem and accelerates learning.
Reward shaping is where behavior is decided — forward velocity + alive bonus − energy − torque penalties — and it consumes most of your tuning effort.
Unitree's H1 hit a record 3.3 m/s with no hand-designed gait, trained in Isaac Gym; the lab reproduces the core loop in Isaac Lab on a 12-DOF biped.

References

Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning — Rudin et al. (2022). Proceedings of CoRL 2021, PMLR 164
Expressive Whole-Body Control for Humanoid Robots — Cheng et al. (2024). arXiv 2402.16796
Humanoid Locomotion as Next Token Prediction — Radosavovic et al. (2024). arXiv 2402.18844
