
Diffusion Policy — Visuomotor Control via Denoising

Duration: 60 min · Level: Advanced · Module: 5. Foundation Models & VLA Architecture · Focus: diffusion-policy, visuomotor, bimanual, manipulation

Learning objectives

By the end of this lesson you will be able to explain and apply:

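  • Why a diffusion policy can represent multimodal action distributions where direct regression cannot
  • The trade-off between many-step DDPM sampling and few-step consistency policies
  • The main architectural components: observation encoder, denoising network, and action space
  • The benchmarks and task types where diffusion policies report strong results
  • How ACT (Action Chunking with Transformers, Zhao et al. 2023) relates to Diffusion Policy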

Why this matters

Diffusion Policy (Chi et al., RSS 2023) applies denoising diffusion probabilistic models to robot action generation.

Overview

Unlike direct regression, Diffusion Policy learns a score function (the gradient of the log-density of actions given observations) and generates actions by iterative denoising. This lets it represent multimodal action distributions, which is critical for tasks where multiple correct trajectories exist (e.g., "place the cup somewhere on the table").
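One common way to make "learns a score function" concrete is the noise-prediction objective of standard DDPMs. The notation below is assumed for illustration and is not taken from the lesson: a^0 is a clean action sequence from a demonstration, k is the noise level, \bar\alpha_k is the cumulative noise schedule, o is the observation, and \epsilon_\theta is the denoising network.

```latex
\begin{aligned}
a^{k} &= \sqrt{\bar\alpha_k}\, a^{0} + \sqrt{1-\bar\alpha_k}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I) \\
\mathcal{L}(\theta) &= \mathbb{E}_{a^{0},\, k,\, \epsilon}
  \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(a^{k}, k, o\right) \right\rVert^{2} \right] \\
\nabla_{a^{k}} \log p\!\left(a^{k} \mid o\right) &\approx
  -\frac{\epsilon_\theta\!\left(a^{k}, k, o\right)}{\sqrt{1-\bar\alpha_k}}
\end{aligned}
```

The last line is why a noise predictor can be read as a score model: up to a known scale factor, the predicted noise points away from regions of high action density, so subtracting it moves a noisy sample toward a plausible action.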

Key concepts

Key idea

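Key insight: direct regression predicts a single mean action, whereas a diffusion policy represents a full distribution over actions and samples from it. This is crucial for contact-rich tasks with multiple valid solutions, where averaging across solutions can yield an action that matches none of them.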
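The averaging problem can be seen in a few lines of NumPy. This is a minimal sketch on a hypothetical 1-D task (pass an obstacle on the left or on the right); the task, the noise level, and the use of resampled demonstrations as a stand-in for a distributional policy are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D task: demonstrations pass an obstacle on the left (-1)
# or on the right (+1). Both are valid; the midpoint 0 would hit the obstacle.
modes = rng.choice([-1.0, 1.0], size=1000)
demos = modes + 0.05 * rng.standard_normal(1000)

# The constant that minimizes mean squared error is the sample mean, so an
# MSE-trained regressor that cannot tell the modes apart outputs roughly 0.
mse_action = demos.mean()

# Stand-in for a distributional policy: resample the demonstrations.
# Each draw lands near one of the two valid modes.
sampled_actions = rng.choice(demos, size=5)

print(f"MSE-optimal action: {mse_action:+.3f}")
print("sampled actions:", np.round(sampled_actions, 2))
```

The regression output sits between the two modes, while every sampled action stays close to one of them. A trained diffusion policy would replace the resampling step with learned denoising conditioned on the observation.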

  • Two variants: DDPM-based (slower, on the order of 100 denoising steps) and consistency policy, a later distilled variant (faster, 1-4 denoising steps); for real-time control the few-step consistency policy is preferred
  • Architecture: a ResNet or ViT encoder turns RGB observations into conditioning features; the denoising network is a U-Net or a transformer; the action space is joint angles or end-effector deltas
  • Benchmark results: reported as state-of-the-art on RoboMimic, RLBench, and BridgeV2; particularly strong on bimanual tasks and contact-rich manipulation
  • ACT (Action Chunking with Transformers, Zhao et al. 2023): a related approach that predicts a chunk of future actions at once instead of one action per step; used in the Stanford ALOHA bimanual robot system
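  • ALOHA 2 (Google DeepMind with Stanford collaborators, 2024): an updated version of the 14-DOF ALOHA bimanual hardware; on the original ALOHA, ACT learned fine-manipulation tasks from about 50 demonstrations each, with reported success rates of 80-90% (Zhao et al. 2023)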
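The DDPM-based variant and the idea of predicting a chunk of actions can be sketched together. The code below is a toy illustration under stated assumptions, not the authors' implementation: the linear noise schedule and its endpoints are arbitrary, `toy_noise_predictor` is a hypothetical stand-in for the trained U-Net or transformer, and the 2-D `obs` vector stands in for encoded image features.

```python
import numpy as np

def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.1):
    """Linear beta schedule (an assumption; other schedules are common)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_noise_predictor(noisy_actions, k, obs, alpha_bars):
    """Hypothetical stand-in for the trained denoising network.

    It assumes the clean chunk is `obs` repeated over the horizon and returns
    the noise that would explain `noisy_actions` under that assumption.
    """
    target = np.broadcast_to(obs, noisy_actions.shape)
    scale = np.sqrt(1.0 - alpha_bars[k])
    return (noisy_actions - np.sqrt(alpha_bars[k]) * target) / scale

def sample_action_chunk(obs, horizon, action_dim, rng, num_steps=100):
    """DDPM ancestral sampling: start from Gaussian noise, denoise step by step."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    actions = rng.standard_normal((horizon, action_dim))
    for k in reversed(range(num_steps)):
        eps = toy_noise_predictor(actions, k, obs, alpha_bars)
        # Mean of the previous, less noisy sample given the predicted noise.
        coef = betas[k] / np.sqrt(1.0 - alpha_bars[k])
        mean = (actions - coef * eps) / np.sqrt(alphas[k])
        # Fresh noise is added on every step except the final one.
        noise = rng.standard_normal(actions.shape) if k > 0 else 0.0
        actions = mean + np.sqrt(betas[k]) * noise
    return actions

rng = np.random.default_rng(0)
obs = np.array([0.3, -0.2])   # hypothetical stand-in for encoded observations
chunk = sample_action_chunk(obs, horizon=8, action_dim=2, rng=rng)
to_execute = chunk[:4]        # run the first few actions, then re-plan
print(chunk.round(3))
```

Each call runs the denoiser once per step, which is why a 100-step sampler is costly at control rates and why few-step samplers matter for real-time use. Executing only the first part of the chunk before sampling again is one way to balance smooth motion against responsiveness; the split of 4 out of 8 here is arbitrary.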

Check your understanding

Try to recall each answer before expanding it.

Q1. What is the key insight behind Diffusion Policy?

Direct regression predicts a single mean action, whereas a diffusion policy represents a distribution over actions. This is crucial for contact-rich tasks with multiple valid solutions.

Q2. What are the two variants, and which is preferred for real-time use?

DDPM-based (slower, on the order of 100 denoising steps) and consistency policy (faster, 1-4 denoising steps). For real-time use, consistency policy is preferred.

Q3. What are the main components of the architecture?

A ResNet or ViT encoder conditions the policy on RGB observations; the denoising network is a U-Net or a transformer; the action space is joint angles or end-effector deltas.

Q4. On which benchmarks and task types are strong results reported?

RoboMimic, RLBench, and BridgeV2; results are particularly strong on bimanual tasks and contact-rich manipulation.

Q5. What is ACT (Action Chunking with Transformers, Zhao et al. 2023), and where is it used?

A related approach that predicts chunks of actions; it is used in the Stanford ALOHA bimanual robot system.

References

  • Diffusion Policy: Visuomotor Policy Learning via Action Diffusion — Chi et al. (2023). RSS 2023
  • Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware — Zhao et al. (2023). RSS 2023
