OpenVLA — The Open-Source VLA Ecosystem

Duration: 60 min · Level: Advanced · Module: 5. Foundation Models & VLA Architecture · Focus: OpenVLA, open-source, fine-tuning, VLA

Learning objectives

By the end of this lesson you will be able to explain and apply:

  • The OpenVLA architecture
  • Its benchmark performance and inference speed
  • LoRA fine-tuning for task-specific adaptation
  • Action tokenization
  • OpenVLA-OFT (2025 follow-up)

Why this matters

OpenVLA (June 2024, Stanford + Berkeley) is the first fully open-source VLA that matches RT-2-X performance on standard benchmarks.

Overview

OpenVLA's release democratized VLA research: any lab can fine-tune a state-of-the-art general-purpose robot policy on its own hardware.

Key concepts

Key idea

OpenVLA architecture: 7.5B parameters, built on the Prismatic VLM (fused DINOv2 + SigLIP vision encoders feeding a Llama 2 language model); actions are emitted as discrete tokens from the language model's vocabulary

  • Performance: matches RT-2-X on BridgeData V2 evaluations and outperforms it on several manipulation tasks; runs at about 6 Hz on a single RTX 4090 GPU and about 1.5 Hz on an NVIDIA Jetson AGX Orin
  • Fine-tuning: LoRA fine-tuning on 200-500 demonstrations converges in 2-4 hours on a single GPU, which makes task-specific adaptation practical for small labs
  • Action tokenization: each continuous action dimension is discretized into 256 bins and each bin is encoded as a language token, so the policy can be trained with existing LLM training infrastructure
  • OpenVLA-OFT (2025 follow-up): parallel decoding plus action chunking raises the control rate to 25+ Hz, which enables contact-rich, high-frequency manipulation
  • Available at: github.com/openvla/openvla, with pre-trained checkpoints on Hugging Face; MIT license
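
The action tokenization idea above can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not OpenVLA's actual code: the function names are hypothetical, and the choice of 1st/99th percentile clipping bounds is an assumption of this sketch. Only the figure of 256 bins per dimension comes from the lesson.

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as stated in the lesson


def fit_bounds(actions, lo_pct=1.0, hi_pct=99.0):
    """Per-dimension clipping bounds from training actions.

    The percentile choice is an assumption of this sketch.
    """
    low = np.percentile(actions, lo_pct, axis=0)
    high = np.percentile(actions, hi_pct, axis=0)
    return low, high


def tokenize(action, low, high):
    """Map a continuous action vector to integer bin ids in [0, N_BINS - 1]."""
    clipped = np.clip(action, low, high)
    scaled = (clipped - low) / (high - low)  # each entry now lies in [0, 1]
    bins = (scaled * N_BINS).astype(np.int64)
    return np.minimum(bins, N_BINS - 1)  # the upper bound falls in the last bin


def detokenize(bins, low, high):
    """Map bin ids back to the centre of each bin."""
    return low + (bins + 0.5) / N_BINS * (high - low)


# Stand-in data: 1000 random 7-dimensional actions (hypothetical, not robot data).
rng = np.random.default_rng(0)
demo_actions = rng.normal(size=(1000, 7))
low, high = fit_bounds(demo_actions)
tokens = tokenize(demo_actions[0], low, high)
recovered = detokenize(tokens, low, high)
```

In a full VLA each bin id would then be mapped to an id in the language model's vocabulary, which is what lets the policy reuse the next-token training objective. The round trip loses at most half a bin width per dimension for actions inside the clipping bounds.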

Check your understanding

Try to recall each answer before expanding it.

Q1. What is OpenVLA's architecture?

A 7.5B-parameter model built on the Prismatic VLM: fused DINOv2 + SigLIP vision encoders feed a Llama 2 language model, and actions are emitted as discrete tokens from the language model's vocabulary.

Q2. How does OpenVLA perform, and how fast does it run?

It matches RT-2-X on BridgeData V2 evaluations and outperforms it on several manipulation tasks. Inference runs at about 6 Hz on a single RTX 4090 GPU and about 1.5 Hz on an NVIDIA Jetson AGX Orin.

Q3. What does fine-tuning OpenVLA require?

LoRA fine-tuning on 200-500 demonstrations converges in 2-4 hours on a single GPU, which makes task-specific adaptation practical for small labs.
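
To see why LoRA keeps fine-tuning within reach of a single GPU, here is a minimal NumPy sketch of one LoRA-adapted linear layer. It is a generic illustration of the technique, not OpenVLA's fine-tuning code: the class name, the rank of 32, and the layer size are all assumptions chosen for the example.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, d_in, d_out, rank=32, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))                     # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the low-rank correction; with B = 0 this equals x @ W.T.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

    def frozen_params(self):
        return self.W.size


# Hypothetical layer size, chosen only to keep the example light.
layer = LoRALinear(d_in=1024, d_out=1024, rank=32)
fraction = layer.trainable_params() / layer.frozen_params()  # 65536 / 1048576 = 0.0625
```

Only A and B receive gradients, so optimizer state and gradient memory scale with the adapter size rather than with the full weight matrix. Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly.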

Q4. How does OpenVLA tokenize actions?

Each continuous action dimension is discretized into 256 bins and each bin is encoded as a language token, so the policy can be trained with existing LLM training infrastructure.

Q5. What does OpenVLA-OFT (2025 follow-up) change?

Parallel decoding plus action chunking raises the control rate to 25+ Hz, which enables contact-rich, high-frequency manipulation.
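
The effect of parallel decoding and action chunking on control rate can be illustrated with simple arithmetic. The sketch below is an idealised model, not a benchmark: the function name, the 7-dimensional action, the chunk size of 8, and the per-pass times are all hypothetical, and it ignores image encoding and other fixed costs.

```python
def control_rate_hz(seconds_per_pass, passes_per_chunk, actions_per_chunk):
    """Actions emitted per second when one chunk costs several sequential passes."""
    return actions_per_chunk / (seconds_per_pass * passes_per_chunk)


ACTION_DIM = 7  # hypothetical: 6-DoF end-effector delta plus gripper
CHUNK_SIZE = 8  # hypothetical number of future actions predicted at once

# Autoregressive decoding: one sequential pass per action token, one action per chunk.
autoregressive_hz = control_rate_hz(
    seconds_per_pass=0.024, passes_per_chunk=ACTION_DIM, actions_per_chunk=1
)

# Parallel decoding with chunking: a single (here assumed slower) pass emits a whole chunk.
chunked_hz = control_rate_hz(
    seconds_per_pass=0.1, passes_per_chunk=1, actions_per_chunk=CHUNK_SIZE
)

speedup = chunked_hz / autoregressive_hz
```

The gain comes from two places: the number of sequential passes per action drops from one per action dimension to one per chunk, and each pass yields several actions instead of one.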

References

  • OpenVLA: An Open-Source Vision-Language-Action Model — Kim et al. (2024). arXiv 2406.09246
