Problem: Traditional Robots Can't Generalize
You program a robot arm to pick up a red cup. It works perfectly. Then you ask it to pick up a blue bottle instead—and it fails completely. Traditional robots need separate programs for each task, making general-purpose robotics impossible.
You'll learn:
- How VLA models unify vision, language, and robot control
- The architecture differences between RT-2, OpenVLA, and π0
- When to use discrete tokens vs continuous actions
- Why cross-embodiment training matters
Time: 20 min | Level: Intermediate
What Are VLA Models?
Vision-Language-Action models are multimodal foundation models that combine three capabilities: vision (camera images), language (natural language instructions), and action (robot motor commands). Instead of writing thousands of lines of code for each task, you train one model that understands "pick up the blue bottle" and figures out the motor commands automatically.
Early milestones like Google DeepMind's RT-1 and RT-2 demonstrated that transformers could map visual and linguistic inputs directly to robotic actions at scale, beginning the VLA model paradigm. The key innovation: treating robot actions as just another modality alongside vision and language.
Think of it this way: ChatGPT processes text. Midjourney generates images. VLAs control physical robots in the real world.
Why This Matters Now
Three breakthroughs converged in 2023-2024:
Large Vision-Language Models: VLMs are trained on large multimodal datasets and can perform a variety of tasks such as image understanding, visual question answering, and reasoning. Pre-trained models like PaLM-E and PaliGemma already understand objects, spatial relationships, and physical concepts from internet data.
Cross-Embodiment Datasets: The Open X-Embodiment dataset, a massive collection of 1M trajectories from 20+ robot types pooled from 22 institutions, was used to train models like RT-2-X and OpenVLA to generalize across different robot hardware. One model can now control different robot arms, grippers, and mobile bases.
Scalable Architectures: Transformer-based policies scale like language models. More data and compute directly improve robot performance—something traditional control methods couldn't achieve.
Architecture: How VLAs Work
VLAs share a common high-level architecture with two stages. In the first stage, a pre-trained VLM serves as the perception and reasoning core: it encodes one or more camera images together with a language instruction into a sequence of tokens in a shared latent space. In the second stage, an action decoder turns those tokens into robot actions.
Stage 1: Vision-Language Backbone
The VLM processes:
- Camera images (usually 224x224 or 256x256)
- Text instruction ("pick up the red cup")
- Robot state (joint angles, gripper position)
These get encoded into a shared token space—just like language model embeddings, but multimodal.
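To make the "shared token space" concrete, here is a minimal sketch of how the three input streams could be concatenated into one sequence for the transformer. The encoders are stubbed as random arrays and the patch count, token count, and embedding width are illustrative assumptions, not values from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512  # shared embedding width (illustrative assumption)

# Hypothetical encoder outputs, stubbed as random arrays for illustration.
image_patches = rng.normal(size=(256, d_model))   # 16x16 patches from a 224x224 image
text_tokens   = rng.normal(size=(12, d_model))    # "pick up the red cup", tokenized
robot_state   = rng.normal(size=(1, d_model))     # joint angles + gripper, projected

# Stage 1 output: one multimodal token sequence the transformer attends over.
tokens = np.concatenate([image_patches, text_tokens, robot_state], axis=0)
print(tokens.shape)  # (269, 512)
```

Once everything lives in one sequence, the backbone needs no special machinery per modality; standard transformer attention mixes them.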
Stage 2: Action Decoder
This is where VLA architectures diverge into two approaches:
Discrete Token Output (RT-2, OpenVLA):
Robot actions are encoded as strings of discrete tokens, and the VLA learns to generate these sequences just as a language model generates text.
# Action represented as tokens
action = ["move_x_+5", "move_y_-2", "gripper_open"]
# Model predicts these like text generation
Pros: Simple training; reuses the language model architecture
Cons: Converting continuous trajectories into vocabulary symbols can limit spatial accuracy or temporal resolution
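The discretization trade-off can be sketched in a few lines: quantize each action dimension into uniform bins, emit the bin ids as tokens, and decode back to bin centers. The bin count and action range below are assumptions for illustration, not the exact RT-2 or OpenVLA settings.

```python
import numpy as np

# Quantize each action dimension into 256 uniform bins over an assumed
# range, as discrete-token VLAs do; decode maps ids back to bin centers.
LOW, HIGH, BINS = -1.0, 1.0, 256

def to_tokens(action):
    """Map continuous values in [LOW, HIGH] to integer bin ids."""
    clipped = np.clip(action, LOW, HIGH)
    return np.minimum(((clipped - LOW) / (HIGH - LOW) * BINS).astype(int), BINS - 1)

def from_tokens(tokens):
    """Map bin ids back to bin-center values; quantization error appears here."""
    return LOW + (tokens + 0.5) * (HIGH - LOW) / BINS

action = np.array([0.15, -0.08, 0.22, 0.1, 0.0, 0.3, 0.8])  # x, y, z, rpy, gripper
ids = to_tokens(action)
recovered = from_tokens(ids)
print(np.max(np.abs(recovered - action)))  # worst case: half a bin width
```

The round-trip error is bounded by half a bin width, which is exactly the spatial-accuracy limit the Cons line above refers to.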
Continuous Output (π0, Diffusion):
To achieve accurate dexterity and high-frequency control, some VLAs forgo discrete tokens and directly output continuous actions using diffusion models or flow-matching networks.
# Action as a continuous vector (written as a dict for readability)
action = {"x": 0.15, "y": -0.08, "z": 0.22,
          "roll": 0.1, "pitch": 0.0, "yaw": 0.3,
          "gripper": 0.8}  # predicted at 50 Hz
Pros: Higher precision; faster control loops
Cons: More complex training with diffusion/flow matching
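A minimal sketch of how a flow-matching policy produces a continuous action at inference time: start from Gaussian noise and integrate a learned velocity field with a few Euler steps. The velocity field here is a hand-written stand-in (it just pulls toward a fixed target), not a trained network, so the loop is runnable without a model.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM, STEPS = 7, 10  # x, y, z, roll, pitch, yaw, gripper

def velocity_field(a_t, t):
    """Stand-in for the learned flow-matching network: pulls toward a fixed
    target so the loop runs; real models condition on vision and language."""
    target = np.array([0.15, -0.08, 0.22, 0.1, 0.0, 0.3, 0.8])
    return target - a_t

# Inference: start from noise, integrate the ODE with Euler steps.
a = rng.normal(size=ACTION_DIM)
a0 = a.copy()
for i in range(STEPS):
    t = i / STEPS
    a = a + (1.0 / STEPS) * velocity_field(a, t)
print(a.round(2))  # converging toward the target action
```

Because the output is a real-valued vector rather than a token sequence, there is no quantization floor on precision; the cost is the extra integration machinery noted in the Cons line.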
Key Models You Should Know
RT-2 (Google DeepMind, 2023)
RT-2 was developed by Google DeepMind in mid-2023 and established the vision-language-action paradigm in robotics. It builds on two state-of-the-art VLMs, PaLI-X and PaLM-E, fine-tuning them on real robot demonstration data.
Architecture: PaLM-E VLM + discrete action tokens
Training: Web-scale VLM + 130k robot trajectories
Notable: First to show internet knowledge transfers to robot control
OpenVLA (2024)
OpenVLA is a 7B-parameter, open-source VLA model built on the Prismatic VLM and Llama 2, trained on a larger slice of the Open X-Embodiment data than RT-2, making VLA research far more accessible.
Architecture: Llama 2-7B + Prismatic vision encoder
Training: 800k+ trajectories from Open X-Embodiment
Notable: Despite being smaller than Google DeepMind's RT-2, OpenVLA outperforms it on a suite of manipulation tasks
π0 (Pi-Zero) - Physical Intelligence (2024)
π0 uses PaliGemma as its pre-trained VLM backbone (a SigLIP vision encoder paired with a Gemma language model), plus an action expert trained on robot trajectories from Open X-Embodiment.
Architecture: PaliGemma 3B + 300M flow-matching action expert
Key Innovation: Mixture-of-Transformers design with separate attention parameters for vision/language and actions
Performance: Runs at 50 Hz with action chunks of size 50, predicting a full second of motion at a time
Why π0 differs:
The model adopts a MoE-like architecture in which each expert has its own parameters and the experts interact only through attention; a pre-trained 3B PaliGemma VLM is paired with a fresh set of action-expert parameters. The VLM backbone never attends to future actions during training, which keeps its pre-trained visual understanding intact.
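The "experts interact only through attention" idea comes down to a blockwise attention mask. The sketch below constructs such a mask for a tiny sequence; the sizes, and the exact masking pattern, are assumptions drawn from the description above rather than the published π0 implementation.

```python
import numpy as np

# Blockwise attention mask: VLM prefix tokens attend only to the prefix,
# while action-expert tokens attend to the prefix and to each other.
N_PREFIX, N_ACTION = 4, 3  # tiny sizes for readability
n = N_PREFIX + N_ACTION
mask = np.zeros((n, n), dtype=bool)  # mask[i, j]: may token i attend to token j?

mask[:N_PREFIX, :N_PREFIX] = True   # prefix <-> prefix
mask[N_PREFIX:, :] = True           # action tokens see everything
mask[:N_PREFIX, N_PREFIX:] = False  # prefix never sees action tokens

print(mask.astype(int))
```

The zero block in the top-right rows is the structural guarantee that the VLM's representations are computed exactly as they were in pre-training, untouched by the action stream.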
Training Recipe: The Three-Stage Process
Modern VLAs follow a training recipe inspired by LLM development:
Step 1: VLM Pre-training (Internet Scale)
Train or use existing VLM on billions of image-text pairs. The model learns:
- Object recognition and semantics
- Spatial relationships
- Physical common sense
- Language grounding
Step 2: Cross-Embodiment Pre-training
By unifying vision, language, and action data at scale (modalities that have traditionally been studied separately), VLA models aim to learn policies that generalize across diverse tasks, objects, embodiments, and environments.
Train on multi-robot datasets (Open X-Embodiment). The model learns:
- Different robot morphologies (single arm, dual arm, mobile)
- Task variations across platforms
- General manipulation skills
Step 3: Post-Training (Task-Specific)
Fine-tune on high-quality task demonstrations. The π0 authors argue that post-training on high-quality data matters because lower-quality data contains teleoperation mistakes that we don't want robots to imitate.
Typically needs 50-100 demonstrations for new tasks.
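Step 3 is, at its core, behavior cloning: regress the policy's outputs onto the demonstrated actions. The toy below runs that loop end to end with a linear policy standing in for the VLA and synthetic demonstrations standing in for teleoperation data; all shapes and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Post-training in miniature: behavior cloning on a small demo set.
# A linear policy stands in for the VLA being fine-tuned (an assumption).
OBS_DIM, ACT_DIM, N_DEMOS = 16, 7, 64

true_W = rng.normal(size=(OBS_DIM, ACT_DIM))   # the "expert" the demos encode
obs = rng.normal(size=(N_DEMOS, OBS_DIM))      # observations from teleoperation
actions = obs @ true_W                         # demonstrated actions

W = np.zeros((OBS_DIM, ACT_DIM))               # policy weights to fine-tune
lr = 0.1
for _ in range(1000):                          # gradient descent on MSE loss
    grad = obs.T @ (obs @ W - actions) / N_DEMOS
    W -= lr * grad

print(np.mean((obs @ W - actions) ** 2))       # imitation loss after tuning
```

Real post-training swaps the linear map for the full VLA and the synthetic pairs for the 50-100 curated demonstrations mentioned above, but the objective, matching demonstrated actions under a regression loss, is the same.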
Design Choices That Matter
Single vs Dual Architecture
The single-model design, used by RT-2, OpenVLA, and π0, understands the scene and the language instruction and produces robot actions in a single forward pass, keeping the architecture simple and reducing latency.
The dual-system design, adopted by Helix and GR00T N1, splits the architecture into two components: a slower one that handles image observations and text instructions, and a faster one that produces the robot's actions.
Choose single-model for:
- Lower latency requirements
- Simpler deployment
- Research and prototyping
Choose dual-system for:
- High-frequency control (>50 Hz)
- Complex manipulation
- When perception and control have different update rates
Action Chunking
Action chunking was introduced by Zhao et al. (2023). π0 implements it by running flow matching on multiple action tokens in parallel.
Instead of predicting one action at a time, predict the next 50 actions (1 second at 50 Hz). Benefits:
- Smoother trajectories
- Reduces compounding errors
- Enables parallel inference
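The standard way to use chunks at runtime is receding-horizon execution: predict a full chunk, execute only its first portion, then replan from the new state. The sketch below runs that loop with a toy policy; the chunk size matches the 50-step figure above, while the replan interval and the stand-in policy are assumptions.

```python
import numpy as np

# Receding-horizon execution of action chunks: predict CHUNK actions at
# once, execute the first EXECUTE of them, then replan from the new state.
CHUNK, EXECUTE, ACT_DIM = 50, 25, 7

def predict_chunk(step):
    """Stand-in for the policy: returns CHUNK actions (here, a toy ramp)."""
    t = np.arange(step, step + CHUNK) / 100.0
    return np.tile(t[:, None], (1, ACT_DIM))

executed = []
step = 0
while step < 100:                      # 2 seconds of control at 50 Hz
    chunk = predict_chunk(step)
    executed.extend(chunk[:EXECUTE])   # run only the first half of each chunk
    step += EXECUTE                    # then replan
executed = np.array(executed)
print(executed.shape)  # (100, 7)
```

Replanning before the chunk runs out is what keeps compounding errors in check: each new chunk is conditioned on fresh observations rather than a stale one-second-old prediction.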
When to Use What
Use discrete token VLAs (RT-2, OpenVLA) when:
- You need simple deployment
- Training infrastructure is limited
- Tasks don't require high-frequency control
- You want to reuse language model code
Use continuous VLAs (π0, diffusion-based) when:
- Dexterous manipulation is critical
- You need >10 Hz control frequency
- Spatial precision matters
- You have diffusion training expertise
Real-World Performance
π0 attains by far the best results across the board on all the zero-shot tasks, with near perfect success rates on shirt folding and the easier bussing tasks, and large improvements over all baselines.
Benchmark tasks (zero-shot, no fine-tuning):
- Folding laundry: π0 95%, OpenVLA 30%
- Bussing tables: π0 85%, Octo 45%
- Stacking objects: π0 90%, RT-2 60%
OpenVLA struggles on these tasks because its autoregressive discretization architecture does not support action chunks.
But there's a catch: These numbers come from controlled lab environments. Real-world deployment faces challenges like:
- Lighting variations
- Novel object shapes
- Physical failures (slipping, collisions)
- Long-horizon tasks requiring planning
The Frontier: What's Coming
Recent research from ICLR 2026 and Nature Machine Intelligence shows several trends:
Discrete Diffusion VLAs: An analysis of 164 VLA model submissions at ICLR 2026 identifies trends toward discrete diffusion VLAs, reasoning models, and new benchmarks. Discrete diffusion combines the benefits of tokenization with diffusion's trajectory modeling.
Online Learning: SOP (Scalable Online Post-training) is a framework for updating VLA models online across robot fleets, streaming execution trajectories, reward signals, and human corrections in real time. The goal: robots that improve from deployment experience.
Reasoning VLAs: Integrating chain-of-thought and planning before action execution.
Smaller Models: GR00T N1 and π0 both build on roughly 2B-parameter language models; small models are needed for on-device inference and real-time latency.
Getting Started
For Researchers
OpenVLA is the most accessible starting point:
# OpenVLA is installed from its GitHub repository (not a PyPI package)
git clone https://github.com/openvla/openvla.git
cd openvla && pip install -e .
# Pre-trained checkpoints and full training code are linked from the repo
For custom robots: Use the Open X-Embodiment dataset format and fine-tune on your platform.
For Production
Consider:
- Latency requirements (edge vs cloud)
- Safety constraints (human-in-the-loop?)
- Data collection strategy
- Continuous learning infrastructure
Start small: Fine-tune a pre-trained model on 50-100 demonstrations of your task before scaling.
What You Learned
- VLAs extend VLMs with action outputs, yielding vision-language-action models for robotic tasks and motion planning
- Discrete tokens (RT-2, OpenVLA) are simpler but less precise than continuous outputs (π0)
- Cross-embodiment training enables one model to control different robot types
- Modern VLAs follow a three-stage recipe: VLM pre-training, cross-embodiment training, task-specific fine-tuning
- Action chunking and proper architecture choices matter more than model size
Limitations: VLAs still struggle with long-horizon planning, recovery from failures, and truly open-world scenarios. They're powerful but not yet general intelligence.
Research cited: Nature Machine Intelligence 2026, ICLR 2026 submissions, Physical Intelligence π0 paper (2024), Google DeepMind RT-2 (2023), OpenVLA (2024). Benchmarks: CALVIN, LIBERO, SIMPLER.