VLA Models Explained: How Vision-Language-Action Models Control Robots

Understand how Vision-Language-Action models combine computer vision, natural language, and robot control to create general-purpose robots.

Problem: Traditional Robots Can't Generalize

You program a robot arm to pick up a red cup. It works perfectly. Then you ask it to pick up a blue bottle instead—and it fails completely. Traditional robots need separate programs for each task, making general-purpose robotics impossible.

You'll learn:

  • How VLA models unify vision, language, and robot control
  • The architecture differences between RT-2, OpenVLA, and π0
  • When to use discrete tokens vs continuous actions
  • Why cross-embodiment training matters

Time: 20 min | Level: Intermediate


What Are VLA Models?

Vision-Language-Action models are multimodal foundation models that combine three capabilities: vision (camera images), language (natural language instructions), and action (robot motor commands). Instead of writing thousands of lines of code for each task, you train one model that understands "pick up the blue bottle" and figures out the motor commands automatically.

Early milestones like Google DeepMind's RT-1 and RT-2 demonstrated that transformers could map visual and linguistic inputs directly to robotic actions at scale, beginning the VLA model paradigm. The key innovation: treating robot actions as just another modality alongside vision and language.

Think of it this way: ChatGPT processes text. Midjourney generates images. VLAs control physical robots in the real world.


Why This Matters Now

Three breakthroughs converged in 2023-2024:

Large Vision-Language Models: VLMs are trained on large multimodal datasets and can perform a variety of tasks such as image understanding, visual question answering, and reasoning. Pre-trained models like PaLM-E and PaliGemma already understand objects, spatial relationships, and physical concepts from internet data.

Cross-Embodiment Datasets: The Open X-Embodiment dataset, a collection of roughly 1M trajectories from 20+ robot types pooled by 22 institutions, was used to train models like RT-2-X and OpenVLA to generalize across different robot hardware. One model can now control different robot arms, grippers, and mobile bases.

Scalable Architectures: Transformer-based policies scale like language models. More data and compute directly improve robot performance—something traditional control methods couldn't achieve.


Architecture: How VLAs Work

VLAs share a common high-level architecture with two stages. In the first stage, a pre-trained VLM serves as the perception and reasoning core: it encodes one or more camera images together with a language instruction into a sequence of tokens in a shared latent space. The second stage decodes those tokens into robot actions.

Stage 1: Vision-Language Backbone

The VLM processes:

  • Camera images (usually 224x224 or 256x256)
  • Text instruction ("pick up the red cup")
  • Robot state (joint angles, gripper position)

These get encoded into a shared token space—just like language model embeddings, but multimodal.
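A toy sketch of this encoding step, with random projections standing in for the real encoders (all dimensions and function names below are illustrative assumptions, not any specific model's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
D = 512            # shared embedding width
N_PATCHES = 256    # 16x16 patches from a 224x224 image

def encode_image(image):
    # Stand-in for a ViT patch encoder: one D-dim token per patch.
    return rng.normal(size=(N_PATCHES, D))

def encode_text(instruction):
    # Stand-in for a text tokenizer + embedding table: one token per word.
    return rng.normal(size=(len(instruction.split()), D))

def encode_state(joint_angles):
    # Robot proprioception projected into the same space as one token.
    return rng.normal(size=(1, D))

image = np.zeros((224, 224, 3))
tokens = np.concatenate([
    encode_image(image),
    encode_text("pick up the red cup"),
    encode_state(np.zeros(7)),
])
print(tokens.shape)  # (256 + 5 + 1, 512) -> (262, 512)
```

The key point: after this step, image patches, words, and robot state are all just rows in one token sequence, so a standard transformer can attend across all of them.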

Stage 2: Action Decoder

This is where VLA architectures diverge into two approaches:

Discrete Token Output (RT-2, OpenVLA):

The model encodes robot actions as strings of discrete tokens, and the VLA learns to generate these sequences just as a language model generates text.

# Action represented as tokens
action = ["move_x_+5", "move_y_-2", "gripper_open"]
# Model predicts these like text generation

Pros: Simple training; reuses the language model architecture
Cons: Converting continuous trajectories into vocabulary symbols can limit spatial accuracy and temporal resolution
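As a concrete sketch of the tokenization idea: RT-2 and OpenVLA bin each action dimension into 256 discrete values. The normalization range and function names below are illustrative assumptions:

```python
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assume actions pre-normalized to [-1, 1]

def discretize(action):
    """Map each continuous action dimension to one of N_BINS token ids."""
    clipped = np.clip(action, LOW, HIGH)
    ids = np.floor((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1) + 0.5)
    return ids.astype(int)

def undiscretize(ids):
    """Map token ids back to bin-center continuous values."""
    return LOW + ids / (N_BINS - 1) * (HIGH - LOW)

a = np.array([0.15, -0.08, 0.22, 0.1, 0.0, 0.3, 0.8])
ids = discretize(a)
recovered = undiscretize(ids)
print(np.max(np.abs(recovered - a)))  # round-trip error under one bin width
```

The round-trip error here is at most half a bin width (~0.004 on a [-1, 1] range), which is exactly the spatial-resolution limit the "Cons" line above refers to.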

Continuous Output (π0, Diffusion):

To achieve fine dexterity and high-frequency control, some VLAs forgo discrete tokens and directly output continuous actions using diffusion models or flow-matching networks.

# Action as one point on a continuous trajectory
action = {"x": 0.15, "y": -0.08, "z": 0.22,
          "roll": 0.1, "pitch": 0.0, "yaw": 0.3,
          "gripper": 0.8}  # emitted at 50 Hz

Pros: Higher precision; faster control loops
Cons: More complex training with diffusion/flow matching
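A minimal sketch of the flow-matching objective that trains such an action expert: interpolate between noise and an expert action chunk, then regress the velocity along that path. The straight-line path and toy stand-in network are illustrative, not π0's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 7   # e.g. 6-DoF pose + gripper
CHUNK = 50       # actions per chunk (1 second at 50 Hz)

def flow_matching_loss(predict_velocity, expert_chunk):
    """One training example: sample a point on the straight path from
    noise to the expert chunk and regress the constant velocity."""
    noise = rng.normal(size=expert_chunk.shape)
    t = rng.uniform()                           # random time in [0, 1]
    x_t = (1 - t) * noise + t * expert_chunk    # point on the path
    target_velocity = expert_chunk - noise      # d(x_t)/dt along the path
    pred = predict_velocity(x_t, t)
    return np.mean((pred - target_velocity) ** 2)

# Toy "action expert": always predicts zero velocity.
expert_chunk = rng.normal(size=(CHUNK, ACTION_DIM))
loss = flow_matching_loss(lambda x, t: np.zeros_like(x), expert_chunk)
print(loss)
```

At inference time, the trained velocity model is integrated from noise to produce a continuous action chunk in a handful of steps, which is what makes 50 Hz control feasible.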


Key Models You Should Know

RT-2 (Google DeepMind, 2023)

RT-2, developed by Google DeepMind in mid-2023, established the vision-language-action paradigm in robotics by building on two state-of-the-art VLMs, PaLI-X and PaLM-E, and fine-tuning them on real robot demonstration data.

Architecture: PaLM-E VLM + discrete action tokens
Training: Web-scale VLM pre-training + 130k robot trajectories
Notable: First to show internet knowledge transfers to robot control

OpenVLA (2024)

OpenVLA is a 7B-parameter, open-source VLA built on the Prismatic VLM and Llama 2, trained on a large slice of the Open X-Embodiment data, making VLA research far more accessible.

Architecture: Llama 2-7B + Prismatic vision encoder
Training: 800k+ trajectories from Open X-Embodiment
Notable: Despite being much smaller than RT-2, OpenVLA outperforms it on a suite of manipulation tasks

π0 (Pi-Zero) - Physical Intelligence (2024)

π0 uses PaliGemma as its pre-trained VLM backbone, itself built from a SigLIP vision encoder and the Gemma language model, plus an action expert trained on robot trajectories from Open X-Embodiment.

Architecture: PaliGemma 3B + 300M flow-matching action expert
Key Innovation: Mixture-of-Transformers design with separate parameters for vision/language and actions
Performance: Operates at 50 Hz with action chunks of size 50, predicting a full second of motion per inference call

Why π0 differs:

π0 adopts a mixture-of-experts-like architecture in which each expert has its own parameters and the experts interact only through attention; a new set of action-expert parameters is attached to the pre-trained 3B PaliGemma VLM. The VLM backbone never attends to future actions during training, which prevents the action stream from interfering with its pre-trained visual understanding.
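The attention pattern described above can be sketched as a block mask (token counts here are toy values; real models use hundreds or thousands of prefix tokens):

```python
import numpy as np

N_PREFIX = 6   # vision + language tokens (toy count)
N_ACTION = 4   # action tokens (toy count)
N = N_PREFIX + N_ACTION

# mask[i, j] = True means token i may attend to token j.
mask = np.zeros((N, N), dtype=bool)
mask[:N_PREFIX, :N_PREFIX] = True   # VLM tokens attend only within the prefix
mask[N_PREFIX:, :] = True           # action tokens attend to prefix + actions

print(mask[:N_PREFIX, N_PREFIX:].any())  # False: the backbone never sees actions
```

Because the top-left block is isolated, the VLM's activations over the image and instruction are identical with or without action tokens attached, which is what preserves its pre-trained behavior.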


Training Recipe: The Three-Stage Process

Modern VLAs follow a training recipe inspired by LLM development:

Step 1: VLM Pre-training (Internet Scale)

Train or use existing VLM on billions of image-text pairs. The model learns:

  • Object recognition and semantics
  • Spatial relationships
  • Physical common sense
  • Language grounding

Step 2: Cross-Embodiment Pre-training

By unifying vision, language, and action data at scale, modalities that have traditionally been studied separately, VLA models aim to learn policies that generalize across diverse tasks, objects, embodiments, and environments.

Train on multi-robot datasets (Open X-Embodiment). The model learns:

  • Different robot morphologies (single arm, dual arm, mobile)
  • Task variations across platforms
  • General manipulation skills

Step 3: Post-Training (Task-Specific)

Fine-tune on high-quality task demonstrations. The π0 authors argue that post-training on high-quality data matters because lower-quality data contains teleoperation mistakes that we don't want robots to imitate.

Typically needs 50-100 demonstrations for new tasks.
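A toy illustration of this post-training step: behavior-cloning a small linear action head on 50 synthetic demonstrations. Real fine-tuning updates a transformer on real teleoperation data; everything here is a stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy demonstration set: 50 (observation, expert_action) pairs.
N_DEMOS, OBS_DIM, ACT_DIM = 50, 16, 7
obs = rng.normal(size=(N_DEMOS, OBS_DIM))
acts = rng.normal(size=(N_DEMOS, ACT_DIM))

# Fine-tune only a small linear action head (backbone assumed frozen).
W = np.zeros((OBS_DIM, ACT_DIM))
lr = 0.01
for epoch in range(200):
    pred = obs @ W
    grad = obs.T @ (pred - acts) / N_DEMOS   # gradient of mean-squared error
    W -= lr * grad

final_loss = np.mean((obs @ W - acts) ** 2)
print(final_loss)  # lower than the untrained loss
```

The point of the sketch: with only 50 demonstrations, there is enough signal to fit a small head, which is why freezing most of the pre-trained model and adapting a small part is the common recipe at this stage.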


Design Choices That Matter

Single vs Dual Architecture

The single-model design, employed by RT-2, OpenVLA, and π0, understands the scene and the language instruction and produces robot actions in a single forward pass, keeping the architecture simple and latency low.

The dual-system design, adopted by Helix and GR00T N1, splits the architecture into two components: a slower one that processes image observations and text instructions, and a faster one that produces the robot's actions.

Choose single-model for:

  • Lower latency requirements
  • Simpler deployment
  • Research and prototyping

Choose dual-system for:

  • High-frequency control (>50 Hz)
  • Complex manipulation
  • When perception and control have different update rates

Action Chunking

Action chunking was introduced by Zhao et al. (2023). π0 implements it by running flow matching on multiple action tokens in parallel.

Instead of predicting one action at a time, predict the next 50 actions (1 second at 50 Hz). Benefits:

  • Smoother trajectories
  • Reduces compounding errors
  • Enables parallel inference
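The benefits above can be sketched as a receding-horizon execution loop, where each inference call covers many control ticks (chunk and re-plan sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
CHUNK = 50     # actions per prediction (1 second at 50 Hz)
EXECUTE = 25   # execute part of the chunk, then re-plan (illustrative)

def predict_chunk(observation):
    # Stand-in for the VLA: one inference call yields a whole chunk.
    return rng.normal(size=(CHUNK, 7))

executed = []
inference_calls = 0
for _ in range(4):                    # 2 seconds of control
    chunk = predict_chunk(observation=None)
    inference_calls += 1
    executed.extend(chunk[:EXECUTE])  # send first EXECUTE actions to the robot

print(len(executed), inference_calls)  # 100 actions from 4 inference calls
```

Executing only part of each chunk before re-planning keeps the controller responsive to new observations while still amortizing inference cost across many control ticks.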

When to Use What

Use discrete token VLAs (RT-2, OpenVLA) when:

  • You need simple deployment
  • Training infrastructure is limited
  • Tasks don't require high-frequency control
  • You want to reuse language model code

Use continuous VLAs (π0, diffusion-based) when:

  • Dexterous manipulation is critical
  • You need >10 Hz control frequency
  • Spatial precision matters
  • You have diffusion training expertise

Real-World Performance

π0 attains by far the best results across the board on all the zero-shot tasks, with near perfect success rates on shirt folding and the easier bussing tasks, and large improvements over all baselines.

Benchmark tasks (zero-shot, no fine-tuning):

  • Folding laundry: π0 95%, OpenVLA 30%
  • Bussing tables: π0 85%, Octo 45%
  • Stacking objects: π0 90%, RT-2 60%

OpenVLA struggles on these tasks because its autoregressive discretization architecture does not support action chunks.

But there's a catch: These numbers come from controlled lab environments. Real-world deployment faces challenges like:

  • Lighting variations
  • Novel object shapes
  • Physical failures (slipping, collisions)
  • Long-horizon tasks requiring planning

The Frontier: What's Coming

Recent research from ICLR 2026 and Nature Machine Intelligence shows several trends:

Discrete Diffusion VLAs: An analysis of 164 Vision-Language-Action model submissions at ICLR 2026 identifies trends toward discrete diffusion VLAs, reasoning models, and new benchmarks. Discrete diffusion combines the benefits of tokenization with diffusion's trajectory modeling.

Online Learning: SOP (Scalable Online Post-training) is a framework for updating Vision-Language-Action models online across robot fleets, streaming all execution trajectories, reward signals, and human corrections in real time. The goal: robots that improve from deployment experience.

Reasoning VLAs: Integrating chain-of-thought and planning before action execution.

Smaller Models: GR00T N1 and π0 both build their VLA architectures on roughly 2B-parameter LLM backbones, since small models are needed for on-device inference and real-time latency.


Getting Started

For Researchers

OpenVLA is the most accessible starting point:

# Install from the GitHub repository
git clone https://github.com/openvla/openvla
cd openvla && pip install -e .
# Pre-trained checkpoints and full training code available

For custom robots: Use the Open X-Embodiment dataset format and fine-tune on your platform.

For Production

Consider:

  • Latency requirements (edge vs cloud)
  • Safety constraints (human-in-the-loop?)
  • Data collection strategy
  • Continuous learning infrastructure

Start small: Fine-tune a pre-trained model on 50-100 demonstrations of your task before scaling.


What You Learned

  • VLAs extend VLMs with an action component, yielding vision–language–action models for robotic tasks and motion planning
  • Discrete tokens (RT-2, OpenVLA) are simpler but less precise than continuous outputs (π0)
  • Cross-embodiment training enables one model to control different robot types
  • Modern VLAs follow a three-stage recipe: VLM pre-training, cross-embodiment training, task-specific fine-tuning
  • Action chunking and proper architecture choices matter more than model size

Limitations: VLAs still struggle with long-horizon planning, recovery from failures, and truly open-world scenarios. They're powerful but not yet general intelligence.


Research cited: Nature Machine Intelligence 2026, ICLR 2026 submissions, Physical Intelligence π0 paper (2024), Google DeepMind RT-2 (2023), OpenVLA (2024). Benchmarks: CALVIN, LIBERO, SIMPLER.