Problem: Retraining Robots for Every New Task is Expensive
Your robot arm can stack blocks but fails when you ask it to "place the red mug on the coaster." Traditional approaches require collecting thousands of demonstrations and retraining for days.
You'll learn:
- How vision-language models enable zero-shot task execution
- Implementing RT-2 or OpenVLA for robot control
- When zero-shot works vs. when you need fine-tuning
Time: 45 min | Level: Advanced
Why This Happens
Traditional robot learning maps sensor inputs directly to motor commands through task-specific neural networks. Each new task requires new training data and model retraining.
Common symptoms:
- Robot freezes on unfamiliar objects or instructions
- Need 1000+ demonstrations per new task
- Performance degrades when the environment changes slightly
- Cannot understand natural language commands
The breakthrough: Vision-language models (VLMs) pre-trained on internet-scale data understand object relationships and language, enabling generalization to unseen tasks.
Solution
Step 1: Choose Your Framework
Option A: RT-2 (Google DeepMind)
- Best for: Manipulation tasks, smaller robots
- Requirements: 6-DOF arm, RGB camera, 24GB GPU
- Strengths: Proven real-world deployment, faster inference
Option B: OpenVLA (Stanford)
- Best for: Custom tasks, research flexibility
- Requirements: 7-DOF arm, depth camera, 40GB GPU
- Strengths: Open weights, easier fine-tuning
```bash
# We'll use OpenVLA for this guide
git clone https://github.com/openvla/openvla
cd openvla
```
Step 2: Install Dependencies
```bash
# Create an isolated environment
conda create -n openvla python=3.11
conda activate openvla

# Install PyTorch 2.3 with CUDA 12.1
pip install torch==2.3.0 torchvision==0.18.0 --index-url https://download.pytorch.org/whl/cu121

# Install OpenVLA (includes transformers, diffusers)
pip install openvla==0.3.0

# Robotics stack
pip install pyrealsense2 opencv-python==4.9.0
```

Note: inside a conda environment you do not need `--break-system-packages`; that flag only applies when pip targets a distro-managed system Python.
Expected: Installation completes in ~8 minutes. Check the GPU:

```bash
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True
```

If it fails:
- CUDA not found: install NVIDIA drivers 545+ and verify with `nvidia-smi`
- OOM during install: use `pip install --no-cache-dir`
Step 3: Load Pre-trained Model
```python
# load_model.py
import torch

from openvla import OpenVLAPolicy

# Download the 7B-parameter model (~14 GB)
policy = OpenVLAPolicy.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,  # faster inference, less memory
    device_map="auto",           # automatic GPU allocation
)

print(f"Model loaded on: {policy.device}")
print(f"Memory used: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```
Why bfloat16: halves weight memory from ~28 GB (float32) to ~14 GB with minimal accuracy loss, which is critical for real-time control.
```bash
python load_model.py
```

Expected output:

```
Model loaded on: cuda:0
Memory used: 13.87 GB
```
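The memory figures above follow directly from parameter count and dtype width. A quick back-of-the-envelope check (weights only; activations and caches add more on top):

```python
# Weight memory = parameter count x bytes per parameter
params = 7e9  # openvla-7b

bytes_fp32 = params * 4   # float32: 4 bytes/param
bytes_bf16 = params * 2   # bfloat16: 2 bytes/param
bytes_int8 = params * 1   # 8-bit quantized: 1 byte/param

for name, b in [("fp32", bytes_fp32), ("bf16", bytes_bf16), ("int8", bytes_int8)]:
    print(f"{name}: {b / 1e9:.0f} GB")
# fp32: 28 GB, bf16: 14 GB, int8: 7 GB
```

This is why the loaded model reports ~13.9 GB: essentially all of it is bfloat16 weights.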
Step 4: Set Up Robot Interface
```python
# robot_interface.py
import socket

import numpy as np
import pyrealsense2 as rs


class RobotController:
    def __init__(self, robot_ip="192.168.1.100"):
        # Initialize camera
        self.pipeline = rs.pipeline()
        config = rs.config()
        config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
        self.pipeline.start(config)

        # Connect to robot arm (example: UR5 via its URScript port)
        self.robot = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.robot.connect((robot_ip, 30002))

    def get_observation(self):
        """Capture an RGB image and the robot's joint states."""
        frames = self.pipeline.wait_for_frames()
        color_frame = frames.get_color_frame()
        image = np.asanyarray(color_frame.get_data())

        # The RealSense stream is BGR; the VLM expects RGB
        image = image[:, :, ::-1]

        # Robot-specific joint readout, not shown here
        joint_positions = self._read_joint_positions()

        return {"image": image, "joint_pos": joint_positions}

    def execute_action(self, action):
        """Send a normalized action in [-1, 1] to the robot."""
        # Convert the normalized action to joint velocities (rad/s)
        joint_vels = np.asarray(action) * 0.5  # scale to safe speeds
        cmd = f"speedj({joint_vels.tolist()}, 0.5, 0.1)\n"
        self.robot.send(cmd.encode())


# Test connection
robot = RobotController()
obs = robot.get_observation()
print(f"Image shape: {obs['image'].shape}")
print(f"Joint positions: {obs['joint_pos']}")
```
Safety check: Always test with `action * 0.1` first to verify movement direction.
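One way to apply that advice is a small wrapper that scales and clips actions before they reach the arm. A sketch; `soften_action` and its defaults are my own, not part of the OpenVLA API:

```python
def soften_action(action, scale=0.1, limit=1.0):
    """Scale a normalized action down for a cautious first run,
    then clip each dimension to the expected [-limit, limit] range."""
    return [max(-limit, min(limit, a * scale)) for a in action]

# A full-speed action becomes a slow movement in the same direction
slow = soften_action([1.0, -0.8, 0.2, 0.0, 0.0, 0.0, 1.0])
print(slow)
```

Once the motion direction looks right at `scale=0.1`, raise the scale gradually rather than jumping straight to 1.0.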
Step 5: Run Zero-Shot Inference
```python
# zero_shot_control.py
import torch

from openvla import OpenVLAPolicy
from robot_interface import RobotController

# Initialize
policy = OpenVLAPolicy.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, device_map="auto"
)
robot = RobotController()

# Natural-language task
task = "pick up the red mug and place it on the wooden coaster"
print(f"Task: {task}")
print("Starting execution...")

for step in range(200):  # max 200 steps (~10 seconds at 20 Hz)
    # Get current state
    obs = robot.get_observation()

    # Model predicts an action from image + language
    with torch.no_grad():
        action = policy.predict_action(
            image=obs["image"],
            instruction=task,
            joint_state=obs["joint_pos"],
        )

    # Execute on robot
    robot.execute_action(action)

    # Termination heuristic: the last action dimension is the gripper;
    # a value above 0.5 here signals the episode is done
    if action[-1] > 0.5:
        print(f"Task completed in {step + 1} steps")
        break

    if step % 20 == 0:
        print(f"Step {step}/200")

print("Done!")
```
How it works: The VLM encodes the image and language into a shared embedding space, then a policy head predicts robot actions. No task-specific training needed.
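The "predicts robot actions" part deserves a concrete picture. Both RT-2 and OpenVLA emit actions as discrete tokens: each continuous action dimension is quantized into 256 bins, so the language model's decoder can output actions like any other vocabulary token. A minimal pure-Python sketch of that round trip (the 256-bin count matches the papers; the helper names are mine):

```python
N_BINS = 256  # RT-2 and OpenVLA both use 256 bins per action dimension

def action_to_token(x, low=-1.0, high=1.0, n_bins=N_BINS):
    """Quantize one normalized action dimension into a token id."""
    x = min(max(x, low), high)  # clamp into the valid range
    return round((x - low) / (high - low) * (n_bins - 1))

def token_to_action(t, low=-1.0, high=1.0, n_bins=N_BINS):
    """De-quantize a token id back to a continuous value."""
    return low + (high - low) * t / (n_bins - 1)

# Round-trip error is at most half a bin width, about 0.004 on [-1, 1]
for x in (-1.0, -0.33, 0.0, 0.5, 1.0):
    t = action_to_token(x)
    print(f"{x:+.2f} -> token {t:3d} -> {token_to_action(t):+.4f}")
```

This quantization is also why zero-shot precision tops out around centimeters, as noted in the limitations below: the action space itself is coarse.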
Step 6: Handle Edge Cases
```python
# Add safety checks and fallbacks
def safe_execute(policy, robot, task, max_steps=200):
    # Workspace bounds (meters from robot base)
    WORKSPACE = {
        "x": (-0.5, 0.5),
        "y": (-0.5, 0.5),
        "z": (0.0, 0.8),
    }

    for step in range(max_steps):
        obs = robot.get_observation()

        # Predict action
        action = policy.predict_action(
            image=obs["image"],
            instruction=task,
            joint_state=obs["joint_pos"],
        )

        # Validate that the resulting pose stays inside the workspace
        predicted_pos = robot.forward_kinematics(obs["joint_pos"] + action[:6])
        in_bounds = all(
            WORKSPACE[axis][0] <= predicted_pos[i] <= WORKSPACE[axis][1]
            for i, axis in enumerate(["x", "y", "z"])
        )
        if not in_bounds:
            print(f"Warning: action at step {step} exceeds workspace, clamping")
            action = robot.clamp_to_workspace(action)

        robot.execute_action(action)

        # Same gripper-based termination heuristic as before
        if action[-1] > 0.5:
            return True  # success

    print("Max steps reached, task may be incomplete")
    return False
```
Critical safety: Always validate actions against workspace limits. VLMs can hallucinate impossible movements.
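The `clamp_to_workspace` call above is left to the reader. One simple approach is to clamp the predicted Cartesian target to the workspace box, axis by axis; this is a hypothetical sketch, not the real `RobotController` API, and in practice you would then solve inverse kinematics back to a joint-space action:

```python
WORKSPACE = {"x": (-0.5, 0.5), "y": (-0.5, 0.5), "z": (0.0, 0.8)}

def clamp_position(pos, workspace=WORKSPACE):
    """Clamp an (x, y, z) target to the workspace box, one axis at a time."""
    return tuple(
        min(max(p, workspace[axis][0]), workspace[axis][1])
        for p, axis in zip(pos, ("x", "y", "z"))
    )

print(clamp_position((0.7, -0.2, -0.1)))  # -> (0.5, -0.2, 0.0)
```

Clamping keeps the arm inside the box but can still produce odd trajectories along the boundary; for anything near people, prefer stopping outright over clamping.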
Verification
Test 1: Known Objects
```python
# Test with objects in the training distribution
tasks = [
    "pick up the blue cube",
    "move the red ball to the left",
    "grasp the green cylinder",
]

for task in tasks:
    success = safe_execute(policy, robot, task)
    print(f"{task}: {'✓' if success else '✗'}")
```
Expected: 80-90% success rate on seen object categories.
Test 2: Novel Objects (Zero-Shot)
```python
# Objects NOT in the training data
novel_tasks = [
    "pick up the striped coffee mug",           # novel pattern
    "place the wooden coaster under the lamp",  # novel object
    "move the purple stapler to the right",     # novel color + object
]

for task in novel_tasks:
    success = safe_execute(policy, robot, task)
    print(f"{task}: {'✓' if success else '✗'}")
```
Expected: 40-60% success rate. Zero-shot works best when:
- Objects have clear visual features
- Tasks are manipulation primitives (pick, place, push)
- Language is concrete, not abstract
If success < 40%:
- Improve lighting (VLMs struggle with shadows)
- Use more descriptive language ("red ceramic mug" vs "mug")
- Check camera calibration (distortion affects VLM encoding)
What You Learned
- VLMs enable robots to generalize to new objects and tasks without retraining
- RT-2/OpenVLA map vision + language directly to robot actions
- Zero-shot works for ~50% of novel scenarios; fine-tuning needed for precision tasks
- Safety constraints are critical when using generative models for control
Limitations:
- Inference: 50-100 ms per action (10-20 Hz, slower than classical control loops)
- Precision: ±2cm accuracy vs. ±0.5mm for trained models
- Failure modes: Can hallucinate objects or movements
- Cost: Requires 24GB+ GPU ($2000+ hardware)
When NOT to use zero-shot:
- High-precision assembly (use classical control + vision)
- Safety-critical tasks (medical, aerospace)
- Repetitive tasks with fixed objects (faster to train specialist model)
Production Checklist
Before deploying in real environments:
- Add emergency stop button (physical hardware interrupt)
- Implement force/torque limits in robot firmware
- Log all predictions and actions for debugging
- Set up automatic model fallback if inference fails
- Test worst-case scenarios (occluded objects, poor lighting)
- Calibrate camera extrinsics every 24 hours
- Monitor GPU temperature (VLMs run hot)
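The logging item on the checklist is easy to get in early. A minimal JSONL logger sketch (the file path and field names are my choices, not a standard):

```python
import json
import time

def log_step(path, step, instruction, action, extra=None):
    """Append one control step as a JSON line for later debugging."""
    record = {
        "t": time.time(),           # wall-clock timestamp
        "step": step,               # control-loop index
        "instruction": instruction, # the language command
        "action": list(action),     # the predicted action vector
    }
    if extra:
        record.update(extra)
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_step("rollout.jsonl", 0, "pick up the red mug", [0.1, -0.2, 0.0])
```

One line per control step keeps files grep-able and lets you replay a failed rollout action by action.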
Troubleshooting
"Model outputs random actions"
- Check image preprocessing (VLMs expect RGB, not BGR)
- Verify joint state normalization matches training data
- Test with known-good examples from OpenVLA dataset
"Robot moves too fast/slow"
- Adjust action scaling: `action * velocity_multiplier`
- Check control frequency (should be 10-20 Hz)
- Verify network latency < 20 ms
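To check the latency bullet, you can time TCP connection setup to the robot's control port; one handshake approximates one network round trip. A rough stdlib sketch, not a substitute for proper network monitoring:

```python
import socket
import statistics
import time

def median_rtt_ms(host, port, n=10):
    """Median TCP connect time to (host, port) in milliseconds.
    connect() completes after one SYN/SYN-ACK exchange, so it
    approximates a single network round trip."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        with socket.create_connection((host, port), timeout=2.0):
            pass  # close immediately; we only time the handshake
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

# e.g. median_rtt_ms("192.168.1.100", 30002) should stay well
# under 20 ms on a wired LAN
```

If the median is fine but individual samples spike, suspect Wi-Fi or a congested switch rather than the robot controller.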
"CUDA out of memory"
- Reduce batch size to 1
- Switch to 8-bit quantization: `load_in_8bit=True` (roughly halves weight memory vs. bfloat16)
- Free cached allocations between runs with `torch.cuda.empty_cache()`
Tested on: UR5e robot, RealSense D435i, RTX 4090, Ubuntu 22.04, OpenVLA 0.3.0
Hardware costs:
- GPU (RTX 4090): $1,600
- Robot arm (UR5e): $35,000
- Camera (D435i): $400
- Total: ~$37,000
Alternative: Simulate in Isaac Sim (free) before buying hardware.