Problem: Automating Desktops Without Writing Per-App Scripts
Claude Computer Use API lets you automate any desktop task — filling forms, clicking buttons, reading screens — using a vision-capable AI instead of brittle CSS selectors or recorded macros.
Traditional tools like Playwright or PyAutoGUI require you to know the app's DOM or screen coordinates ahead of time. The Computer Use API sends screenshots to Claude, which decides what to click, type, or scroll next. It works on anything visible — legacy apps, Electron tools, browser GUIs, PDFs.
You'll learn:
- How to set up the Computer Use API with Docker and Python 3.11+
- How to send tool calls that let Claude control a virtual desktop
- How to run a real automation loop: screenshot → Claude decides → action → repeat
- How to deploy this on an AWS EC2 instance (us-east-1) for production use
Time: 25 min | Difficulty: Intermediate
Why This Works Differently From Other Automation Tools
Most desktop automation tools fail when the UI changes. The Computer Use API doesn't hard-code coordinates or selectors — Claude reads the current screenshot and reasons about what action to take next.
Anthropic ships three tools in the computer-use-2024-10-22 beta:
- `computer` — moves the mouse, clicks, types, and takes screenshots
- `bash` — runs shell commands inside the sandboxed environment
- `text_editor` — reads and writes files with surgical precision
Symptoms that bring developers here:
- PyAutoGUI breaks after every OS update
- Playwright can't reach Electron or native app windows
- Selenium fails on shadow DOM and iframes in enterprise tools
- You need AI judgment mid-task, not just a replay of recorded clicks
Claude's computer use loop: capture screenshot → send to API → parse tool call → execute action → repeat until task done
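Concretely, each iteration of that loop hands your code a `tool_use` content block to execute. A representative shape (the `id` and coordinate values here are illustrative, not captured output):

```python
# A representative `tool_use` content block, as parsed from an API response.
tool_use_block = {
    "type": "tool_use",
    "id": "toolu_example",        # placeholder; real ids are generated by the API
    "name": "computer",           # which declared tool Claude wants to use
    "input": {
        "action": "left_click",
        "coordinate": [512, 384], # x, y in display pixels
    },
}

# Your executor dispatches on the tool name plus input["action"]
assert tool_use_block["name"] == "computer"
assert tool_use_block["input"]["action"] == "left_click"
```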
Solution
Step 1: Run the Reference Docker Container
Anthropic provides an official Docker image with a full Ubuntu 22.04 desktop, noVNC, and all tool dependencies pre-wired.
# Pull the official Computer Use demo image
docker pull ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest

# Run with your API key — exposes noVNC on 6080, API on 8080.
# Port map: 5900 VNC direct, 6080 noVNC browser view,
# 8501 Streamlit demo UI, 8080 HTTP API passthrough.
docker run \
    -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
    -v $HOME/.anthropic:/home/computeruse/.anthropic \
    -p 5900:5900 \
    -p 6080:6080 \
    -p 8501:8501 \
    -p 8080:8080 \
    -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
Expected output: Container starts, noVNC available at http://localhost:6080/vnc.html
If it fails:
- `port already in use` → `lsof -i :6080` and kill the conflicting process
- `permission denied on /home/computeruse` → run with `--user root` (local dev only)
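You can catch the port conflict before launching the container by checking that the four mapped ports are actually free. A small sketch, pure stdlib, no assumptions about the container:

```python
import socket

def port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 only if something answered the connection
        return s.connect_ex((host, port)) != 0

# Check the four ports the docker run command maps
for p in (5900, 6080, 8501, 8080):
    print(p, "free" if port_free(p) else "IN USE")
```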
Step 2: Install the Python SDK
Outside the container, set up your local controller script.
# Requires Python 3.11+ — use uv for speed
uv venv .venv && source .venv/bin/activate
uv pip install "anthropic>=0.40.0"  # quote the specifier so the shell doesn't treat > as a redirect
Verify the install:
python -c "import anthropic; print(anthropic.__version__)"
# Expected: 0.40.0 or higher
Step 3: Define the Computer Use Tools
The API requires you to declare which tools Claude is allowed to use. Display dimensions must match the container's virtual screen exactly — mismatches cause off-target clicks.
import anthropic

client = anthropic.Anthropic()

# Match these to the container's virtual display resolution
SCREEN_WIDTH = 1024
SCREEN_HEIGHT = 768

tools = [
    {
        "type": "computer_20241022",  # computer_use_tool beta name — exact string required
        "name": "computer",
        "display_width_px": SCREEN_WIDTH,
        "display_height_px": SCREEN_HEIGHT,
        "display_number": 1,  # :1 Xvfb display inside the container
    },
    {
        "type": "bash_20241022",
        "name": "bash",
    },
    {
        "type": "text_editor_20241022",
        "name": "str_replace_editor",
    },
]
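If you ever send Claude screenshots downscaled from the real display (a common trick to cut token costs), its click coordinates arrive in the scaled space and must be mapped back before you click. A hypothetical helper — `scale_coordinates` is not part of the SDK:

```python
def scale_coordinates(x: int, y: int,
                      model_w: int, model_h: int,
                      actual_w: int, actual_h: int) -> tuple[int, int]:
    """Map a coordinate from the resolution Claude saw to the real display.

    Also clamps to screen bounds so a slightly-off click can't land
    outside the Xvfb display.
    """
    sx = round(x * actual_w / model_w)
    sy = round(y * actual_h / model_h)
    return (min(max(sx, 0), actual_w - 1), min(max(sy, 0), actual_h - 1))

# With matching resolutions this is the identity:
print(scale_coordinates(512, 384, 1024, 768, 1024, 768))  # → (512, 384)
```

When the tool definition and display resolution match exactly, as in this tutorial, the function changes nothing; it only matters once you introduce resizing.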
Step 4: Write the Automation Loop
This is the core of every Computer Use integration. You send a task, Claude takes a screenshot, decides on an action, you execute it, and feed the result back until Claude says it's done.
import base64
import os
import subprocess
import tempfile

import anthropic

client = anthropic.Anthropic()

# This script runs inside the container, where the virtual display is :1
os.environ.setdefault("DISPLAY", ":1")

def take_screenshot() -> str:
    """Capture the virtual display and return base64 PNG."""
    path = os.path.join(tempfile.gettempdir(), "screenshot.png")
    if os.path.exists(path):
        os.remove(path)  # scrot refuses to overwrite an existing file
    subprocess.run(["scrot", path], check=True)  # scrot is pre-installed in the container
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def execute_tool_call(tool_name: str, tool_input: dict):
    """Route Claude's tool call to the right executor.

    Returns a plain string, or a list of content blocks for screenshots —
    images must go back to the API as image blocks, not raw base64 text.
    """
    if tool_name == "computer":
        action = tool_input["action"]
        if action == "screenshot":
            # Claude is asking for a fresh screenshot
            return [{
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": take_screenshot(),
                },
            }]
        elif action == "left_click":
            x, y = tool_input["coordinate"]
            subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "1"])
            return "clicked"
        elif action == "type":
            subprocess.run(["xdotool", "type", "--clearmodifiers", tool_input["text"]])
            return "typed"
        elif action == "key":
            subprocess.run(["xdotool", "key", tool_input["key"]])
            return "key sent"
    elif tool_name == "bash":
        result = subprocess.run(
            tool_input["command"],
            shell=True,
            capture_output=True,
            text=True,
            timeout=30,  # Prevent runaway commands from hanging the loop
        )
        return result.stdout + result.stderr
    return "unknown tool or action"

def run_computer_use_task(task: str) -> str:
    """Main loop: send task, execute tool calls, return when Claude finishes."""
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.beta.messages.create(
            model="claude-opus-4-5-20251101",  # Use Opus for complex visual reasoning
            max_tokens=4096,
            tools=tools,
            messages=messages,
            betas=["computer-use-2024-10-22"],  # Required beta header
        )
        # Append Claude's response to the conversation
        messages.append({"role": "assistant", "content": response.content})

        # Claude stops requesting tools when the task is done
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Task complete."

        # Execute every tool call in this turn
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool_call(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        # Feed results back so Claude can continue
        messages.append({"role": "user", "content": tool_results})
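One practical hardening step: a bare `while True` can spin forever if Claude never reaches `end_turn`. A sketch of a capped driver — the `step` callable and `run_with_cap` name are assumptions for illustration, not SDK API:

```python
def run_with_cap(step, max_turns: int = 20) -> str:
    """Call step() repeatedly until it reports completion or the cap is hit.

    step() performs one API-call-plus-tool-execution round and returns
    (done, value): done=True with the final text, or done=False to continue.
    """
    for _ in range(max_turns):
        done, value = step()
        if done:
            return value
    raise RuntimeError(f"task did not finish within {max_turns} turns")
```

In practice you would move the body of the `while True` loop into `step`, so a stuck task fails loudly after a bounded number of API calls instead of burning tokens indefinitely.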
Step 5: Run a Real Task
With the loop ready, test it with a task that requires actual GUI interaction.
if __name__ == "__main__":
result = run_computer_use_task(
"Open the Firefox browser, navigate to https://example.com, "
"take a screenshot, and save it to /tmp/example_screenshot.png"
)
print(result)
Expected output:
Opened Firefox, navigated to https://example.com, and saved screenshot to /tmp/example_screenshot.png.
If it fails:
- `display :1 not found` → you're running outside the container; exec into it first: `docker exec -it <container_id> bash`
- `xdotool: command not found` → `apt-get install xdotool` inside the container
- `anthropic.BadRequestError: beta not enabled` → confirm `betas=["computer-use-2024-10-22"]` is set
Step 6: Deploy on AWS EC2 (Production)
For production automation jobs, run the container on a t3.medium or larger in us-east-1. The Computer Use API costs $3.00/MTok input and $15.00/MTok output (Claude Sonnet 3.5 pricing as of March 2026). Each automation loop averages 5–15 API calls depending on task complexity.
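At those rates you can sketch a per-task cost estimate. The per-call token counts below are assumptions (screenshots make input token counts large); only the $3/$15 per MTok rates come from the pricing above:

```python
INPUT_PER_MTOK = 3.00    # USD per million input tokens
OUTPUT_PER_MTOK = 15.00  # USD per million output tokens

def task_cost(calls: int, in_tokens_per_call: int, out_tokens_per_call: int) -> float:
    """Estimate the USD cost of one automation task."""
    total_in = calls * in_tokens_per_call
    total_out = calls * out_tokens_per_call
    return (total_in * INPUT_PER_MTOK + total_out * OUTPUT_PER_MTOK) / 1_000_000

# 10 API calls, ~3,000 input and ~500 output tokens each (assumed):
print(round(task_cost(10, 3000, 500), 4))  # → 0.165
```

That midpoint lands inside the $0.05–$0.20 per-task range quoted later; heavier tasks with more screenshots skew toward the top of it.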
# On your EC2 instance — Amazon Linux 2023 shown; on Ubuntu 22.04 use apt-get install docker.io
sudo yum install docker -y && sudo systemctl start docker
# Pull and run — same command as local, no GUI needed for headless tasks
docker run -d \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
-p 8080:8080 \
--restart unless-stopped \
ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
Add your task runner as a cron job or trigger it via AWS Lambda → SQS → EC2 for event-driven automation.
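For the cron route, an entry might look like this — the script path, log path, and schedule are placeholders, not part of the container image:

```shell
# crontab -e on the EC2 host: run the task every weekday at 09:00 UTC
0 9 * * 1-5 /usr/bin/python3 /opt/automation/run_task.py >> /var/log/computer-use.log 2>&1
```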
Verification
Run this inside the container to confirm all tools are wired:
python3 -c "
import anthropic
c = anthropic.Anthropic()
r = c.beta.messages.create(
model='claude-opus-4-5-20251101',
max_tokens=256,
tools=[{'type': 'computer_20241022', 'name': 'computer',
'display_width_px': 1024, 'display_height_px': 768, 'display_number': 1}],
messages=[{'role': 'user', 'content': 'Take a screenshot.'}],
betas=['computer-use-2024-10-22'],
)
print(r.stop_reason, [b.type for b in r.content])
"
You should see: tool_use ['tool_use'] — Claude responded with a screenshot request, confirming the tool is active.
What You Learned
- The Computer Use API works by giving Claude vision + tool call access to a real display — no selectors, no coordinates baked in
- The `betas=["computer-use-2024-10-22"]` header is required; omitting it returns a `400` error
- Display resolution in your tool definition must match the container's actual Xvfb display or clicks land in the wrong place
- The automation loop is just a `while` loop: response → execute tool calls → feed results back → check `stop_reason`
- For production on AWS, budget roughly $0.05–$0.20 per completed task at Claude Sonnet 3.5 rates ($3/$15 per MTok)
Tested on claude-opus-4-5-20251101, anthropic Python SDK 0.40.0, Docker 26, Ubuntu 22.04
FAQ
Q: Does Claude Computer Use API work on Windows or macOS hosts?
A: The Docker container runs Ubuntu internally, so your host OS doesn't matter. You run the container on any OS that supports Docker, including Windows 11 with WSL2 and macOS M2/M3.
Q: What is the difference between Computer Use API and Playwright?
A: Playwright automates Chromium via the DevTools Protocol — fast and reliable, but it only works in browsers and breaks on shadow DOM in some enterprise SPAs. Computer Use works on anything visible on screen: native apps, PDFs, Electron tools, and legacy GUIs that have no accessible DOM.
Q: How much VRAM or RAM does the container need?
A: The container itself uses about 512MB RAM — the heavy lifting is done server-side by Anthropic's API. A t3.medium (4GB RAM) on EC2 handles it easily. No GPU required.
Q: Can I use Computer Use API with Claude Sonnet instead of Opus?
A: Yes. Swap claude-opus-4-5-20251101 for claude-sonnet-4-6 for faster, cheaper runs. Sonnet is recommended for repetitive or well-defined tasks; use Opus when the task requires multi-step visual reasoning or handling unexpected UI states.