Problem: Your CrewAI Agent Can Reason About Code But Can't Run It
CrewAI agents write Python fluently — but by default they can't execute it. You get a wall of generated code with no result. To actually process data, run calculations, or validate logic inside a workflow, you need to wire in a code interpreter tool.
You'll learn:
- How to enable sandboxed Python execution in a CrewAI agent
- How to pass data into the interpreter and capture structured output
- How to chain code execution across multiple agents in a crew
Time: 20 min | Difficulty: Intermediate
Why CrewAI Doesn't Execute Code By Default
CrewAI agents are LLM-backed reasoning engines. They generate text — including code — but have no runtime attached. Executing arbitrary code requires a separate tool that wraps a Python interpreter, handles stdout/stderr, and returns results the agent can act on.
CrewAI ships with a CodeInterpreterTool (via crewai-tools) that handles exactly this. It runs code in an isolated subprocess, captures output, and surfaces errors back to the agent so it can self-correct.
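For intuition about what such a tool does under the hood, here's a minimal sketch of the pattern (not CrewAI's actual implementation): run the code string in a fresh subprocess, capture stdout and stderr, and return both as text the agent can read.

```python
import subprocess
import sys

def run_code(code: str, timeout: int = 30) -> str:
    """Execute a Python code string in a fresh subprocess and
    return stdout plus any error output as one string."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"Error: execution exceeded {timeout}s"
    output = result.stdout
    if result.returncode != 0:
        # Surface the traceback so the agent can self-correct
        output += result.stderr
    return output

print(run_code("print(2 + 2)"))  # → 4
```

Returning the traceback as ordinary text, rather than raising, is the key design choice: it turns a crash into something the LLM can read and respond to.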
Symptoms if you skip this:
- Agent outputs a code block but workflow returns no computed result
- LLM "simulates" execution by hallucinating output instead of running the code
- Data-processing tasks silently produce wrong answers
Solution
Step 1: Install Dependencies
# crewai-tools includes CodeInterpreterTool
# Use uv for fast installs (recommended over pip in 2026)
uv add crewai crewai-tools
# Verify
python -c "from crewai_tools import CodeInterpreterTool; print('OK')"
Expected output:
OK
If it fails:
- ModuleNotFoundError: No module named 'crewai_tools' → You installed crewai only; run uv add crewai-tools separately
- Version conflict → Run uv add crewai==0.80.0 crewai-tools==0.20.0 (latest stable as of March 2026)
Step 2: Create the Code Interpreter Tool
from crewai_tools import CodeInterpreterTool
# Default: runs in a subprocess with a 30s timeout
interpreter = CodeInterpreterTool()
# Custom timeout for long-running data jobs (seconds)
interpreter = CodeInterpreterTool(timeout=120)
The tool accepts a Python code string, executes it in an isolated environment, and returns stdout + any raised exceptions as a string the agent can read.
Step 3: Assign the Tool to an Agent
from crewai import Agent
from crewai_tools import CodeInterpreterTool
interpreter = CodeInterpreterTool()
data_analyst = Agent(
role="Data Analyst",
goal="Execute Python code to analyze datasets and return findings",
backstory=(
"You are a precise data analyst. When given data or a task requiring "
"computation, you write Python code and run it to get exact results. "
"Never guess — always execute."
),
tools=[interpreter],
# allow_code_execution=True is NOT needed — the tool handles it
verbose=True,
)
Key point: Assign the tool at the agent level, not the task level. The agent decides when to call it based on the task description.
Step 4: Define a Task That Triggers Code Execution
from crewai import Task
analysis_task = Task(
description=(
"Calculate the mean, median, and standard deviation for this dataset: "
"[14, 22, 8, 31, 19, 27, 5, 44, 11, 33]. "
"Write Python code to compute these values and return the results."
),
expected_output=(
"A summary with mean, median, and std dev as floats rounded to 2 decimal places."
),
agent=data_analyst,
)
The phrase "write Python code to compute" reliably triggers the agent to use the interpreter tool rather than estimating the answer.
Step 5: Assemble and Run the Crew
from crewai import Crew, Process
crew = Crew(
agents=[data_analyst],
tasks=[analysis_task],
process=Process.sequential,
verbose=True,
)
result = crew.kickoff()
print(result)
Expected output (abbreviated):
> Entering new CrewAgentExecutor chain...
> Using tool: Code Interpreter
> Code:
import statistics
data = [14, 22, 8, 31, 19, 27, 5, 44, 11, 33]
print(f"Mean: {statistics.mean(data):.2f}")
print(f"Median: {statistics.median(data):.2f}")
print(f"Std Dev: {statistics.stdev(data):.2f}")
> Output:
Mean: 21.40
Median: 20.50
Std Dev: 12.34
Final Answer: Mean: 21.40, Median: 20.50, Std Dev: 12.34
If it fails:
- Agent answers without calling the tool → Strengthen the task description: add "you MUST use the code interpreter tool"
- TimeoutError → Increase the timeout: CodeInterpreterTool(timeout=120)
- ImportError inside the subprocess → The sandbox uses your current virtualenv; run uv add <package> in the same env
Step 6: Chain Code Execution Across Multiple Agents
Real workflows involve more than one agent. Here's a pattern where a researcher agent collects data and an analyst agent runs computations on it.
from crewai import Agent, Task, Crew, Process
from crewai_tools import CodeInterpreterTool
interpreter = CodeInterpreterTool()
# Agent 1: Gathers or structures raw data (no code tool needed)
researcher = Agent(
role="Data Collector",
goal="Collect and structure raw numeric data for analysis",
backstory="You gather data and format it clearly for downstream processing.",
verbose=True,
)
# Agent 2: Runs computations on whatever the researcher produces
analyst = Agent(
role="Quantitative Analyst",
goal="Execute Python to derive statistics from provided data",
backstory=(
"You receive structured data and run Python code to compute results. "
"Always use the code interpreter. Never estimate."
),
tools=[interpreter],
verbose=True,
)
collect_task = Task(
description=(
"Generate a list of 10 realistic daily sales figures (integers between "
"100 and 5000) for a SaaS product. Return them as a Python list."
),
expected_output="A Python list of 10 integers, e.g. [412, 3100, ...]",
agent=researcher,
)
compute_task = Task(
description=(
"Using the sales data from the previous task, write and execute Python code to: "
"1) calculate total revenue, 2) find the highest and lowest day, "
"3) compute the 7-day rolling average. Return all values."
),
expected_output=(
"Total revenue, min/max days, and a list of rolling averages — all computed "
"by running Python code, not estimated."
),
agent=analyst,
context=[collect_task], # passes researcher output into this task
)
crew = Crew(
agents=[researcher, analyst],
tasks=[collect_task, compute_task],
process=Process.sequential,
verbose=True,
)
result = crew.kickoff()
print(result)
The context=[collect_task] line is what passes the researcher's output as context to the analyst task — no manual wiring needed.
Verification
Run this end-to-end smoke test to confirm the interpreter executes correctly:
from crewai_tools import CodeInterpreterTool
tool = CodeInterpreterTool()
# Direct tool call — bypasses the agent layer
output = tool.run(code="print(2 ** 10)")
print(output)
You should see:
1024
If you see 1024, the interpreter subprocess is working. If you see an error, check your Python version (python --version should be 3.11+) and that crewai-tools is installed in the same environment.
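To run those checks in one place, a small preflight script covering just the stated requirements (Python 3.11+ and an importable crewai_tools) can fail fast before you start debugging agent behavior:

```python
import importlib.util
import sys

def preflight() -> list[str]:
    """Return a list of environment problems; an empty list means ready."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} < 3.11"
        )
    if importlib.util.find_spec("crewai_tools") is None:
        problems.append("crewai_tools not installed in this environment")
    return problems

if __name__ == "__main__":
    issues = preflight()
    print("OK" if not issues else "\n".join(issues))
```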
Production Considerations
Sandbox isolation. CodeInterpreterTool runs in a subprocess but shares your filesystem and environment variables. For production, wrap execution in Docker or use a dedicated sandbox service (e.g., E2B) to prevent agents from writing to disk or leaking secrets.
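If you need stronger isolation than a bare subprocess, one lightweight sketch (assuming a local Docker daemon; the image name and resource flags are illustrative choices, not a CrewAI feature) is to run each code string in a disposable container:

```python
import subprocess

def build_docker_cmd(code: str, image: str = "python:3.12-slim") -> list[str]:
    """Build a docker run command for a throwaway, locked-down container."""
    return [
        "docker", "run", "--rm",       # auto-remove the container on exit
        "--network", "none",           # block outbound calls and secret exfiltration
        "--read-only",                 # agents can't write to the filesystem
        "--memory", "256m",            # cap resource use
        image, "python", "-c", code,
    ]

def docker_run_code(code: str, timeout: int = 60) -> str:
    """Execute code in the container and return combined output.
    Requires Docker to be installed and running."""
    result = subprocess.run(
        build_docker_cmd(code), capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr
```

The same locked-down execution function can then be wrapped in a custom CrewAI tool in place of the default interpreter.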
Token cost. Each code execution round-trip sends the generated code + output back through the LLM. On GPT-4o, a 50-line code block + 20-line output adds ~300 tokens per tool call. Budget accordingly for agents that iterate on failures.
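For budgeting, a rough back-of-envelope helper (using the common but model-dependent heuristic of ~4 characters per token, not an exact tokenizer) looks like:

```python
def estimate_tool_call_tokens(code: str, output: str,
                              chars_per_token: float = 4.0) -> int:
    """Approximate token overhead for one interpreter round-trip:
    the generated code plus its output, both sent back through the LLM."""
    return int((len(code) + len(output)) / chars_per_token)

# Illustrative payloads, not real agent output
code = "import statistics\n" * 10
output = "Mean: 21.40\n" * 5
print(estimate_tool_call_tokens(code, output))
```

Multiply the per-call estimate by max_iter to bound worst-case cost for an agent that retries on failure.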
Error loops. If the generated code raises an exception, the agent sees the traceback and retries. Set max_iter on the agent to cap retries:
analyst = Agent(
...
max_iter=3, # stop after 3 failed attempts instead of looping forever
)
Stateless execution. Each tool.run() call starts a fresh subprocess. Variables don't persist between calls. If your agent needs to build state across multiple code calls, pass the full data in each call or serialize intermediate results to a string.
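One way to work around this (a hand-rolled sketch, not a CrewAI API) is to serialize each call's results and prepend them to the next code string, so every subprocess reconstructs the state it needs:

```python
import json

def with_state(prior_state: dict, code: str) -> str:
    """Embed serialized results from a previous call into the next
    code string, since each subprocess starts fresh."""
    # Double-encode: inner dumps makes JSON, outer dumps makes a Python string literal
    preamble = (
        "import json\n"
        f"state = json.loads({json.dumps(json.dumps(prior_state))})\n"
    )
    return preamble + code

# Call 1 computed totals; call 2 reuses them without re-deriving
state = {"total_revenue": 18200, "days": 10}
next_code = with_state(state, "print(state['total_revenue'] / state['days'])")
```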
What You Learned
- CodeInterpreterTool from crewai-tools adds sandboxed Python execution to any agent
- Task descriptions must explicitly instruct the agent to use the tool — vague prompts cause the LLM to hallucinate results
- context=[previous_task] chains outputs between agents without manual string passing
- Production use requires additional sandbox isolation — the default subprocess shares your environment
When NOT to use this: If you only need the agent to generate code for a human to run later, skip the interpreter. Use it only when computed results need to feed back into the workflow automatically.
Tested on CrewAI 0.80.0, crewai-tools 0.20.0, Python 3.12, macOS Sequoia & Ubuntu 24.04