Problem: Running LLM-Generated Code Safely Without Extra Infrastructure
You want your AI app to generate Python, run it, and return real results — not just code blocks the user has to copy-paste. Setting up a container, an execution engine, or a code interpreter service adds complexity and cost.
Gemini 2.0 has code execution built directly into the API as a first-party tool. You enable it with one flag, and the model handles generating, running, and returning output — all inside Google's sandboxed environment.
You'll learn:
- How to enable code execution on gemini-2.0-flash and gemini-2.0-pro
- How to parse execution results and display generated code + output
- Real-world patterns: data analysis, math, file-free computation
Time: 20 min | Difficulty: Intermediate
Why This Matters in 2026
Most "AI coding" tools stop at generation. Code execution closes the loop: the model writes code, runs it, reads the output, and can iterate — all within a single API call. This is the foundation of reliable data analysis agents, auto-graders, and calculation-heavy assistants.
Gemini 2.0's sandbox runs CPython with a standard scientific stack (NumPy, Pandas, Matplotlib, SciPy). It's stateless per request and has no network access — which is exactly what you want for untrusted execution.
How Code Execution Works
When you pass tools=[{"code_execution": {}}], the model can emit a special executable_code part mid-response. Google's infrastructure runs it, captures stdout and any error, then returns a code_execution_result part. The model reads that result and continues generating its final answer.
Your prompt
│
▼
Gemini 2.0 ──generates──▶ executable_code block
│
Google Sandbox (CPython)
│
code_execution_result
│
Gemini reads result ──▶ Final text response
The round-trip is transparent to you — the full content array shows every step.
Solution
Step 1: Install the SDK
pip install google-generativeai
Verify:
python -c "import google.generativeai as genai; print(genai.__version__)"
Expected: 0.8.x or later.
Step 2: Configure the Client and Enable Code Execution
import google.generativeai as genai
import os
# Set your API key — get one at aistudio.google.com
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel(
model_name="gemini-2.0-flash", # or gemini-2.0-pro
tools=[{"code_execution": {}}], # This is all you need
)
The code_execution tool requires no parameters. Google manages the sandbox entirely.
Step 3: Send a Request and Inspect the Full Response
response = model.generate_content(
"Calculate the first 20 Fibonacci numbers and find which ones are prime."
)
# Iterate over all parts — the response contains code, output, AND text
for part in response.candidates[0].content.parts:
if part.executable_code:
print("=== Generated Code ===")
print(part.executable_code.code)
elif part.code_execution_result:
print("=== Execution Output ===")
print(part.code_execution_result.output)
elif part.text:
print("=== Model Response ===")
print(part.text)
Expected output structure:
=== Generated Code ===
def is_prime(n):
if n < 2: return False
...
=== Execution Output ===
Fibonacci primes up to F(20): [2, 3, 5, 13, 89, 233, 1597]
=== Model Response ===
Among the first 20 Fibonacci numbers, 7 are prime: 2, 3, 5, 13, 89, 233, and 1597...
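Steps 3, 5, and 6 all repeat the same three-way dispatch, so it can be worth factoring into a helper. The sketch below is an assumption, not part of the SDK: render_parts is a hypothetical name, and the demo uses types.SimpleNamespace stand-ins shaped like real SDK parts (the attribute names executable_code, code_execution_result, and text match the response structure shown above).

```python
from types import SimpleNamespace

def render_parts(parts):
    """Turn a list of response parts into labeled transcript sections.

    Dispatches on the three part types a code-execution response can
    contain: executable_code, code_execution_result, and plain text.
    """
    sections = []
    for part in parts:
        if getattr(part, "executable_code", None):
            sections.append(("Generated Code", part.executable_code.code))
        elif getattr(part, "code_execution_result", None):
            sections.append(("Execution Output", part.code_execution_result.output))
        elif getattr(part, "text", None):
            sections.append(("Model Response", part.text))
    return sections

# Stand-in parts mimicking the shape of a real response
fake_parts = [
    SimpleNamespace(executable_code=SimpleNamespace(code="print(2 + 2)"),
                    code_execution_result=None, text=None),
    SimpleNamespace(executable_code=None,
                    code_execution_result=SimpleNamespace(output="4\n"), text=None),
    SimpleNamespace(executable_code=None, code_execution_result=None,
                    text="The answer is 4."),
]

for label, body in render_parts(fake_parts):
    print(f"=== {label} ===\n{body.strip()}")
```

With a real response you would call render_parts(response.candidates[0].content.parts) instead of the fakes.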
Step 4: Handle Execution Errors Gracefully
The sandbox returns an outcome code you should always check:
import google.generativeai as genai

# The outcome enum lives on the CodeExecutionResult proto
Outcome = genai.protos.CodeExecutionResult.Outcome

for part in response.candidates[0].content.parts:
    if part.code_execution_result:
        result = part.code_execution_result
        # outcome: OUTCOME_OK, OUTCOME_FAILED, OUTCOME_DEADLINE_EXCEEDED
        if result.outcome == Outcome.OUTCOME_FAILED:
            print(f"Execution failed:\n{result.output}")
        elif result.outcome == Outcome.OUTCOME_DEADLINE_EXCEEDED:
            print("Code timed out — sandbox has a 30s limit per execution")
        else:
            print(f"Output:\n{result.output}")
Common failure reasons:
- Import errors → The sandbox has NumPy, Pandas, Matplotlib, SciPy, but NOT requests, Flask, or third-party packages
- Timeout → Sandbox cap is 30 seconds; avoid training loops or infinite recursion
- Memory → No hard number is published, but treat it like a 512MB budget
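One pragmatic response to OUTCOME_FAILED is to hand the traceback back to the model for another attempt. The sketch below is a hypothetical pattern, not an SDK feature: run_with_retry, send, and fake_send are made-up names, and send stands in for a real generate_content call plus the outcome parsing from the step above.

```python
def run_with_retry(prompt, send, max_attempts=2):
    """Call `send`; on OUTCOME_FAILED, retry with the traceback appended.

    `send` is any callable prompt -> (outcome_name, output).
    """
    for attempt in range(max_attempts):
        outcome, output = send(prompt)
        if outcome != "OUTCOME_FAILED":
            return output
        # Feed the failure back so the model can correct its own code
        prompt = (f"{prompt}\n\nYour previous code failed with:\n{output}\n"
                  "Fix the error and try again.")
    return output

# Fake `send` that fails once, then succeeds — stands in for a real
# API round-trip during this offline demo.
calls = []
def fake_send(prompt):
    calls.append(prompt)
    if len(calls) == 1:
        return ("OUTCOME_FAILED", "ModuleNotFoundError: No module named 'requests'")
    return ("OUTCOME_OK", "done")

print(run_with_retry("Sum the numbers in this list.", fake_send))
```

In production, send would wrap model.generate_content and extract the outcome and output from the code_execution_result part.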
Step 5: Real-World Pattern — Data Analysis with Inline Data
Pass data as a string in your prompt. The model will parse it, write the analysis code, run it, and return interpreted results.
csv_data = """date,revenue,units
2026-01-01,12400,310
2026-01-02,9800,245
2026-01-03,15200,380
2026-01-04,11100,278
2026-01-05,18900,472"""
response = model.generate_content(
f"Analyze this sales data and calculate day-over-day revenue growth rates:\n\n{csv_data}"
)
for part in response.candidates[0].content.parts:
if part.executable_code:
print(part.executable_code.code)
elif part.code_execution_result:
print(part.code_execution_result.output)
elif part.text:
print(part.text)
The model will import pandas, parse the CSV string, compute the growth rates, and return the calculation — not just a code block.
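For intuition, the computation the model performs for this prompt boils down to the following. This local sketch uses only the standard library so you can run it without the API; the model's own generated code will usually use pandas and differ in detail.

```python
import csv
import io

csv_data = """date,revenue,units
2026-01-01,12400,310
2026-01-02,9800,245
2026-01-03,15200,380
2026-01-04,11100,278
2026-01-05,18900,472"""

# Parse the inline CSV, then compute day-over-day revenue growth
rows = list(csv.DictReader(io.StringIO(csv_data)))
growth = []
for prev, cur in zip(rows, rows[1:]):
    prev_rev, cur_rev = int(prev["revenue"]), int(cur["revenue"])
    pct = (cur_rev - prev_rev) / prev_rev * 100
    growth.append((cur["date"], round(pct, 1)))
    print(f"{cur['date']}: {pct:+.1f}%")
```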
Step 6: Multi-Turn Conversation with Code Execution
Code execution works in chat sessions too. The model can iterate: run code, see the output, and refine.
chat = model.start_chat()
# Turn 1 — initial computation
r1 = chat.send_message("Generate 1000 random numbers from a normal distribution and calculate their mean and std.")
# Turn 2 — follow-up, model has context of previous code + output
r2 = chat.send_message("Now plot a histogram of those same numbers using matplotlib and describe the shape.")
for part in r2.candidates[0].content.parts:
if part.executable_code:
print("Code:", part.executable_code.code[:200], "...")
elif part.text:
print("Analysis:", part.text)
Note: each turn's sandbox is independent — variables don't persist between send_message calls. The model uses its context to re-generate setup code when needed.
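Because variables don't survive between turns, a common workaround is to splice the previous sandbox output back into the next prompt explicitly rather than relying on chat context alone. The helper below is a hypothetical sketch (follow_up_prompt is not an SDK function); it only does string assembly, so the real send_message calls are omitted.

```python
def follow_up_prompt(question: str, prior_output: str) -> str:
    """Build a follow-up prompt that re-supplies the last sandbox output,
    since each turn's sandbox starts fresh with no shared variables."""
    return (
        "Earlier you ran code that produced this output:\n"
        f"{prior_output.strip()}\n\n"
        "Using those values as your starting data, do the following:\n"
        f"{question}"
    )

prompt = follow_up_prompt(
    "Plot a histogram of the numbers and describe the shape.",
    "mean=0.012, std=0.998\n",
)
print(prompt)
```

You would pass the assembled prompt to chat.send_message, extracting prior_output from the previous turn's code_execution_result part.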
Verification
Run this end-to-end check:
import google.generativeai as genai
import os
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash", tools=[{"code_execution": {}}])
response = model.generate_content("What is the sum of all prime numbers below 100?")
found_code = False
found_output = False
for part in response.candidates[0].content.parts:
if part.executable_code:
found_code = True
if part.code_execution_result:
found_output = True
print("Sandbox output:", part.code_execution_result.output.strip())
assert found_code and found_output, "Code execution did not trigger"
print("✅ Code execution working correctly")
You should see:
Sandbox output: 1060
✅ Code execution working correctly
What You Learned
- Enable the sandbox with one line: tools=[{"code_execution": {}}]
- Responses contain three part types: executable_code, code_execution_result, and text — always iterate all parts
- The sandbox has NumPy, Pandas, Matplotlib, and SciPy but no network and no third-party installs
- Variables don't persist between turns; the model rewrites setup code using chat context
- Always check result.outcome — silent failures return OUTCOME_FAILED with the traceback in output
When NOT to use this: If your use case needs persistent state, file I/O, or custom packages, you need a proper execution environment (e.g., E2B, Modal, or your own container). Code execution is ideal for stateless computation, data analysis, and math — not for building artifacts or running long jobs.
Tested on google-generativeai 0.8.3, gemini-2.0-flash and gemini-2.0-pro, Python 3.12, macOS and Ubuntu 24.04