Monitor Production Systems with eBPF + AI in 20 Minutes

Use AI-generated eBPF scripts to trace kernel events, debug performance issues, and monitor production systems with minimal overhead.

Problem: You Need Deep System Visibility Without Breaking Production

Your app is slow in production, but traditional monitoring shows nothing. You need kernel-level tracing without adding overhead or deploying heavy agents.

You'll learn:

  • How to generate eBPF scripts using AI (Claude/GPT-4)
  • Safe ways to trace production systems
  • Real-world debugging scenarios with working code

Time: 20 min | Level: Intermediate


Why eBPF + AI Works

eBPF runs in the kernel with near-zero overhead (typically under 1% CPU). But writing eBPF scripts by hand is hard; AI can generate them from plain-English descriptions.

Common use cases:

  • Trace slow database queries without app instrumentation
  • Find which syscalls cause latency spikes
  • Monitor network packets in real-time
  • Debug file I/O bottlenecks

Requirements: Linux kernel 4.18+ (5.x+ recommended), root access
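The kernel requirement can be checked up front with a small shell sketch (the kver_ok helper name is my own, not a standard tool; version parts are compared numerically, since a plain string compare would rank "4.9" above "4.18"):

```shell
# Check that the running kernel meets the 4.18+ requirement.
kver_ok() {
    # $1 is a version string such as "5.15.0-91-generic"
    major=$(echo "$1" | cut -d. -f1)
    minor=$(echo "$1" | cut -d. -f2 | cut -d- -f1)
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 18 ]; }
}

if kver_ok "$(uname -r)"; then
    echo "kernel OK for eBPF"
else
    echo "kernel too old: need 4.18+" >&2
fi
```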


Solution

Step 1: Install BPFtrace

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y bpftrace linux-headers-$(uname -r)

# Verify installation
sudo bpftrace --version

Expected: v0.20.0 or newer

If it fails:

  • "Package not found": Enable universe repository: sudo add-apt-repository universe
  • Kernel too old: Check uname -r - must be 4.18+

Step 2: Generate Your First Script with AI

Prompt for Claude/GPT-4:

Write a bpftrace script that:
1. Traces all file opens in /var/log
2. Shows the process name and PID
3. Prints the filename being opened
4. Runs for 10 seconds then exits

AI Output (validated):

#!/usr/bin/env bpftrace

// This traces openat syscalls targeting /var/log
// Safe for production - read-only, no modifications

BEGIN {
    printf("Tracing file opens in /var/log... Hit Ctrl-C or wait 10s\n");
}

tracepoint:syscalls:sys_enter_openat /strncmp(str(args->filename), "/var/log", 8) == 0/ {
    // Only match paths starting with /var/log
    printf("%-16s %-6d %s\n", comm, pid, str(args->filename));
}

interval:s:10 {
    exit();
}

END {
    printf("\nTracing complete.\n");
}

Why this works: The path-prefix filter keeps output to a manageable volume, and the interval probe auto-exits so a forgotten script doesn't run indefinitely.


Step 3: Run It Safely

# Save script
cat > trace_logs.bt << 'EOF'
[paste AI-generated script]
EOF

# Run with timeout (safety measure)
sudo timeout 10s bpftrace trace_logs.bt

You should see:

Tracing file opens in /var/log... Hit Ctrl-C or wait 10s
systemd-journal  1234   /var/log/syslog
nginx            5678   /var/log/nginx/access.log
dockerd          9012   /var/log/docker.log

Tracing complete.
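Captured output can be post-processed with standard tools. As a sketch, a summarize_opens helper (my own name, not part of bpftrace) that counts opens per process from a saved trace:

```shell
# Count file opens per process from a saved trace.
# Expects lines of "comm pid path" as printed by the script above;
# NF == 3 skips the banner and blank lines.
summarize_opens() {
    awk 'NF == 3 { print $1 }' "$1" | sort | uniq -c | sort -rn
}

# Usage: capture first, then summarize
#   sudo timeout 10s bpftrace trace_logs.bt > trace.out
#   summarize_opens trace.out
```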

Step 4: Advanced Example - Find Slow Database Queries

AI Prompt:

Write a bpftrace script that:
- Traces PostgreSQL query execution time
- Only shows queries taking >100ms
- Includes the query text and duration
- Groups by query pattern

Generated Script:

#!/usr/bin/env bpftrace

// Traces PostgreSQL query latency (requires postgres process)
// Uses USDT probes if available, fallback to uprobe

BEGIN {
    printf("Tracing PostgreSQL queries >100ms...\n");
    printf("%-20s %-10s %s\n", "DURATION(ms)", "PID", "QUERY");
}

usdt:/usr/lib/postgresql/*/bin/postgres:postgresql:query__start {
    @start[tid] = nsecs;
    @query[tid] = str(arg0);
}

usdt:/usr/lib/postgresql/*/bin/postgres:postgresql:query__done /@start[tid]/ {
    $duration_ms = (nsecs - @start[tid]) / 1000000;
    
    // Only show slow queries
    if ($duration_ms > 100) {
        printf("%-20d %-10d %s\n", 
               $duration_ms, 
               pid, 
               @query[tid]);
    }
    
    delete(@start[tid]);
    delete(@query[tid]);
}

END {
    clear(@start);
    clear(@query);
}

Run it:

sudo bpftrace postgres_slow.bt

If USDT probes aren't available:

# Check if postgres has probes
readelf -n /usr/lib/postgresql/*/bin/postgres | grep -A 10 "NT_STAPSDT"

# If empty, use this simpler version instead

Fallback (no USDT required):

#!/usr/bin/env bpftrace

// Traces postgres exec_simple_query function
// Works on postgres 12-16 without USDT

uprobe:/usr/lib/postgresql/*/bin/postgres:exec_simple_query {
    @start[tid] = nsecs;
}

uretprobe:/usr/lib/postgresql/*/bin/postgres:exec_simple_query /@start[tid]/ {
    $duration_ms = (nsecs - @start[tid]) / 1000000;
    
    if ($duration_ms > 100) {
        printf("%dms - PID %d\n", $duration_ms, pid);
    }
    
    delete(@start[tid]);
}

Step 5: Production-Safe Patterns

Always include these safety measures:

// 1. Auto-exit after time limit
interval:s:30 { exit(); }

// 2. Rate limiting (exit if >100 events arrive in any one second)
BEGIN { @count = 0; }
interval:s:1 { @count = 0; }
tracepoint:... {
    @count++;
    if (@count > 100) { exit(); }
}

// 3. Filter to specific PIDs
tracepoint:... /pid == $1/ { ... }

// 4. Aggregate instead of printing every event
tracepoint:... {
    @calls[comm] = count();
}
END {
    print(@calls);
    clear(@calls);
}

Run with PID filter:

# Find your app's PID first
pgrep myapp

# Trace only that process: pass the PID as $1 for the filter above
sudo bpftrace script.bt $(pgrep myapp)

# For USDT/uprobe scripts, attach to the process directly instead
sudo bpftrace -p $(pgrep myapp) script.bt

Verification

Test your eBPF script is safe and working:

# 1. Syntax check
sudo bpftrace -d script.bt

# 2. Check overhead (should be <1% CPU)
top -p $(pgrep bpftrace)

# 3. Verify output makes sense
sudo bpftrace script.bt | head -20

You should see: clean output, low CPU usage, and an automatic exit after the timeout
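top gives only a coarse reading; as a sketch, CPU share can also be sampled directly from /proc (the cpu_pct helper name is my own):

```shell
# Approximate CPU% of a PID by sampling utime+stime over one second.
cpu_pct() {
    pid=$1
    hz=$(getconf CLK_TCK)
    # Fields 14 and 15 of /proc/PID/stat are utime and stime (clock ticks)
    t1=$(awk '{ print $14 + $15 }' "/proc/$pid/stat")
    sleep 1
    t2=$(awk '{ print $14 + $15 }' "/proc/$pid/stat")
    echo $(( (t2 - t1) * 100 / hz ))
}

# Example: check the newest running bpftrace session
# cpu_pct "$(pgrep -n bpftrace)"
```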


Real-World AI Prompts That Work

Network Latency

Write a bpftrace script that measures TCP connection latency 
from SYN to ACK for connections to port 443. Show the 
destination IP and time in milliseconds.

Memory Allocations

Create a bpftrace script that tracks memory allocations >1MB 
by process name. Print total memory allocated per process 
when the script exits.

File I/O Patterns

Generate a bpftrace script that shows which files are being 
read vs written, grouped by process. Include read/write byte 
counts.

AI Tips:

  • Specify kernel version if needed: "for Linux 5.15"
  • Request safety features: "with auto-exit after 60 seconds"
  • Ask for filters: "only trace processes owned by user nginx"

What You Learned

  • eBPF provides kernel-level visibility with <1% overhead
  • AI can generate production-ready scripts from plain English
  • Always include timeouts and rate limits for safety
  • USDT probes give better data but aren't always available

Limitations:

  • Requires root access
  • eBPF programs have verifier-enforced size limits (4,096 instructions before kernel 5.2, about 1 million since)
  • Some kernel functions aren't safely traceable
  • AI-generated scripts need validation before production

Debugging Tips

Script won't attach:

# Check if probe exists
sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*'

# List all available probes
sudo bpftrace -l | grep openat

No output:

# Add debug prints
BEGIN { printf("Script started\n"); }

# Check if filter is too restrictive
# Remove filters one by one to isolate issue

"Permission denied":

# Need root or CAP_BPF + CAP_PERFMON (kernel 5.8+)
sudo setcap cap_bpf,cap_perfmon+ep $(which bpftrace)

Production Checklist

  • Script auto-exits after <60 seconds
  • Filters limit scope (specific PID/path/port)
  • Output is rate-limited or aggregated
  • Tested on staging first
  • Someone is monitoring the monitoring (meta!)
  • Documented why this trace is running
  • Cleanup script exists (sudo pkill bpftrace)

Tested on Ubuntu 22.04 LTS (kernel 5.15), bpftrace 0.20.4, Claude Sonnet 4

Resources: