Monitor Production Systems with eBPF + AI in 20 Minutes

Use AI-generated eBPF scripts to trace kernel events, debug performance issues, and monitor production systems with minimal overhead.

Problem: You Need Deep System Visibility Without Breaking Production

Your app is slow in production, but traditional monitoring shows nothing. You need kernel-level tracing without adding overhead or deploying heavy agents.

You'll learn:

  • How to generate eBPF scripts using AI (Claude/GPT-4)
  • Safe ways to trace production systems
  • Real-world debugging scenarios with working code

Time: 20 min | Level: Intermediate


Why eBPF + AI Works

eBPF runs in the kernel with near-zero overhead (typically under 1% CPU). But writing eBPF scripts by hand is hard; AI can generate them from plain-English descriptions.

Common use cases:

  • Trace slow database queries without app instrumentation
  • Find which syscalls cause latency spikes
  • Monitor network packets in real-time
  • Debug file I/O bottlenecks

Requirements: Linux kernel 4.18+ (5.x+ recommended), root access
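The kernel requirement can be checked up front with a small shell sketch (the kver_ok helper name is my own, not a standard tool; version parts are compared numerically, since a plain string compare would rank "4.9" above "4.18"):

```shell
# Check that the running kernel meets the 4.18+ requirement.
kver_ok() {
    # $1 is a version string such as "5.15.0-91-generic"
    major=$(echo "$1" | cut -d. -f1)
    minor=$(echo "$1" | cut -d. -f2 | cut -d- -f1)
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 18 ]; }
}

if kver_ok "$(uname -r)"; then
    echo "kernel OK for eBPF"
else
    echo "kernel too old: need 4.18+" >&2
fi
```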


Solution

Step 1: Install BPFtrace

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y bpftrace linux-headers-$(uname -r)

# Verify installation
sudo bpftrace --version

Expected: v0.20.0 or newer

If it fails:

  • "Package not found": Enable universe repository: sudo add-apt-repository universe
  • Kernel too old: Check uname -r - must be 4.18+

Step 2: Generate Your First Script with AI

Prompt for Claude/GPT-4:

Write a bpftrace script that:
1. Traces all file opens in /var/log
2. Shows the process name and PID
3. Prints the filename being opened
4. Runs for 10 seconds then exits

AI Output (validated):

#!/usr/bin/env bpftrace

// This traces openat syscalls targeting /var/log
// Safe for production - read-only, no modifications

BEGIN {
    printf("Tracing file opens in /var/log... Hit Ctrl-C or wait 10s\n");
}

tracepoint:syscalls:sys_enter_openat /strncmp(str(args->filename), "/var/log", 8) == 0/ {
    // Only match paths starting with /var/log
    printf("%-16s %-6d %s\n", comm, pid, str(args->filename));
}

interval:s:10 {
    exit();
}

END {
    printf("\nTracing complete.\n");
}

Why this works: The path-prefix filter keeps output to a manageable volume, and the interval probe auto-exits so a forgotten script doesn't run indefinitely.


Step 3: Run It Safely

# Save script
cat > trace_logs.bt << 'EOF'
[paste AI-generated script]
EOF

# Run with timeout (safety measure)
sudo timeout 10s bpftrace trace_logs.bt

You should see:

Tracing file opens in /var/log... Hit Ctrl-C or wait 10s
systemd-journal  1234   /var/log/syslog
nginx            5678   /var/log/nginx/access.log
dockerd          9012   /var/log/docker.log

Tracing complete.
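Captured output can be post-processed with standard tools. As a sketch, a summarize_opens helper (my own name, not part of bpftrace) that counts opens per process from a saved trace:

```shell
# Count file opens per process from a saved trace.
# Expects lines of "comm pid path" as printed by the script above;
# NF == 3 skips the banner and blank lines.
summarize_opens() {
    awk 'NF == 3 { print $1 }' "$1" | sort | uniq -c | sort -rn
}

# Usage: capture first, then summarize
#   sudo timeout 10s bpftrace trace_logs.bt > trace.out
#   summarize_opens trace.out
```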

Step 4: Advanced Example - Find Slow Database Queries

AI Prompt:

Write a bpftrace script that:
- Traces PostgreSQL query execution time
- Only shows queries taking >100ms
- Includes the query text and duration
- Groups by query pattern

Generated Script:

#!/usr/bin/env bpftrace

// Traces PostgreSQL query latency (requires postgres process)
// Uses USDT probes if available, fallback to uprobe

BEGIN {
    printf("Tracing PostgreSQL queries >100ms...\n");
    printf("%-20s %-10s %s\n", "DURATION(ms)", "PID", "QUERY");
}

usdt:/usr/lib/postgresql/*/bin/postgres:postgresql:query__start {
    @start[tid] = nsecs;
    @query[tid] = str(arg0);
}

usdt:/usr/lib/postgresql/*/bin/postgres:postgresql:query__done /@start[tid]/ {
    $duration_ms = (nsecs - @start[tid]) / 1000000;
    
    // Only show slow queries
    if ($duration_ms > 100) {
        printf("%-20d %-10d %s\n", 
               $duration_ms, 
               pid, 
               @query[tid]);
    }
    
    delete(@start[tid]);
    delete(@query[tid]);
}

END {
    clear(@start);
    clear(@query);
}

Run it:

sudo bpftrace postgres_slow.bt

If USDT probes aren't available:

# Check if postgres has probes
readelf -n /usr/lib/postgresql/*/bin/postgres | grep -A 10 "NT_STAPSDT"

# If empty, use this simpler version instead

Fallback (no USDT required):

#!/usr/bin/env bpftrace

// Traces postgres exec_simple_query function
// Works on postgres 12-16 without USDT

uprobe:/usr/lib/postgresql/*/bin/postgres:exec_simple_query {
    @start[tid] = nsecs;
}

uretprobe:/usr/lib/postgresql/*/bin/postgres:exec_simple_query /@start[tid]/ {
    $duration_ms = (nsecs - @start[tid]) / 1000000;
    
    if ($duration_ms > 100) {
        printf("%dms - PID %d\n", $duration_ms, pid);
    }
    
    delete(@start[tid]);
}

Step 5: Production-Safe Patterns

Always include these safety measures:

// 1. Auto-exit after time limit
interval:s:30 { exit(); }

// 2. Rate limiting (exit if >100 events arrive in any one second)
BEGIN { @count = 0; }
interval:s:1 { @count = 0; }
tracepoint:... {
    @count++;
    if (@count > 100) { exit(); }
}

// 3. Filter to specific PIDs
tracepoint:... /pid == $1/ { ... }

// 4. Aggregate instead of printing every event
tracepoint:... {
    @calls[comm] = count();
}
END {
    print(@calls);
    clear(@calls);
}

Run with PID filter:

# Find your app's PID first
pgrep myapp

# Trace only that process: pass the PID as $1 for the filter above
sudo bpftrace script.bt $(pgrep myapp)

# For USDT/uprobe scripts, attach to the process directly instead
sudo bpftrace -p $(pgrep myapp) script.bt

Verification

Test your eBPF script is safe and working:

# 1. Syntax check
sudo bpftrace -d script.bt

# 2. Check overhead (should be <1% CPU)
top -p $(pgrep bpftrace)

# 3. Verify output makes sense
sudo bpftrace script.bt | head -20

You should see: clean output, low CPU usage, and an automatic exit after the timeout
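top gives only a coarse reading; as a sketch, CPU share can also be sampled directly from /proc (the cpu_pct helper name is my own):

```shell
# Approximate CPU% of a PID by sampling utime+stime over one second.
cpu_pct() {
    pid=$1
    hz=$(getconf CLK_TCK)
    # Fields 14 and 15 of /proc/PID/stat are utime and stime (clock ticks)
    t1=$(awk '{ print $14 + $15 }' "/proc/$pid/stat")
    sleep 1
    t2=$(awk '{ print $14 + $15 }' "/proc/$pid/stat")
    echo $(( (t2 - t1) * 100 / hz ))
}

# Example: check the newest running bpftrace session
# cpu_pct "$(pgrep -n bpftrace)"
```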


Real-World AI Prompts That Work

Network Latency

Write a bpftrace script that measures TCP connection latency 
from SYN to ACK for connections to port 443. Show the 
destination IP and time in milliseconds.

Memory Allocations

Create a bpftrace script that tracks memory allocations >1MB 
by process name. Print total memory allocated per process 
when the script exits.

File I/O Patterns

Generate a bpftrace script that shows which files are being 
read vs written, grouped by process. Include read/write byte 
counts.

AI Tips:

  • Specify kernel version if needed: "for Linux 5.15"
  • Request safety features: "with auto-exit after 60 seconds"
  • Ask for filters: "only trace processes owned by user nginx"

What You Learned

  • eBPF provides kernel-level visibility with <1% overhead
  • AI can generate production-ready scripts from plain English
  • Always include timeouts and rate limits for safety
  • USDT probes give better data but aren't always available

Limitations:

  • Requires root access
  • eBPF programs have verifier-enforced size limits (4,096 instructions before kernel 5.2, about 1 million since)
  • Some kernel functions aren't safely traceable
  • AI-generated scripts need validation before production

Debugging Tips

Script won't attach:

# Check if probe exists
sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*'

# List all available probes
sudo bpftrace -l | grep openat

No output:

# Add debug prints
BEGIN { printf("Script started\n"); }

# Check if filter is too restrictive
# Remove filters one by one to isolate issue

"Permission denied":

# Need root or CAP_BPF + CAP_PERFMON (kernel 5.8+)
sudo setcap cap_bpf,cap_perfmon+ep $(which bpftrace)

Production Checklist

  • Script auto-exits after <60 seconds
  • Filters limit scope (specific PID/path/port)
  • Output is rate-limited or aggregated
  • Tested on staging first
  • Someone is monitoring the monitoring (meta!)
  • Documented why this trace is running
  • Cleanup script exists (sudo pkill bpftrace)

Tested on Ubuntu 22.04 LTS (kernel 5.15), bpftrace 0.20.4, Claude Sonnet 4

Resources: