Problem: You Need Deep System Visibility Without Breaking Production
Your app is slow in production, but traditional monitoring shows nothing. You need kernel-level tracing without adding overhead or deploying heavy agents.
You'll learn:
- How to generate eBPF scripts using AI (Claude/GPT-4)
- Safe ways to trace production systems
- Real-world debugging scenarios with working code
Time: 20 min | Level: Intermediate
Why eBPF + AI Works
eBPF runs in the kernel with near-zero overhead (<1% CPU). But writing eBPF is hard—AI can generate scripts from plain English descriptions.
Common use cases:
- Trace slow database queries without app instrumentation
- Find which syscalls cause latency spikes
- Monitor network packets in real-time
- Debug file I/O bottlenecks
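Once bpftrace is installed (Step 1 below), a classic first script gives a feel for the syntax before involving AI at all: it counts syscalls per process in a kernel-side map and exits after five seconds.

```bpftrace
// Count syscalls per process; print the map and exit after 5 seconds.
tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }
interval:s:5 { exit(); }
```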
Requirements: Linux kernel 4.18+ (5.x+ recommended), root access
Solution
Step 1: Install BPFtrace
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y bpftrace linux-headers-$(uname -r)
# Verify installation
sudo bpftrace --version
Expected: v0.20.0 or newer
If it fails:
- "Package not found": enable the universe repository with sudo add-apt-repository universe
- Kernel too old: check uname -r; the kernel must be 4.18+
Step 2: Generate Your First Script with AI
Prompt for Claude/GPT-4:
Write a bpftrace script that:
1. Traces all file opens in /var/log
2. Shows the process name and PID
3. Prints the filename being opened
4. Runs for 10 seconds then exits
AI Output (validated):
#!/usr/bin/env bpftrace
// This traces openat syscalls targeting /var/log
// Safe for production - read-only, no modifications
BEGIN {
printf("Tracing file opens in /var/log... Hit Ctrl-C or wait 10s\n");
}
tracepoint:syscalls:sys_enter_openat /strncmp(str(args->filename), "/var/log", 8) == 0/ {
// Only match paths starting with /var/log (bpftrace has no regex operator)
printf("%-16s %-6d %s\n", comm, pid, str(args->filename));
}
interval:s:10 {
exit();
}
END {
printf("\nTracing complete.\n");
}
Why this works: the path predicate filters events in the kernel, keeping output manageable. The interval probe auto-exits, so a forgotten script can't run forever.
Step 3: Run It Safely
# Save script
cat > trace_logs.bt << 'EOF'
[paste AI-generated script]
EOF
# Run with timeout (safety measure)
sudo timeout 10s bpftrace trace_logs.bt
You should see:
Tracing file opens in /var/log... Hit Ctrl-C or wait 10s
systemd-journal 1234 /var/log/syslog
nginx 5678 /var/log/nginx/access.log
dockerd 9012 /var/log/docker.log
Tracing complete.
Step 4: Advanced Example - Find Slow Database Queries
AI Prompt:
Write a bpftrace script that:
- Traces PostgreSQL query execution time
- Only shows queries taking >100ms
- Includes the query text and duration
- Groups by query pattern
Generated Script:
#!/usr/bin/env bpftrace
// Traces PostgreSQL query latency (requires postgres process)
// Uses USDT probes; a uprobe fallback is shown below if they're unavailable
BEGIN {
printf("Tracing PostgreSQL queries >100ms...\n");
printf("%-20s %-10s %s\n", "DURATION", "PID", "QUERY");
}
usdt:/usr/lib/postgresql/*/bin/postgres:postgresql:query__start {
@start[tid] = nsecs;
@query[tid] = str(arg0);
}
usdt:/usr/lib/postgresql/*/bin/postgres:postgresql:query__done /@start[tid]/ {
$duration_ms = (nsecs - @start[tid]) / 1000000;
// Only show slow queries
if ($duration_ms > 100) {
printf("%-20d %-10d %s\n",
$duration_ms,
pid,
@query[tid]);
}
delete(@start[tid]);
delete(@query[tid]);
}
END {
clear(@start);
clear(@query);
}
Run it:
sudo bpftrace postgres_slow.bt
If USDT probes aren't available:
# Check if postgres has probes
readelf -n /usr/lib/postgresql/*/bin/postgres | grep -A 10 "NT_STAPSDT"
# If empty, use this simpler version instead
Fallback (no USDT required):
#!/usr/bin/env bpftrace
// Traces postgres exec_simple_query (simple query protocol only;
// extended-protocol queries from most drivers won't appear)
// Works on postgres 12-16 without USDT
uprobe:/usr/lib/postgresql/*/bin/postgres:exec_simple_query {
@start[tid] = nsecs;
}
uretprobe:/usr/lib/postgresql/*/bin/postgres:exec_simple_query /@start[tid]/ {
$duration_ms = (nsecs - @start[tid]) / 1000000;
if ($duration_ms > 100) {
printf("%dms - PID %d\n", $duration_ms, pid);
}
delete(@start[tid]);
}
Step 5: Production-Safe Patterns
Always include these safety measures:
// 1. Auto-exit after time limit
interval:s:30 { exit(); }
// 2. Rate limiting (exit if >100 events in any one second)
BEGIN { @count = 0; }
interval:s:1 { @count = 0; }
tracepoint:... {
@count++;
if (@count > 100) { exit(); }
}
// 3. Filter to specific PIDs
tracepoint:... /pid == $1/ { ... }
// 4. Aggregate instead of printing every event
tracepoint:... {
@calls[comm] = count();
}
END {
print(@calls);
clear(@calls);
}
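The four patterns compose. One possible production-safe skeleton (a sketch to adapt, not a drop-in script):

```bpftrace
#!/usr/bin/env bpftrace
// Production-safe skeleton: PID filter + aggregation + hard time limit.
// Run as: sudo bpftrace skeleton.bt <PID>
tracepoint:raw_syscalls:sys_enter /pid == $1/ {
    @syscalls[comm] = count();   // aggregate in-kernel instead of printing per event
}
interval:s:30 { exit(); }        // auto-exit after 30 seconds
END {
    print(@syscalls);
    clear(@syscalls);
}
```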
Run with PID filter:
# Find your app's PID first
pgrep myapp
# Pass the PID as the $1 positional parameter used in the predicate
sudo bpftrace script.bt $(pgrep myapp)
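Before running any AI-generated script on a production host, it helps to gate execution on the safety patterns above. A minimal sketch (a hypothetical wrapper, not part of bpftrace) that refuses scripts lacking an interval-based auto-exit:

```shell
# preflight: refuse to run a bpftrace script that has no auto-exit.
# Hypothetical helper, not part of bpftrace itself.
preflight() {
    script="$1"
    # Require an interval probe that calls exit() so the trace cannot run forever
    if ! grep -q 'interval:' "$script" || ! grep -q 'exit()' "$script"; then
        echo "REFUSED: $script has no auto-exit"
        return 1
    fi
    echo "OK: $script has an auto-exit"
    # sudo timeout 60s bpftrace "$script"   # uncomment to actually run the trace
}
```

Usage: preflight trace_logs.bt and only proceed if it prints OK.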
Verification
Test your eBPF script is safe and working:
# 1. Syntax check
sudo bpftrace -d script.bt
# 2. Check overhead (should be <1% CPU)
top -p $(pgrep bpftrace)
# 3. Verify output makes sense
sudo bpftrace script.bt | head -20
You should see: Clean output, low CPU usage, auto-exits after timeout
Real-World AI Prompts That Work
Network Latency
Write a bpftrace script that measures TCP connection latency
from SYN to ACK for connections to port 443. Show the
destination IP and time in milliseconds.
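For reference, a script in roughly the shape this prompt produces (a hand-written sketch, not verified AI output; the struct sock cast needs kernel BTF, and the port-443 filter is omitted for brevity):

```bpftrace
#!/usr/bin/env bpftrace
// Sketch: client-side TCP connect latency (SYN sent -> connection established).
kprobe:tcp_connect {
    @start[arg0] = nsecs;        // arg0 is the struct sock pointer
}
kprobe:tcp_finish_connect /@start[arg0]/ {
    $sk = (struct sock *)arg0;
    printf("%-16s %d ms\n", ntop($sk->__sk_common.skc_daddr),
        (nsecs - @start[arg0]) / 1000000);
    delete(@start[arg0]);
}
interval:s:60 { exit(); }
```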
Memory Allocations
Create a bpftrace script that tracks memory allocations >1MB
by process name. Print total memory allocated per process
when the script exits.
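A plausible shape for this one (a sketch: it measures requested allocation size via libc malloc, not resident memory, and the libc probe target may need a full library path on some distros):

```bpftrace
#!/usr/bin/env bpftrace
// Sketch: sum bytes requested in large (>1MB) malloc calls, per process.
uprobe:libc:malloc /arg0 > 1048576/ {
    @bytes[comm] = sum(arg0);
}
interval:s:60 { exit(); }
END {
    print(@bytes);
    clear(@bytes);
}
```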
File I/O Patterns
Generate a bpftrace script that shows which files are being
read vs written, grouped by process. Include read/write byte
counts.
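And a sketch of the read/write version using syscall exit tracepoints (byte counts come from the syscall return value; this groups by process only, since per-file grouping requires correlating file descriptors):

```bpftrace
#!/usr/bin/env bpftrace
// Sketch: bytes read vs written per process, from syscall return values.
tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @read_bytes[comm] = sum(args->ret); }
tracepoint:syscalls:sys_exit_write /args->ret > 0/ { @write_bytes[comm] = sum(args->ret); }
interval:s:30 { exit(); }
END {
    print(@read_bytes);
    print(@write_bytes);
    clear(@read_bytes);
    clear(@write_bytes);
}
```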
AI Tips:
- Specify kernel version if needed: "for Linux 5.15"
- Request safety features: "with auto-exit after 60 seconds"
- Ask for filters: "only trace processes owned by user nginx"
What You Learned
- eBPF provides kernel-level visibility with <1% overhead
- AI can generate production-ready scripts from plain English
- Always include timeouts and rate limits for safety
- USDT probes give better data but aren't always available
Limitations:
- Requires root access
- eBPF programs have instruction limits (4,096 instructions before kernel 5.2, 1 million after)
- Some kernel functions aren't safely traceable
- AI-generated scripts need validation before production
Debugging Tips
Script won't attach:
# Check if probe exists
sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*'
# List all available probes
sudo bpftrace -l | grep openat
No output:
# Add debug prints
BEGIN { printf("Script started\n"); }
# Check if filter is too restrictive
# Remove filters one by one to isolate issue
"Permission denied":
# Need root or CAP_BPF + CAP_PERFMON (kernel 5.8+)
sudo setcap cap_bpf,cap_perfmon+ep $(which bpftrace)
Production Checklist
- Script auto-exits after <60 seconds
- Filters limit scope (specific PID/path/port)
- Output is rate-limited or aggregated
- Tested on staging first
- Someone is monitoring the monitoring (meta!)
- Documented why this trace is running
- Cleanup command exists (sudo pkill bpftrace)
Tested on Ubuntu 22.04 LTS (kernel 5.15), bpftrace 0.20.4, Claude Sonnet 4