Problem: LLM Bills Spike With No Warning
LLM cost tracking with per-request budget alerts is the fastest way to stop surprise $400 OpenAI invoices from hitting your credit card at the end of the month.
You shipped a feature. A user hit it 10,000 times overnight with long prompts. You had no idea until billing ran.
You'll learn:
- How to extract token usage from OpenAI, Anthropic, and Gemini responses
- How to calculate per-request USD cost using a live pricing table
- How to fire an alert (log, raise, Slack webhook) when a single call exceeds your threshold
Time: 20 min | Difficulty: Intermediate
Why This Happens
LLM APIs charge per token. A single GPT-4o call with a 10k-token context costs ~$0.025. That sounds small — until an unguarded loop runs it 50,000 times.
Symptoms:
- Monthly invoice is 10× higher than expected
- No per-endpoint visibility into which feature is expensive
- No alerting when a single request goes rogue (huge prompt, runaway agent loop)
The fix is a thin cost-tracking wrapper: intercept every API response, read the input and output token counts from its usage object, multiply each by the model's per-token rate, and compare the total against your threshold.
Flow: every LLM call → token extractor → cost calculator → threshold check → alert or pass-through
Solution
Step 1: Install Dependencies
This guide uses uv (Python 3.12). The only required library is the SDK for whichever provider you use. No LangChain needed.
uv init llm-cost-tracker
cd llm-cost-tracker
# Install provider SDKs you need — pick one or all
uv add openai anthropic google-generativeai
# Optional: httpx for Slack webhook alerts
uv add httpx
Expected output: Resolved X packages in Xs
Step 2: Build the Pricing Table
Hard-code the per-token rates for every model you call. Update this table when providers change pricing (check their pricing pages monthly).
Prices below are in USD as of March 2026. OpenAI GPT-4o is $2.50 per 1M input tokens / $10.00 per 1M output tokens.
# costs.py
# All prices in USD per token (not per 1M)
# Source: provider pricing pages, March 2026
MODEL_COSTS: dict[str, dict[str, float]] = {
# OpenAI
"gpt-4o": {
"input": 2.50 / 1_000_000, # $2.50 per 1M input tokens
"output": 10.00 / 1_000_000, # $10.00 per 1M output tokens
},
"gpt-4o-mini": {
"input": 0.15 / 1_000_000,
"output": 0.60 / 1_000_000,
},
"gpt-4.1": {
"input": 2.00 / 1_000_000,
"output": 8.00 / 1_000_000,
},
# Anthropic
"claude-opus-4-5": {
"input": 15.00 / 1_000_000,
"output": 75.00 / 1_000_000,
},
"claude-sonnet-4-5": {
"input": 3.00 / 1_000_000,
"output": 15.00 / 1_000_000,
},
"claude-haiku-3-5": {
"input": 0.80 / 1_000_000,
"output": 4.00 / 1_000_000,
},
# Google
"gemini-2.0-flash": {
"input": 0.10 / 1_000_000,
"output": 0.40 / 1_000_000,
},
"gemini-2.5-pro": {
"input": 1.25 / 1_000_000,
"output": 10.00 / 1_000_000,
},
}
def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
"""Return USD cost for one LLM API call."""
rates = MODEL_COSTS.get(model)
if rates is None:
# Unknown model — fail loudly rather than silently under-count
raise ValueError(f"No pricing data for model '{model}'. Add it to MODEL_COSTS.")
return (input_tokens * rates["input"]) + (output_tokens * rates["output"])
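As a quick sanity check of the arithmetic (rates copied from the gpt-4o entry above), a 10k-token prompt with a 500-token reply comes to about three cents:

```python
# gpt-4o rates from MODEL_COSTS (March 2026 pricing)
rates = {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000}

# 10,000 input tokens + 500 output tokens
cost = 10_000 * rates["input"] + 500 * rates["output"]
print(f"${cost:.4f}")  # prints $0.0300
```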
Step 3: Build the Budget Alert Layer
This is a dataclass that holds the result of every tracked call. The alert method is where you plug in your notification channel.
# tracker.py
from dataclasses import dataclass, field
from datetime import datetime, timezone
import logging
logger = logging.getLogger(__name__)
@dataclass
class LLMCallRecord:
model: str
input_tokens: int
output_tokens: int
cost_usd: float
timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
endpoint: str = "unknown" # tag with feature name for per-endpoint breakdown
class BudgetTracker:
def __init__(self, alert_threshold_usd: float = 0.10):
# Fire an alert if a single request costs more than this
self.threshold = alert_threshold_usd
self.records: list[LLMCallRecord] = []
def record(self, record: LLMCallRecord) -> None:
self.records.append(record)
if record.cost_usd > self.threshold:
self._alert(record)
def _alert(self, record: LLMCallRecord) -> None:
"""Override this method to send Slack/PagerDuty/email alerts."""
logger.warning(
"BUDGET ALERT | endpoint=%s model=%s cost=$%.4f tokens_in=%d tokens_out=%d",
record.endpoint,
record.model,
record.cost_usd,
record.input_tokens,
record.output_tokens,
)
def total_cost(self) -> float:
return sum(r.cost_usd for r in self.records)
def summary(self) -> dict:
return {
"total_requests": len(self.records),
"total_cost_usd": round(self.total_cost(), 6),
"alerts_fired": sum(1 for r in self.records if r.cost_usd > self.threshold),
}
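To see the threshold logic in isolation before wiring in a real provider, here is a standalone demo. The two classes are condensed copies of the tracker.py code above so the snippet runs on its own; the costs are made-up numbers:

```python
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

# Condensed copies of tracker.py's classes so this demo is self-contained
@dataclass
class LLMCallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    endpoint: str = "unknown"

class BudgetTracker:
    def __init__(self, alert_threshold_usd: float = 0.10):
        self.threshold = alert_threshold_usd
        self.records: list[LLMCallRecord] = []

    def record(self, record: LLMCallRecord) -> None:
        self.records.append(record)
        if record.cost_usd > self.threshold:
            logger.warning("BUDGET ALERT | %s cost=$%.4f", record.endpoint, record.cost_usd)

    def summary(self) -> dict:
        return {
            "total_requests": len(self.records),
            "total_cost_usd": round(sum(r.cost_usd for r in self.records), 6),
            "alerts_fired": sum(1 for r in self.records if r.cost_usd > self.threshold),
        }

tracker = BudgetTracker(alert_threshold_usd=0.01)
tracker.record(LLMCallRecord("gpt-4o-mini", 200, 100, cost_usd=0.00009, endpoint="cheap"))
tracker.record(LLMCallRecord("gpt-4o", 50_000, 2_000, cost_usd=0.145, endpoint="runaway"))  # crosses threshold
print(tracker.summary())
```

The second record fires the warning; the first passes through silently.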
Step 4: Wrap OpenAI Calls
# openai_wrapper.py
from openai import OpenAI
from costs import calculate_cost
from tracker import BudgetTracker, LLMCallRecord
client = OpenAI() # reads OPENAI_API_KEY from env
def chat(
messages: list[dict],
model: str = "gpt-4o-mini",
tracker: BudgetTracker | None = None,
endpoint: str = "unknown",
) -> str:
response = client.chat.completions.create(model=model, messages=messages)
if tracker is not None:
usage = response.usage # usage.prompt_tokens, usage.completion_tokens
cost = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
tracker.record(
LLMCallRecord(
model=model,
input_tokens=usage.prompt_tokens,
output_tokens=usage.completion_tokens,
cost_usd=cost,
endpoint=endpoint,
)
)
# Return the text content of the first choice
return response.choices[0].message.content
Step 5: Wrap Anthropic Calls
Anthropic's response object uses usage.input_tokens and usage.output_tokens — different field names from OpenAI.
# anthropic_wrapper.py
import anthropic
from costs import calculate_cost
from tracker import BudgetTracker, LLMCallRecord
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from env
def chat(
prompt: str,
model: str = "claude-haiku-3-5",
tracker: BudgetTracker | None = None,
endpoint: str = "unknown",
) -> str:
response = client.messages.create(
model=model,
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
)
if tracker is not None:
usage = response.usage # usage.input_tokens, usage.output_tokens
cost = calculate_cost(model, usage.input_tokens, usage.output_tokens)
tracker.record(
LLMCallRecord(
model=model,
input_tokens=usage.input_tokens,
output_tokens=usage.output_tokens,
cost_usd=cost,
endpoint=endpoint,
)
)
return response.content[0].text
Step 6: Add Slack Webhook Alerts (Optional)
Override _alert in BudgetTracker to post to Slack when a request exceeds your threshold.
# slack_tracker.py
import httpx
from tracker import BudgetTracker, LLMCallRecord
class SlackBudgetTracker(BudgetTracker):
def __init__(self, alert_threshold_usd: float, webhook_url: str):
super().__init__(alert_threshold_usd)
self.webhook_url = webhook_url
def _alert(self, record: LLMCallRecord) -> None:
# Also call parent so the warning still appears in logs
super()._alert(record)
payload = {
"text": (
f":rotating_light: *LLM Budget Alert*\n"
f"• Endpoint: `{record.endpoint}`\n"
f"• Model: `{record.model}`\n"
f"• Cost: `${record.cost_usd:.4f}` (threshold: `${self.threshold:.4f}`)\n"
f"• Tokens: {record.input_tokens} in / {record.output_tokens} out"
)
}
        # Best-effort: the short timeout keeps Slack latency from stalling the request
        try:
            httpx.post(self.webhook_url, json=payload, timeout=3.0)
        except httpx.HTTPError:
            pass  # Slack is down or unreachable: don't crash the app over a missed alert
Step 7: Wire It Into Your App
# main.py
import logging
import os
from slack_tracker import SlackBudgetTracker
from openai_wrapper import chat as openai_chat
from anthropic_wrapper import chat as anthropic_chat
logging.basicConfig(level=logging.INFO)
tracker = SlackBudgetTracker(
alert_threshold_usd=0.05, # Alert on any single call over $0.05
webhook_url=os.environ["SLACK_WEBHOOK_URL"],
)
# Simulate two feature endpoints
response_a = openai_chat(
messages=[{"role": "user", "content": "Summarize the history of the Roman Empire in detail."}],
model="gpt-4o",
tracker=tracker,
endpoint="summarize",
)
response_b = anthropic_chat(
prompt="What is 2 + 2?",
model="claude-haiku-3-5",
tracker=tracker,
endpoint="math-tutor",
)
print(tracker.summary())
# e.g. {'total_requests': 2, 'total_cost_usd': 0.00382, 'alerts_fired': 0} (cost varies per run)
Verification
Run the app and check for the summary dict and any budget warnings:
uv run python main.py
You should see the summary dict printed (your exact cost will differ, since response lengths vary between runs):
{'total_requests': 2, 'total_cost_usd': 0.003200, 'alerts_fired': 0}
Nothing is logged for calls under the threshold; BudgetTracker only emits a WARNING when an alert fires.
To trigger a test alert, temporarily set alert_threshold_usd=0.000001 and run again. You should see a WARNING: BUDGET ALERT line (and a Slack message if the webhook is configured).
Comparison: DIY Tracking vs Observability Platforms
| | This approach | LangSmith | Helicone |
|---|---|---|---|
| Setup time | ~20 min | ~10 min | ~5 min |
| Per-request cost | $0 | Free tier, then $39+/mo | Free tier, then $20+/mo |
| Self-hosted | ✅ | ❌ | ❌ |
| Custom alert logic | ✅ Full control | ❌ Limited | ❌ Limited |
| Works without LangChain | ✅ | ⚠️ Better with it | ✅ |
| Stores prompt/response logs | ❌ (add your own) | ✅ | ✅ |
Choose this DIY approach if: you want zero vendor lock-in, need custom alert routing, or can't ship prompt data to a third party (SOC 2 / HIPAA workloads, air-gapped environments).
Choose LangSmith or Helicone if: you want prompt replay, dataset management, and a UI dashboard without building it yourself.
What You Learned
- response.usage on OpenAI returns prompt_tokens/completion_tokens; Anthropic returns input_tokens/output_tokens — they are not the same field names
- A per-token pricing table plus a small dataclass gives you full cost visibility with no third-party dependency
- Subclassing BudgetTracker._alert lets you swap notification channels (log → Slack → PagerDuty) without changing any call sites
Tested on Python 3.12, openai 1.75, anthropic 0.49, uv 0.6.x, macOS Sequoia & Ubuntu 24.04
FAQ
Q: Does this work with streaming responses?
A: Yes, but you must wait for the stream to finish before reading usage. With OpenAI streaming, pass stream_options={"include_usage": True} and read usage from the final chunk. With Anthropic streaming, usage is in the message_delta event at the end of the stream.
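That pattern can be sketched as follows. The usage_from_stream helper is a hypothetical name, and the SimpleNamespace chunks are stand-ins for the SDK's chunk objects; with the real client you would iterate the stream returned by client.chat.completions.create(..., stream=True, stream_options={"include_usage": True}):

```python
from types import SimpleNamespace

def usage_from_stream(chunks):
    """Return the usage object from the last chunk that carries one.

    With include_usage set, OpenAI sends usage=None on every content
    delta and populates it only on the final chunk of the stream.
    """
    usage = None
    for chunk in chunks:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return usage

# Stand-in chunks mimicking the SDK's shape: usage arrives only at the end
fake_stream = [
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=SimpleNamespace(prompt_tokens=12, completion_tokens=30)),
]
usage = usage_from_stream(fake_stream)
print(usage.prompt_tokens, usage.completion_tokens)  # prints 12 30
```

Once you have the usage object, hand it to calculate_cost exactly as in the non-streaming wrappers.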
Q: What if a model isn't in my pricing table?
A: calculate_cost raises a ValueError immediately. This is intentional — silently returning $0 would make your cost reports wrong. Add the model's rates to MODEL_COSTS before calling it.
Q: Can I track costs across multiple workers or processes?
A: The BudgetTracker in this guide is in-process only. For multi-process apps, write records to a shared store — PostgreSQL, Redis, or a time-series DB like InfluxDB. Use a background thread or async task to flush records so you don't add latency to the request path.
Q: How do I get per-endpoint cost breakdowns?
A: Filter tracker.records by the endpoint field: [r for r in tracker.records if r.endpoint == "summarize"]. Sum their cost_usd values. For production, group by endpoint and write daily rollups to your DB.
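A minimal rollup helper along those lines (cost_by_endpoint is a hypothetical name; the records only need the endpoint and cost_usd fields from LLMCallRecord, so stand-ins are used here to keep the snippet self-contained):

```python
from collections import defaultdict
from types import SimpleNamespace

def cost_by_endpoint(records) -> dict[str, float]:
    """Sum cost_usd per endpoint tag."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.endpoint] += r.cost_usd
    return dict(totals)

# Stand-ins with the same fields as LLMCallRecord
records = [
    SimpleNamespace(endpoint="summarize", cost_usd=0.0031),
    SimpleNamespace(endpoint="summarize", cost_usd=0.0029),
    SimpleNamespace(endpoint="math-tutor", cost_usd=0.0001),
]
print(cost_by_endpoint(records))
```

In your app, pass tracker.records directly instead of the stand-in list.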
Q: Does this handle cached tokens (OpenAI prompt caching)?
A: OpenAI's prompt caching reports a prompt_tokens_details.cached_tokens field; that count is a subset of prompt_tokens, not in addition to it. Cached tokens are billed at a discount (50% of the normal input rate for gpt-4o; check the pricing page for other models). To account for it in calculate_cost, read the cached count (guarding against prompt_tokens_details being absent), subtract it from the input count, and price the cached portion at rates["input"] * 0.5.
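A sketch of that adjustment, assuming a flat 50% cached-input discount (verify the discount for your model on OpenAI's pricing page). calculate_cost_cached is a hypothetical variant of the calculate_cost function from costs.py, and the rates dict mirrors the table's gpt-4o entry:

```python
# Rates mirroring the gpt-4o entry in MODEL_COSTS
GPT4O_RATES = {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000}

def calculate_cost_cached(rates: dict[str, float], input_tokens: int,
                          output_tokens: int, cached_tokens: int = 0) -> float:
    """Price cached input tokens at half rate.

    cached_tokens is a subset of input_tokens, so subtract it before
    pricing the uncached portion at the full input rate.
    """
    uncached = input_tokens - cached_tokens
    return (
        uncached * rates["input"]
        + cached_tokens * rates["input"] * 0.5
        + output_tokens * rates["output"]
    )

# A fully cached 10k-token prompt costs half the uncached price
print(calculate_cost_cached(GPT4O_RATES, 10_000, 0, cached_tokens=10_000))
```

When calling it, pull the cached count defensively, e.g. cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0) or 0, since the details object is not always present.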