# Problem: Your LLM Bill Is Growing but You Don't Know Why
You're paying $800/month for OpenAI tokens. You have five product features using LLMs. You have no idea which feature is burning 70% of the budget.
LangSmith captures token usage on every trace — but by default it's a flat list. Without metadata tags, you can't slice costs by feature, user segment, or environment.
You'll learn:
- How to tag traces with `metadata` so every run is attributed to a feature
- How to query aggregated token counts and estimated costs via the LangSmith SDK
- How to build a repeatable cost-by-feature report you can run daily
Time: 20 min | Difficulty: Intermediate
## Why Default Tracing Isn't Enough
LangSmith auto-captures `prompt_tokens`, `completion_tokens`, and `total_tokens` on every LLM call. But without metadata, all traces land in the same project with no dimension to group by.
Symptoms you've hit this wall:
- Dashboard shows total tokens but no breakdown by feature or team
- You can filter by model, but not by "which product feature called this model"
- Monthly cost spikes with no clear owner
The fix is two parts: tag every trace at invocation time, then query those tags in aggregate.
## Solution
### Step 1: Add Feature Metadata to Every Trace
LangSmith reads metadata from the `LANGCHAIN_METADATA` env var or from the `metadata` kwarg passed directly to your chain or runnable. The direct kwarg approach is more reliable in production.
```python
# Requires: langchain>=0.2, langsmith>=0.1.70
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableConfig

llm = ChatOpenAI(model="gpt-4o-mini")

config = RunnableConfig(
    metadata={
        "feature": "rag-search",  # the product feature: your billing dimension
        "env": "production",
        "user_tier": "pro",
    },
    tags=["rag-search", "production"],  # tags for quick UI filtering
)

response = llm.invoke("Summarize this document", config=config)
```
For LangGraph agents, pass config through the graph invoke:
```python
from langgraph.graph import StateGraph

# Build your graph as normal...
app = graph.compile()

result = app.invoke(
    {"messages": [("user", "Your query here")]},
    config=RunnableConfig(
        metadata={"feature": "agent-copilot", "env": "production"}
    ),
)
```
Expected: Traces appear in LangSmith with a Metadata panel showing your keys.
### Step 2: Verify Tags Are Landing on Traces
Before building the aggregation query, confirm metadata is actually being stored.
```python
from langsmith import Client

client = Client()

# Pull the 5 most recent LLM runs for your project
runs = client.list_runs(
    project_name="your-project-name",
    run_type="llm",
    limit=5,
)
for run in runs:
    print(run.name, (run.extra or {}).get("metadata", {}))
```
Expected output:
```
ChatOpenAI {'feature': 'rag-search', 'env': 'production', 'user_tier': 'pro'}
ChatOpenAI {'feature': 'agent-copilot', 'env': 'production', 'user_tier': 'pro'}
```
If metadata is empty:
- Using LCEL chains: make sure `config` is forwarded through every `.invoke()` call, not just the top-level chain
- Using legacy `chain.run()`: switch to `.invoke(input, config=config)`; the old interface doesn't propagate config cleanly
### Step 3: Query Token Costs by Feature
LangSmith's `list_runs` supports server-side filtering, but grouping client-side keeps runs that are missing metadata visible: pull all LLM runs for a date range, group by feature, and calculate cost.
```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()

# Token cost map (USD per 1K tokens) -- update these when providers reprice
COST_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.010},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gpt-4-turbo": {"input": 0.010, "output": 0.030},
    "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015},
}

def get_model_key(run_name: str) -> str | None:
    """Map LangSmith run names to cost table keys."""
    name = run_name.lower()
    if "gpt-4o-mini" in name:
        return "gpt-4o-mini"
    if "gpt-4o" in name:
        return "gpt-4o"
    if "gpt-4-turbo" in name:
        return "gpt-4-turbo"
    if "claude-3-5-sonnet" in name:
        return "claude-3-5-sonnet-20241022"
    return None

def calculate_cost(tokens_input: int, tokens_output: int, model_key: str | None) -> float:
    if model_key not in COST_PER_1K:
        return 0.0
    rates = COST_PER_1K[model_key]
    return (tokens_input / 1000 * rates["input"]) + (tokens_output / 1000 * rates["output"])

# Query: last 7 days of LLM runs. Pull everything and group client-side so
# runs with no metadata still show up in the report's "untagged" bucket.
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

runs = client.list_runs(
    project_name="your-project-name",
    run_type="llm",
    start_time=start,
    end_time=end,
)

feature_spend = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0.0, "calls": 0})

for run in runs:
    extra = run.extra or {}
    feature = extra.get("metadata", {}).get("feature", "untagged")
    usage = extra.get("usage_metadata") or {}
    input_tokens = usage.get("input_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)

    model_key = get_model_key(run.name or "")
    cost = calculate_cost(input_tokens, output_tokens, model_key)

    feature_spend[feature]["input_tokens"] += input_tokens
    feature_spend[feature]["output_tokens"] += output_tokens
    feature_spend[feature]["cost"] += cost
    feature_spend[feature]["calls"] += 1

# Print the report, most expensive feature first
print(f"\n{'Feature':<25} {'Calls':>8} {'Input Tok':>12} {'Output Tok':>12} {'Est. Cost':>12}")
print("-" * 73)
for feature, data in sorted(feature_spend.items(), key=lambda x: -x[1]["cost"]):
    print(
        f"{feature:<25} {data['calls']:>8} {data['input_tokens']:>12,} "
        f"{data['output_tokens']:>12,} ${data['cost']:>11.4f}"
    )
```
Expected output:
```
Feature                      Calls    Input Tok   Output Tok    Est. Cost
-------------------------------------------------------------------------
agent-copilot                  118      603,200      411,820 $     5.6594
rag-search                     412      824,310      192,440 $     2.2761
summarizer                      87      198,400       44,100 $     0.0561
email-drafts                    34       42,100       38,200 $     0.0292
untagged                        19       31,200       12,800 $     0.0094
```
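Before trusting the aggregate numbers, it helps to spot-check the per-1K formula on a single hypothetical call; here, 1,000 input tokens and 500 output tokens on gpt-4o-mini:

```python
# Rates copied from the COST_PER_1K table above (USD per 1K tokens).
INPUT_RATE, OUTPUT_RATE = 0.00015, 0.0006  # gpt-4o-mini

# One hypothetical call: 1,000 input tokens, 500 output tokens.
cost = (1_000 / 1000) * INPUT_RATE + (500 / 1000) * OUTPUT_RATE
print(f"${cost:.5f}")  # $0.00045
```

If this hand calculation disagrees with the report for a single-call project, the bug is in token extraction, not the rate table.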
### Step 4: Export to CSV for Dashboards
Pipe the same data into a CSV for weekly stakeholder reports or Grafana ingestion.
```python
import csv
from pathlib import Path

output_path = Path("llm_spend_by_feature.csv")

with output_path.open("w", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["feature", "calls", "input_tokens", "output_tokens", "estimated_cost_usd"],
    )
    writer.writeheader()
    for feature, data in feature_spend.items():
        writer.writerow({
            "feature": feature,
            "calls": data["calls"],
            "input_tokens": data["input_tokens"],
            "output_tokens": data["output_tokens"],
            "estimated_cost_usd": round(data["cost"], 4),
        })

print(f"Report written to {output_path}")
```
Schedule this with a cron job or GitHub Actions to run every morning at 06:00.
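For GitHub Actions, a scheduled workflow along these lines works; the file path, script name, and secret name below are assumptions to adapt to your repo:

```yaml
# .github/workflows/llm-cost-report.yml (hypothetical path)
name: Daily LLM cost report
on:
  schedule:
    - cron: "0 6 * * *"  # every morning at 06:00 UTC
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install langsmith
      - run: python scripts/llm_spend_report.py  # hypothetical script name
        env:
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
```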
### Step 5: Alert on Spend Spikes
Add a simple threshold check to catch runaway features before they compound. The budgets below are per day, so if you feed them the 7-day totals from Step 3, either scale the budgets or narrow the query window to one day.
```python
DAILY_BUDGET_USD = {
    "rag-search": 2.00,
    "agent-copilot": 5.00,
    "summarizer": 0.50,
}

alerts = []
for feature, data in feature_spend.items():
    budget = DAILY_BUDGET_USD.get(feature)
    if budget and data["cost"] > budget:
        overage = data["cost"] - budget
        alerts.append(f"⚠️ {feature}: ${data['cost']:.2f} spent (+${overage:.2f} over budget)")

if alerts:
    print("\nCost Alerts:")
    for alert in alerts:
        print(alert)
else:
    print("\n✅ All features within budget.")
```
Wire alerts into Slack via a webhook or PagerDuty for production monitoring.
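The Slack side needs nothing beyond the standard library. A sketch, where the webhook URL is a placeholder you would replace with your own and `build_slack_payload` is a helper name of my choosing:

```python
import json
from urllib import request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"  # placeholder

def build_slack_payload(alerts: list[str]) -> dict:
    """Collapse cost alerts into a single Slack message body."""
    text = "\n".join(alerts) if alerts else "✅ All features within budget."
    return {"text": text}

payload = build_slack_payload(["⚠️ agent-copilot: $6.10 spent (+$1.10 over budget)"])

# Uncomment to actually send (requires a real webhook URL):
# req = request.Request(
#     SLACK_WEBHOOK_URL,
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# request.urlopen(req)
```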
## Verification
Run a quick sanity check: compare your script's total cost against LangSmith's UI.
```python
total = sum(d["cost"] for d in feature_spend.values())
print(f"Total estimated spend (7d): ${total:.4f}")
```
Cross-reference this against LangSmith → Project → Usage tab. The numbers won't be identical because LangSmith's UI may use slightly different pricing snapshots — but they should be within 5%. If they're off by more than 20%, check that your COST_PER_1K table matches current OpenAI pricing.
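That tolerance rule is easy to codify for the cron run; the helper name is my own, not part of the LangSmith SDK:

```python
def within_tolerance(script_total: float, ui_total: float, pct: float = 5.0) -> bool:
    """True when the script's estimate is within pct percent of the UI figure."""
    if ui_total == 0:
        return script_total == 0
    return abs(script_total - ui_total) / ui_total * 100 <= pct

print(within_tolerance(7.95, 8.02))          # small gap: pricing snapshots differ slightly
print(within_tolerance(6.00, 8.02, pct=20))  # far off: audit your COST_PER_1K table
```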
## What You Learned
- `metadata` in `RunnableConfig` is the correct way to tag traces, not environment variables
- `client.list_runs()` with `start_time`/`end_time` is efficient for date-range queries; avoid pulling all runs and filtering in Python
- `usage_metadata` lives under `run.extra`, not at the top-level run object
- The `untagged` bucket in your report tells you which code paths still need metadata; shrink it to zero before relying on the report for budgeting
Known limitation: LangSmith stores token counts per LLM call, not per top-level trace. If one user request triggers three LLM calls, you'll see three rows. The grouping by feature in this script correctly aggregates them — but if you want per-request cost, group by run.parent_run_id instead.
Tested on LangSmith SDK 0.2.x, LangChain 0.3.x, Python 3.12, Ubuntu 24.04