LangSmith Cost Analytics: Track LLM Spend by Feature

Tag LangSmith traces by feature, query token costs with the SDK, and build a spend-by-feature dashboard. Stop billing surprises before they hit.

Problem: Your LLM Bill Is Growing but You Don't Know Why

You're paying $800/month for OpenAI tokens. You have five product features using LLMs. You have no idea which feature is burning 70% of the budget.

LangSmith captures token usage on every trace — but by default it's a flat list. Without metadata tags, you can't slice costs by feature, user segment, or environment.

You'll learn:

  • How to tag traces with metadata so every run is attributed to a feature
  • How to query aggregated token counts and estimated costs via the LangSmith SDK
  • How to build a repeatable cost-by-feature report you can run daily

Time: 20 min | Difficulty: Intermediate


Why Default Tracing Isn't Enough

LangSmith auto-captures prompt_tokens, completion_tokens, and total_tokens on every LLM call. But without metadata, all traces land in the same project with no dimension to group by.

Symptoms you've hit this wall:

  • Dashboard shows total tokens but no breakdown by feature or team
  • You can filter by model, but not by "which product feature called this model"
  • Monthly cost spikes with no clear owner

The fix is two parts: tag every trace at invocation time, then query those tags in aggregate.


Solution

Step 1: Add Feature Metadata to Every Trace

LangSmith reads metadata from the LANGCHAIN_METADATA env var or from the metadata kwarg passed directly to your chain or runnable. The direct kwarg approach is more reliable in production.

# Requires: langchain>=0.3, langsmith>=0.2
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableConfig

llm = ChatOpenAI(model="gpt-4o-mini")

config = RunnableConfig(
    metadata={
        "feature": "rag-search",        # The product feature billing dimension
        "env": "production",
        "user_tier": "pro",
    },
    tags=["rag-search", "production"],  # Tags for quick UI filtering
)

response = llm.invoke("Summarize this document", config=config)

For LangGraph agents, pass config through the graph invoke:

from langgraph.graph import StateGraph

# Build your graph as normal...
app = graph.compile()

result = app.invoke(
    {"messages": [("user", "Your query here")]},
    config=RunnableConfig(
        metadata={"feature": "agent-copilot", "env": "production"}
    ),
)

Expected: Traces appear in LangSmith with a Metadata panel showing your keys.


Step 2: Verify Tags Are Landing on Traces

Before building the aggregation query, confirm metadata is actually being stored.

from langsmith import Client

client = Client()

# Pull the 5 most recent runs for your project
runs = client.list_runs(
    project_name="your-project-name",
    run_type="llm",
    limit=5,
)

for run in runs:
    print(run.name, (run.extra or {}).get("metadata", {}))

Expected output:

ChatOpenAI {'feature': 'rag-search', 'env': 'production', 'user_tier': 'pro'}
ChatOpenAI {'feature': 'agent-copilot', 'env': 'production'}

If metadata is empty:

  • Using LCEL chains → Make sure config is forwarded through every .invoke() call, not just the top-level chain
  • Using legacy chain.run() → Switch to .invoke(input, config=config) — the old interface doesn't propagate config cleanly
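
The forwarding requirement can be illustrated with a plain-Python sketch — the functions and dicts below are stand-ins for chain steps and RunnableConfig, not real LangChain types, but the propagation pattern is the same:

```python
# Plain-Python sketch: config must be passed to EVERY nested call,
# not just the top-level one, or sub-steps lose the metadata.
captured = []

def retrieve(query: str, config: dict) -> str:
    # A sub-step only sees metadata if the caller forwards config
    captured.append(config.get("metadata", {}))
    return f"docs for {query}"

def generate(context: str, config: dict) -> str:
    captured.append(config.get("metadata", {}))
    return f"answer from {context}"

def pipeline(query: str, config: dict) -> str:
    # The fix: thread the same config through each nested invoke
    docs = retrieve(query, config=config)
    return generate(docs, config=config)

config = {"metadata": {"feature": "rag-search"}}
pipeline("pricing", config=config)
print(captured)  # both steps saw the feature tag
```

If a step drops the config argument, its trace (and token usage) silently falls into the untagged bucket.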

Step 3: Query Token Costs by Feature

LangSmith's list_runs supports server-side filtering on metadata fields. Pull all LLM runs for a date range, group by feature, and calculate cost.

from langsmith import Client
from collections import defaultdict
from datetime import datetime, timedelta, timezone

client = Client()

# Token cost map — update these when OpenAI reprices
COST_PER_1K = {
    "gpt-4o":           {"input": 0.0025, "output": 0.010},
    "gpt-4o-mini":      {"input": 0.00015, "output": 0.0006},
    "gpt-4-turbo":      {"input": 0.010,  "output": 0.030},
    "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015},
}

def get_model_key(run_name: str) -> str | None:
    """Map LangSmith run names to cost table keys."""
    name = run_name.lower()
    if "gpt-4o-mini" in name:
        return "gpt-4o-mini"
    if "gpt-4o" in name:
        return "gpt-4o"
    if "gpt-4-turbo" in name:
        return "gpt-4-turbo"
    if "claude-3-5-sonnet" in name:
        return "claude-3-5-sonnet-20241022"
    return None

def calculate_cost(tokens_input: int, tokens_output: int, model_key: str | None) -> float:
    if model_key not in COST_PER_1K:
        return 0.0
    rates = COST_PER_1K[model_key]
    return (tokens_input / 1000 * rates["input"]) + (tokens_output / 1000 * rates["output"])

# Query: last 7 days of LLM runs
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

runs = client.list_runs(
    project_name="your-project-name",
    run_type="llm",
    start_time=start,
    end_time=end,
    # No metadata filter — keep untagged runs so the report can surface them
)

feature_spend = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0.0, "calls": 0})

for run in runs:
    extra = run.extra or {}
    feature = extra.get("metadata", {}).get("feature", "untagged")
    usage = extra.get("usage_metadata") or {}

    input_tokens  = usage.get("input_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)
    model_key     = get_model_key(run.name or "")
    cost          = calculate_cost(input_tokens, output_tokens, model_key)

    feature_spend[feature]["input_tokens"]  += input_tokens
    feature_spend[feature]["output_tokens"] += output_tokens
    feature_spend[feature]["cost"]          += cost
    feature_spend[feature]["calls"]         += 1

# Print the report
print(f"\n{'Feature':<25} {'Calls':>8} {'Input Tok':>12} {'Output Tok':>12} {'Est. Cost':>12}")
print("-" * 73)
for feature, data in sorted(feature_spend.items(), key=lambda x: -x[1]["cost"]):
    print(
        f"{feature:<25} {data['calls']:>8} {data['input_tokens']:>12,} "
        f"{data['output_tokens']:>12,} ${data['cost']:>11.4f}"
    )

Expected output:

Feature                   Calls   Input Tok   Output Tok     Est. Cost
-------------------------------------------------------------------------
rag-search                  412     824,310      192,440       $2.2761
agent-copilot               118     603,200      411,820       $5.6594
summarizer                   87     198,400       44,100       $0.0561
email-drafts                 34      42,100       38,200       $0.0292
untagged                     19      31,200       12,800       $0.0094

Step 4: Export to CSV for Dashboards

Pipe the same data into a CSV for weekly stakeholder reports or Grafana ingestion.

import csv
from pathlib import Path

output_path = Path("llm_spend_by_feature.csv")

with output_path.open("w", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["feature", "calls", "input_tokens", "output_tokens", "estimated_cost_usd"],
    )
    writer.writeheader()
    for feature, data in feature_spend.items():
        writer.writerow({
            "feature":               feature,
            "calls":                 data["calls"],
            "input_tokens":          data["input_tokens"],
            "output_tokens":         data["output_tokens"],
            "estimated_cost_usd":    round(data["cost"], 4),
        })

print(f"Report written to {output_path}")

Schedule this with a cron job or GitHub Actions to run every morning at 06:00.


Step 5: Alert on Spend Spikes

Add a simple threshold check to catch runaway features before they compound. The budgets below are per-day figures, so run the Step 3 query with a one-day window (timedelta(days=1)) before comparing against them.

DAILY_BUDGET_USD = {
    "rag-search":    2.00,
    "agent-copilot": 5.00,
    "summarizer":    0.50,
}

alerts = []
for feature, data in feature_spend.items():
    budget = DAILY_BUDGET_USD.get(feature)
    if budget and data["cost"] > budget:
        overage = data["cost"] - budget
        alerts.append(f"⚠️  {feature}: ${data['cost']:.2f} spent (+${overage:.2f} over budget)")

if alerts:
    print("\nCost Alerts:")
    for alert in alerts:
        print(alert)
else:
    print("\n✅ All features within budget.")

Wire alerts into Slack via a webhook or PagerDuty for production monitoring.
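
As a sketch of the Slack side (assuming a SLACK_WEBHOOK_URL environment variable and Slack's standard incoming-webhook JSON payload), posting the alerts might look like:

```python
import json
import os
import urllib.request

def build_slack_payload(alerts: list[str]) -> dict:
    """Format cost alerts as a Slack incoming-webhook message."""
    return {"text": "Cost Alerts:\n" + "\n".join(alerts)}

def post_to_slack(alerts: list[str]) -> None:
    # SLACK_WEBHOOK_URL is an assumed env var holding your webhook endpoint
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook_url or not alerts:
        return  # nothing to send, or webhook not configured
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_slack_payload(alerts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

alerts = ["⚠️  rag-search: $2.41 spent (+$0.41 over budget)"]
post_to_slack(alerts)  # silently no-ops when SLACK_WEBHOOK_URL is unset
```

Pass the alerts list from Step 5 straight into post_to_slack at the end of the script.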


Verification

Run a quick sanity check: compare your script's total cost against LangSmith's UI.

total = sum(d["cost"] for d in feature_spend.values())
print(f"Total estimated spend (7d): ${total:.4f}")

Cross-reference this against LangSmith → Project → Usage tab. The numbers won't be identical because LangSmith's UI may use slightly different pricing snapshots — but they should be within 5%. If they're off by more than 20%, check that your COST_PER_1K table matches current OpenAI pricing.


What You Learned

  • metadata in RunnableConfig is the correct way to tag traces — not environment variables
  • client.list_runs() with start_time/end_time is efficient for date-range queries; avoid pulling all runs and filtering in Python
  • usage_metadata lives under run.extra, not at the top-level run object
  • The untagged bucket in your report tells you which code paths still need metadata — shrink it to zero before relying on the report for budgeting

Known limitation: LangSmith stores token counts per LLM call, not per top-level trace. If one user request triggers three LLM calls, you'll see three rows. The grouping by feature in this script correctly aggregates them — but if you want per-request cost, group by run.trace_id (the top-level trace identifier) instead, since parent_run_id only points one level up in nested agent runs.
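
That per-request roll-up is the same defaultdict aggregation keyed on the trace identifier instead of the feature tag. A pure-Python sketch over run-shaped dicts (real LangSmith Run objects carry the same field):

```python
from collections import defaultdict

# Stand-in runs: two LLM calls from one user request, one from another
runs = [
    {"trace_id": "req-1", "total_tokens": 1200},
    {"trace_id": "req-1", "total_tokens": 400},   # same request, 2nd call
    {"trace_id": "req-2", "total_tokens": 900},
]

# Sum token usage per top-level request
per_request = defaultdict(int)
for run in runs:
    per_request[run["trace_id"]] += run["total_tokens"]

print(dict(per_request))  # {'req-1': 1600, 'req-2': 900}
```

Multiply each request's tokens by the per-model rates from Step 3 to get per-request cost.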

Tested on LangSmith SDK 0.2.x, LangChain 0.3.x, Python 3.12, Ubuntu 24.04