# Problem: Your LLM Bill Is Growing but You Don't Know Why
You're paying $800/month for OpenAI tokens. You have five product features using LLMs. You have no idea which feature is burning 70% of the budget.
LangSmith captures token usage on every trace — but by default it's a flat list. Without metadata tags, you can't slice costs by feature, user segment, or environment.
You'll learn:
- How to tag traces with `metadata` so every run is attributed to a feature
- How to query aggregated token counts and estimated costs via the LangSmith SDK
- How to build a repeatable cost-by-feature report you can run daily
Time: 20 min | Difficulty: Intermediate
## Why Default Tracing Isn't Enough
LangSmith auto-captures `prompt_tokens`, `completion_tokens`, and `total_tokens` on every LLM call. But without metadata, all traces land in the same project with no dimension to group by.
Symptoms you've hit this wall:
- Dashboard shows total tokens but no breakdown by feature or team
- You can filter by model, but not by "which product feature called this model"
- Monthly cost spikes with no clear owner
The fix is two parts: tag every trace at invocation time, then query those tags in aggregate.
## Solution
### Step 1: Add Feature Metadata to Every Trace
LangSmith reads metadata from the `LANGCHAIN_METADATA` env var or from the `metadata` kwarg passed directly to your chain or runnable. The direct kwarg approach is more reliable in production.
```python
# Requires: langchain>=0.2, langsmith>=0.1.70
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableConfig

llm = ChatOpenAI(model="gpt-4o-mini")

config = RunnableConfig(
    metadata={
        "feature": "rag-search",  # the product feature: your billing dimension
        "env": "production",
        "user_tier": "pro",
    },
    tags=["rag-search", "production"],  # tags for quick UI filtering
)

response = llm.invoke("Summarize this document", config=config)
```
For LangGraph agents, pass config through the graph invoke:
```python
from langgraph.graph import StateGraph

# Build your graph as normal...
app = graph.compile()

result = app.invoke(
    {"messages": [("user", "Your query here")]},
    config=RunnableConfig(
        metadata={"feature": "agent-copilot", "env": "production"}
    ),
)
```
Expected: Traces appear in LangSmith with a Metadata panel showing your keys.
### Step 2: Verify Tags Are Landing on Traces
Before building the aggregation query, confirm metadata is actually being stored.
```python
from langsmith import Client

client = Client()

# Pull the 5 most recent LLM runs for your project
runs = client.list_runs(
    project_name="your-project-name",
    run_type="llm",
    limit=5,
)
for run in runs:
    print(run.name, (run.extra or {}).get("metadata", {}))
```
Expected output:
```
ChatOpenAI {'feature': 'rag-search', 'env': 'production', 'user_tier': 'pro'}
ChatOpenAI {'feature': 'agent-copilot', 'env': 'production', 'user_tier': 'pro'}
```
If metadata is empty:
- Using LCEL chains: make sure `config` is forwarded through every `.invoke()` call, not just the top-level chain
- Using legacy `chain.run()`: switch to `.invoke(input, config=config)`; the old interface doesn't propagate config cleanly
### Step 3: Query Token Costs by Feature
LangSmith's `list_runs` supports server-side filtering, but grouping client-side keeps runs that are missing metadata visible: pull all LLM runs for a date range, group by feature, and calculate cost.
```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()

# Token cost map (USD per 1K tokens) -- update these when providers reprice
COST_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.010},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gpt-4-turbo": {"input": 0.010, "output": 0.030},
    "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015},
}

def get_model_key(run_name: str) -> str | None:
    """Map LangSmith run names to cost table keys."""
    name = run_name.lower()
    if "gpt-4o-mini" in name:
        return "gpt-4o-mini"
    if "gpt-4o" in name:
        return "gpt-4o"
    if "gpt-4-turbo" in name:
        return "gpt-4-turbo"
    if "claude-3-5-sonnet" in name:
        return "claude-3-5-sonnet-20241022"
    return None

def calculate_cost(tokens_input: int, tokens_output: int, model_key: str | None) -> float:
    if model_key not in COST_PER_1K:
        return 0.0
    rates = COST_PER_1K[model_key]
    return (tokens_input / 1000 * rates["input"]) + (tokens_output / 1000 * rates["output"])

# Query: last 7 days of LLM runs. Pull everything and group client-side so
# runs with no metadata still show up in the report's "untagged" bucket.
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

runs = client.list_runs(
    project_name="your-project-name",
    run_type="llm",
    start_time=start,
    end_time=end,
)

feature_spend = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0.0, "calls": 0})

for run in runs:
    extra = run.extra or {}
    feature = extra.get("metadata", {}).get("feature", "untagged")
    usage = extra.get("usage_metadata") or {}
    input_tokens = usage.get("input_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)

    model_key = get_model_key(run.name or "")
    cost = calculate_cost(input_tokens, output_tokens, model_key)

    feature_spend[feature]["input_tokens"] += input_tokens
    feature_spend[feature]["output_tokens"] += output_tokens
    feature_spend[feature]["cost"] += cost
    feature_spend[feature]["calls"] += 1

# Print the report, most expensive feature first
print(f"\n{'Feature':<25} {'Calls':>8} {'Input Tok':>12} {'Output Tok':>12} {'Est. Cost':>12}")
print("-" * 73)
for feature, data in sorted(feature_spend.items(), key=lambda x: -x[1]["cost"]):
    print(
        f"{feature:<25} {data['calls']:>8} {data['input_tokens']:>12,} "
        f"{data['output_tokens']:>12,} ${data['cost']:>11.4f}"
    )
```
Expected output:
```
Feature                      Calls    Input Tok   Output Tok    Est. Cost
-------------------------------------------------------------------------
agent-copilot                  118      603,200      411,820 $     5.6594
rag-search                     412      824,310      192,440 $     2.2761
summarizer                      87      198,400       44,100 $     0.0561
email-drafts                    34       42,100       38,200 $     0.0292
untagged                        19       31,200       12,800 $     0.0094
```
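Before trusting the aggregate numbers, it helps to spot-check the per-1K formula on a single hypothetical call; here, 1,000 input tokens and 500 output tokens on gpt-4o-mini:

```python
# Rates copied from the COST_PER_1K table above (USD per 1K tokens).
INPUT_RATE, OUTPUT_RATE = 0.00015, 0.0006  # gpt-4o-mini

# One hypothetical call: 1,000 input tokens, 500 output tokens.
cost = (1_000 / 1000) * INPUT_RATE + (500 / 1000) * OUTPUT_RATE
print(f"${cost:.5f}")  # $0.00045
```

If this hand calculation disagrees with the report for a single-call project, the bug is in token extraction, not the rate table.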
### Step 4: Export to CSV for Dashboards
Pipe the same data into a CSV for weekly stakeholder reports or Grafana ingestion.
```python
import csv
from pathlib import Path

output_path = Path("llm_spend_by_feature.csv")

with output_path.open("w", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["feature", "calls", "input_tokens", "output_tokens", "estimated_cost_usd"],
    )
    writer.writeheader()
    for feature, data in feature_spend.items():
        writer.writerow({
            "feature": feature,
            "calls": data["calls"],
            "input_tokens": data["input_tokens"],
            "output_tokens": data["output_tokens"],
            "estimated_cost_usd": round(data["cost"], 4),
        })

print(f"Report written to {output_path}")
```
Schedule this with a cron job or GitHub Actions to run every morning at 06:00.
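For GitHub Actions, a scheduled workflow along these lines works; the file path, script name, and secret name below are assumptions to adapt to your repo:

```yaml
# .github/workflows/llm-cost-report.yml (hypothetical path)
name: Daily LLM cost report
on:
  schedule:
    - cron: "0 6 * * *"  # every morning at 06:00 UTC
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install langsmith
      - run: python scripts/llm_spend_report.py  # hypothetical script name
        env:
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
```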
### Step 5: Alert on Spend Spikes
Add a simple threshold check to catch runaway features before they compound. The budgets below are per day, so if you feed them the 7-day totals from Step 3, either scale the budgets or narrow the query window to one day.
```python
DAILY_BUDGET_USD = {
    "rag-search": 2.00,
    "agent-copilot": 5.00,
    "summarizer": 0.50,
}

alerts = []
for feature, data in feature_spend.items():
    budget = DAILY_BUDGET_USD.get(feature)
    if budget and data["cost"] > budget:
        overage = data["cost"] - budget
        alerts.append(f"⚠️ {feature}: ${data['cost']:.2f} spent (+${overage:.2f} over budget)")

if alerts:
    print("\nCost Alerts:")
    for alert in alerts:
        print(alert)
else:
    print("\n✅ All features within budget.")
```
Wire alerts into Slack via a webhook or PagerDuty for production monitoring.
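The Slack side needs nothing beyond the standard library. A sketch, where the webhook URL is a placeholder you would replace with your own and `build_slack_payload` is a helper name of my choosing:

```python
import json
from urllib import request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"  # placeholder

def build_slack_payload(alerts: list[str]) -> dict:
    """Collapse cost alerts into a single Slack message body."""
    text = "\n".join(alerts) if alerts else "✅ All features within budget."
    return {"text": text}

payload = build_slack_payload(["⚠️ agent-copilot: $6.10 spent (+$1.10 over budget)"])

# Uncomment to actually send (requires a real webhook URL):
# req = request.Request(
#     SLACK_WEBHOOK_URL,
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# request.urlopen(req)
```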
## Verification
Run a quick sanity check: compare your script's total cost against LangSmith's UI.
```python
total = sum(d["cost"] for d in feature_spend.values())
print(f"Total estimated spend (7d): ${total:.4f}")
```
Cross-reference this against LangSmith → Project → Usage tab. The numbers won't be identical because LangSmith's UI may use slightly different pricing snapshots — but they should be within 5%. If they're off by more than 20%, check that your COST_PER_1K table matches current OpenAI pricing.
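That tolerance rule is easy to codify for the cron run; the helper name is my own, not part of the LangSmith SDK:

```python
def within_tolerance(script_total: float, ui_total: float, pct: float = 5.0) -> bool:
    """True when the script's estimate is within pct percent of the UI figure."""
    if ui_total == 0:
        return script_total == 0
    return abs(script_total - ui_total) / ui_total * 100 <= pct

print(within_tolerance(7.95, 8.02))          # small gap: pricing snapshots differ slightly
print(within_tolerance(6.00, 8.02, pct=20))  # far off: audit your COST_PER_1K table
```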
## What You Learned
- `metadata` in `RunnableConfig` is the correct way to tag traces, not environment variables
- `client.list_runs()` with `start_time`/`end_time` is efficient for date-range queries; avoid pulling all runs and filtering in Python
- `usage_metadata` lives under `run.extra`, not at the top-level run object
- The `untagged` bucket in your report tells you which code paths still need metadata; shrink it to zero before relying on the report for budgeting
Known limitation: LangSmith stores token counts per LLM call, not per top-level trace. If one user request triggers three LLM calls, you'll see three rows. The grouping by feature in this script correctly aggregates them — but if you want per-request cost, group by run.parent_run_id instead.
Tested on LangSmith SDK 0.2.x, LangChain 0.3.x, Python 3.12, Ubuntu 24.04