Problem: You Don't Know If Your LLM Got Worse
You change a prompt. The app feels fine in manual testing. You ship it. Three days later, users report that summaries are truncated and citations stopped working.
This is what happens without evaluation datasets. You have no repeatable benchmark — no way to catch regressions before they reach production.
You'll learn:
- How to create and version LangSmith datasets from real traces
- How to write custom evaluators for your specific quality criteria
- How to run evals in CI so regressions are caught automatically
Time: 20 min | Difficulty: Intermediate
Why LangSmith Datasets Exist
A LangSmith dataset is a versioned collection of input/output pairs — examples that define what "correct" looks like for your chain or agent.
Every time you change a model, prompt, or retrieval strategy, you run your chain against the same dataset and score the results. If scores drop, you know exactly what broke and when.
Dataset (inputs + reference outputs)
│
▼
Your LLM Chain
│
▼
Evaluator (scores each output)
│
▼
Experiment Results (stored in LangSmith)
Datasets are independent of your code. You build them once and reuse them across every experiment.
Setup
Step 1: Install the LangSmith SDK
# The LangSmith SDK is a standalone package (it is not bundled with langchain-core)
pip install langsmith langchain-openai
# Verify
python -c "import langsmith; print(langsmith.__version__)"
Expected: 0.2.x or higher
Set your credentials:
export LANGSMITH_API_KEY="ls__your_key_here"
export LANGSMITH_TRACING=true
export OPENAI_API_KEY="sk-..."
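Eval scripts fail in confusing ways when a credential is missing, so it's worth failing fast with a clear message. Here's a minimal stdlib-only sketch — the helper name `missing_env` is ours, not part of the LangSmith SDK:

```python
import os

REQUIRED_VARS = ["LANGSMITH_API_KEY", "LANGSMITH_TRACING", "OPENAI_API_KEY"]

def missing_env(required, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Call at the top of any eval script:
#   missing = missing_env(REQUIRED_VARS)
#   if missing:
#       raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Passing the environment mapping as a parameter keeps the check unit-testable without touching the real process environment.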
Step 2: Create a Dataset
You can create a dataset three ways: through the web UI, by uploading a CSV, or programmatically with the SDK. Start with the programmatic approach — it's the most reproducible.
from langsmith import Client
client = Client()
# Define your examples: input dict + reference output dict
examples = [
{
"inputs": {"question": "What is RAG?"},
"outputs": {"answer": "RAG stands for Retrieval-Augmented Generation. It combines a retriever that fetches relevant documents with an LLM that generates answers grounded in those documents."},
},
{
"inputs": {"question": "What is the difference between LangChain and LangGraph?"},
"outputs": {"answer": "LangChain is a framework for building LLM chains and pipelines. LangGraph extends it with stateful, graph-based agent workflows that support cycles and branching."},
},
{
"inputs": {"question": "When should I use streaming in an LLM API call?"},
"outputs": {"answer": "Use streaming when the user is waiting for a response in real time. It reduces perceived latency by sending tokens as they generate instead of waiting for the full response."},
},
]
dataset = client.create_dataset(
dataset_name="qa-benchmark-v1",
description="Q&A pairs for evaluating answer quality and factual accuracy",
)
client.create_examples(
inputs=[e["inputs"] for e in examples],
outputs=[e["outputs"] for e in examples],
dataset_id=dataset.id,
)
print(f"Dataset created: {dataset.id}")
Expected output:
Dataset created: ds_abc123...
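One pitfall: re-running this setup script with a fixed dataset name will fail once the dataset exists. A get-or-create wrapper avoids that. This is a sketch assuming the client exposes `has_dataset`/`read_dataset`/`create_dataset` as in the LangSmith Python SDK — the wrapper function itself is ours:

```python
def get_or_create_dataset(client, name, description=""):
    """Reuse an existing dataset by name; create it only if absent.

    Keeps dataset setup idempotent so it can run safely in CI or on
    every developer machine.
    """
    if client.has_dataset(dataset_name=name):
        return client.read_dataset(dataset_name=name)
    return client.create_dataset(dataset_name=name, description=description)
```

Taking the client as a parameter also makes this testable with a stub, without network access.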
Step 3: Add Examples from Production Traces
Manual examples are a good start. The best datasets come from real traffic — traces where users got bad answers, edge cases that slipped through, or inputs your chain handles unusually well.
# Fetch recent traces from a specific project
runs = client.list_runs(
project_name="my-qa-chain",
is_root=True, # top-level runs only
error=False, # skip errored runs
limit=50,
)
# Add selected runs directly as dataset examples
for run in runs:
# Filter for traces worth benchmarking
if run.feedback_stats and run.feedback_stats.get("user_score", {}).get("avg", 1) < 0.6:
client.create_examples(
inputs=[run.inputs],
outputs=[run.outputs],
dataset_id=dataset.id,
)
This is how you turn user feedback into a regression test suite. Any trace where a user gave a thumbs-down is a candidate.
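The threshold check in the loop above is easier to reason about (and unit-test) when pulled into a plain function. A sketch, assuming `feedback_stats` has the dict shape shown in the loop — the function name is ours:

```python
def is_regression_candidate(feedback_stats, key="user_score", threshold=0.6):
    """True if the run's average feedback score under `key` is below threshold.

    Runs with no feedback at all are skipped — they're neither good nor bad yet.
    """
    if not feedback_stats:
        return False
    stats = feedback_stats.get(key)
    if not stats or "avg" not in stats:
        return False
    return stats["avg"] < threshold
```

In the loop this replaces the inline condition: `if is_regression_candidate(run.feedback_stats): ...`.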
Step 4: Define Your Evaluator
The evaluator is a Python function that takes a run (what your chain produced) and an example (what the dataset says is correct) and returns a score.
from langsmith.schemas import Run, Example
from langsmith.evaluation import EvaluationResult
def answer_relevance_evaluator(run: Run, example: Example) -> EvaluationResult:
"""
Scores whether the chain's answer addresses the question.
Uses an LLM judge — consistent and scalable across thousands of examples.
"""
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
question = example.inputs["question"]
reference = example.outputs["answer"]
prediction = run.outputs["answer"]
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "You are an evaluator. Score the answer on relevance and accuracy compared to the reference. Return only a JSON object: {{\"score\": 0-1, \"reason\": \"one sentence\"}}"),
("human", "Question: {question}\nReference: {reference}\nPrediction: {prediction}"),
])
result = llm.invoke(prompt.format_messages(
question=question,
reference=reference,
prediction=prediction,
))
import json
parsed = json.loads(result.content)
return EvaluationResult(
key="answer_relevance",
score=parsed["score"],
comment=parsed["reason"],
)
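One fragility in the evaluator above: json.loads fails if the judge model wraps its reply in a markdown fence or adds prose. A defensive parser helps — stdlib only, and the helper name `parse_judge_json` is ours:

```python
import json
import re

def parse_judge_json(text):
    """Parse an LLM judge reply that should contain a JSON object.

    Tolerates markdown code fences and surrounding prose by extracting
    the first {...} span before calling json.loads.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in judge output: {text!r}")
    return json.loads(match.group(0))
```

Swap it in for the raw `json.loads(result.content)` call; with `temperature=0` failures are rare, but one malformed reply shouldn't crash a whole experiment.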
You can stack multiple evaluators — one for relevance, one for factual grounding, one for response length. Each produces a separate score column in your results.
Step 5: Run the Evaluation
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# The chain you want to test
def qa_chain(inputs: dict) -> dict:
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "Answer the question concisely and accurately."),
("human", "{question}"),
])
response = llm.invoke(prompt.format_messages(question=inputs["question"]))
return {"answer": response.content}
results = evaluate(
qa_chain,
data="qa-benchmark-v1", # dataset name or ID
evaluators=[answer_relevance_evaluator],
experiment_prefix="gpt-4o-baseline",
metadata={"model": "gpt-4o", "prompt_version": "v1"},
)
print(results)
Expected output (abridged; the exact format varies by SDK version):
View results at: https://smith.langchain.com/o/.../datasets/.../compare
{'answer_relevance': {'min': 0.6, 'mean': 0.84, 'max': 1.0}}
Every run is stored as a named experiment. You can compare experiments side-by-side in the LangSmith UI.
Step 6: Add a String Match Evaluator for Deterministic Checks
LLM judges are good for subjective quality. For deterministic criteria — does the answer contain the right entity, is the format correct — use exact or fuzzy string matching.
from langsmith.evaluation import EvaluationResult
def contains_key_term_evaluator(run: Run, example: Example) -> EvaluationResult:
"""
Checks if a required term from the reference appears in the prediction.
Useful for verifying factual grounding without an LLM judge call.
"""
reference = example.outputs["answer"].lower()
prediction = run.outputs["answer"].lower()
# Use long words from the reference as stand-in "key terms" (simplified)
# In production: use spacy noun phrases or a curated term list per example
key_terms = [w for w in reference.split() if len(w) > 6][:3]
hits = sum(1 for term in key_terms if term in prediction)
score = hits / len(key_terms) if key_terms else 0.0
return EvaluationResult(
key="key_term_coverage",
score=score,
comment=f"{hits}/{len(key_terms)} key terms present",
)
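The docstring above suggests a curated term list per example; that variant is simple enough to sketch as a plain function. The name `term_coverage` is ours — in practice you'd store the terms in each example's metadata:

```python
def term_coverage(prediction, key_terms):
    """Fraction of required terms (case-insensitive) found in the prediction."""
    if not key_terms:
        return 0.0
    pred = prediction.lower()
    hits = sum(1 for term in key_terms if term.lower() in pred)
    return hits / len(key_terms)
```

Curated terms avoid the main weakness of the long-word heuristic: words like "combines" or "documents" score as key terms even though they carry no factual weight.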
Pass both evaluators to a single evaluate() call:
results = evaluate(
qa_chain,
data="qa-benchmark-v1",
evaluators=[answer_relevance_evaluator, contains_key_term_evaluator],
experiment_prefix="gpt-4o-dual-eval",
)
Step 7: Run Evals in CI
Catch regressions before merge. Add this to your GitHub Actions workflow:
# .github/workflows/eval.yml
name: LLM Evaluation
on:
pull_request:
paths:
- "src/chains/**"
- "prompts/**"
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Install dependencies
run: pip install langsmith langchain-openai
- name: Run evaluation
env:
LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: python scripts/run_eval.py --fail-below 0.75
Your run_eval.py script:
import argparse
from langsmith.evaluation import evaluate
# Import your chain and evaluator from your own modules, e.g.:
# from myapp.chains import qa_chain
# from myapp.evals import answer_relevance_evaluator
parser = argparse.ArgumentParser()
parser.add_argument("--fail-below", type=float, default=0.75)
args = parser.parse_args()
results = evaluate(
qa_chain,
data="qa-benchmark-v1",
evaluators=[answer_relevance_evaluator],
experiment_prefix="ci-pr-eval",
)
mean_score = results.to_pandas()["feedback.answer_relevance"].mean()
print(f"Mean answer_relevance: {mean_score:.3f}")
if mean_score < args.fail_below:
print(f"FAIL: score {mean_score:.3f} below threshold {args.fail_below}")
exit(1)
print("PASS")
PRs that degrade your benchmark score will now block merge.
Verification
Open the LangSmith UI and navigate to Datasets & Testing → qa-benchmark-v1 → Experiments.
You should see:
- Each experiment listed with a timestamp and prefix
- Per-example scores visible in the comparison table
- Aggregate metrics (mean, min, max) per evaluator key
To compare two experiments programmatically:
# List experiments for a dataset
experiments = client.list_projects(reference_dataset_name="qa-benchmark-v1")
for exp in experiments:
print(exp.name, exp.extra.get("metadata", {}))
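Once you can enumerate experiments, the comparison you usually want is "did the candidate's mean score drop against the baseline?" A pure-Python sketch, assuming you've already extracted per-example score lists (e.g. from `results.to_pandas()`); the function name is ours:

```python
def score_delta(baseline_scores, candidate_scores):
    """Mean score change from baseline to candidate (negative = regression)."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("Both experiments need at least one score")
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return candidate_mean - baseline_mean

# score_delta([0.8, 0.9], [0.7, 0.8])  -> roughly -0.1, i.e. a regression
```

Comparing means per evaluator key mirrors what the LangSmith comparison view shows, but gives you a number you can gate on in scripts.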
What You Learned
- A LangSmith dataset is a versioned benchmark — build it once, run it on every change
- LLM judges work well for subjective quality; string evaluators handle deterministic checks
- experiment_prefix and metadata are how you compare runs over time in the UI
- CI integration turns eval from a manual step into a merge gate
When NOT to use this approach: For very long-form outputs (multi-page documents, code generation), LLM-as-judge costs add up fast. Consider heuristic evaluators or human review sampling instead of evaluating every CI run.
Tested on LangSmith SDK 0.2.x, LangChain 0.3.x, Python 3.12, Ubuntu 24.04