Problem: You Don't Know If Your LLM Got Worse
You change a prompt. The app feels fine in manual testing. You ship it. Three days later, users report that summaries are truncated and citations stopped working.
This is what happens without evaluation datasets. You have no repeatable benchmark — no way to catch regressions before they reach production.
You'll learn:
- How to create and version LangSmith datasets from real traces
- How to write custom evaluators for your specific quality criteria
- How to run evals in CI so regressions are caught automatically
Time: 20 min | Difficulty: Intermediate
Why LangSmith Datasets Exist
A LangSmith dataset is a versioned collection of input/output pairs — examples that define what "correct" looks like for your chain or agent.
Every time you change a model, prompt, or retrieval strategy, you run your chain against the same dataset and score the results. If scores drop, you know exactly what broke and when.
Dataset (inputs + reference outputs)
│
▼
Your LLM Chain
│
▼
Evaluator (scores each output)
│
▼
Experiment Results (stored in LangSmith)
Datasets are independent of your code. You build them once and reuse them across every experiment.
Setup
Step 1: Install the LangSmith SDK
# The LangSmith SDK is a standalone package (it is not bundled with langchain-core)
pip install langsmith langchain-openai
# Verify
python -c "import langsmith; print(langsmith.__version__)"
Expected: 0.2.x or higher
Set your credentials:
export LANGSMITH_API_KEY="ls__your_key_here"
export LANGSMITH_TRACING=true
export OPENAI_API_KEY="sk-..."
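Eval scripts fail in confusing ways when a credential is missing, so it's worth failing fast with a clear message. Here's a minimal stdlib-only sketch — the helper name `missing_env` is ours, not part of the LangSmith SDK:

```python
import os

REQUIRED_VARS = ["LANGSMITH_API_KEY", "LANGSMITH_TRACING", "OPENAI_API_KEY"]

def missing_env(required, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Call at the top of any eval script:
#   missing = missing_env(REQUIRED_VARS)
#   if missing:
#       raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Passing the environment mapping as a parameter keeps the check unit-testable without touching the real process environment.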
Step 2: Create a Dataset
You can create a dataset three ways: through the web UI, by uploading a CSV, or programmatically with the SDK. Start with the programmatic approach — it's the most reproducible.
from langsmith import Client
client = Client()
# Define your examples: input dict + reference output dict
examples = [
{
"inputs": {"question": "What is RAG?"},
"outputs": {"answer": "RAG stands for Retrieval-Augmented Generation. It combines a retriever that fetches relevant documents with an LLM that generates answers grounded in those documents."},
},
{
"inputs": {"question": "What is the difference between LangChain and LangGraph?"},
"outputs": {"answer": "LangChain is a framework for building LLM chains and pipelines. LangGraph extends it with stateful, graph-based agent workflows that support cycles and branching."},
},
{
"inputs": {"question": "When should I use streaming in an LLM API call?"},
"outputs": {"answer": "Use streaming when the user is waiting for a response in real time. It reduces perceived latency by sending tokens as they generate instead of waiting for the full response."},
},
]
dataset = client.create_dataset(
dataset_name="qa-benchmark-v1",
description="Q&A pairs for evaluating answer quality and factual accuracy",
)
client.create_examples(
inputs=[e["inputs"] for e in examples],
outputs=[e["outputs"] for e in examples],
dataset_id=dataset.id,
)
print(f"Dataset created: {dataset.id}")
Expected output:
Dataset created: ds_abc123...
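One pitfall: re-running this setup script with a fixed dataset name will fail once the dataset exists. A get-or-create wrapper avoids that. This is a sketch assuming the client exposes `has_dataset`/`read_dataset`/`create_dataset` as in the LangSmith Python SDK — the wrapper function itself is ours:

```python
def get_or_create_dataset(client, name, description=""):
    """Reuse an existing dataset by name; create it only if absent.

    Keeps dataset setup idempotent so it can run safely in CI or on
    every developer machine.
    """
    if client.has_dataset(dataset_name=name):
        return client.read_dataset(dataset_name=name)
    return client.create_dataset(dataset_name=name, description=description)
```

Taking the client as a parameter also makes this testable with a stub, without network access.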
Step 3: Add Examples from Production Traces
Manual examples are a good start. The best datasets come from real traffic — traces where users got bad answers, edge cases that slipped through, or inputs your chain handles unusually well.
# Fetch recent traces from a specific project
runs = client.list_runs(
project_name="my-qa-chain",
is_root=True, # top-level runs only
error=False, # skip errored runs
limit=50,
)
# Add selected runs directly as dataset examples
for run in runs:
# Filter for traces worth benchmarking
if run.feedback_stats and run.feedback_stats.get("user_score", {}).get("avg", 1) < 0.6:
client.create_examples(
inputs=[run.inputs],
outputs=[run.outputs],
dataset_id=dataset.id,
)
This is how you turn user feedback into a regression test suite. Any trace where a user gave a thumbs-down is a candidate.
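The threshold check in the loop above is easier to reason about (and unit-test) when pulled into a plain function. A sketch, assuming `feedback_stats` has the dict shape shown in the loop — the function name is ours:

```python
def is_regression_candidate(feedback_stats, key="user_score", threshold=0.6):
    """True if the run's average feedback score under `key` is below threshold.

    Runs with no feedback at all are skipped — they're neither good nor bad yet.
    """
    if not feedback_stats:
        return False
    stats = feedback_stats.get(key)
    if not stats or "avg" not in stats:
        return False
    return stats["avg"] < threshold
```

In the loop this replaces the inline condition: `if is_regression_candidate(run.feedback_stats): ...`.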
Step 4: Define Your Evaluator
The evaluator is a Python function that takes a run (what your chain produced) and an example (what the dataset says is correct) and returns a score.
from langsmith.schemas import Run, Example
from langsmith.evaluation import EvaluationResult
def answer_relevance_evaluator(run: Run, example: Example) -> EvaluationResult:
"""
Scores whether the chain's answer addresses the question.
Uses an LLM judge — consistent and scalable across thousands of examples.
"""
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
question = example.inputs["question"]
reference = example.outputs["answer"]
prediction = run.outputs["answer"]
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "You are an evaluator. Score the answer on relevance and accuracy compared to the reference. Return only a JSON object: {{\"score\": 0-1, \"reason\": \"one sentence\"}}"),
("human", "Question: {question}\nReference: {reference}\nPrediction: {prediction}"),
])
result = llm.invoke(prompt.format_messages(
question=question,
reference=reference,
prediction=prediction,
))
import json
parsed = json.loads(result.content)
return EvaluationResult(
key="answer_relevance",
score=parsed["score"],
comment=parsed["reason"],
)
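One fragility in the evaluator above: json.loads fails if the judge model wraps its reply in a markdown fence or adds prose. A defensive parser helps — stdlib only, and the helper name `parse_judge_json` is ours:

```python
import json
import re

def parse_judge_json(text):
    """Parse an LLM judge reply that should contain a JSON object.

    Tolerates markdown code fences and surrounding prose by extracting
    the first {...} span before calling json.loads.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in judge output: {text!r}")
    return json.loads(match.group(0))
```

Swap it in for the raw `json.loads(result.content)` call; with `temperature=0` failures are rare, but one malformed reply shouldn't crash a whole experiment.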
You can stack multiple evaluators — one for relevance, one for factual grounding, one for response length. Each produces a separate score column in your results.
Step 5: Run the Evaluation
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# The chain you want to test
def qa_chain(inputs: dict) -> dict:
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "Answer the question concisely and accurately."),
("human", "{question}"),
])
response = llm.invoke(prompt.format_messages(question=inputs["question"]))
return {"answer": response.content}
results = evaluate(
qa_chain,
data="qa-benchmark-v1", # dataset name or ID
evaluators=[answer_relevance_evaluator],
experiment_prefix="gpt-4o-baseline",
metadata={"model": "gpt-4o", "prompt_version": "v1"},
)
print(results)
Expected output (abridged; the exact format varies by SDK version):
View results at: https://smith.langchain.com/o/.../datasets/.../compare
{'answer_relevance': {'min': 0.6, 'mean': 0.84, 'max': 1.0}}
Every run is stored as a named experiment. You can compare experiments side-by-side in the LangSmith UI.
Step 6: Add a String Match Evaluator for Deterministic Checks
LLM judges are good for subjective quality. For deterministic criteria — does the answer contain the right entity, is the format correct — use exact or fuzzy string matching.
from langsmith.evaluation import EvaluationResult
def contains_key_term_evaluator(run: Run, example: Example) -> EvaluationResult:
"""
Checks if a required term from the reference appears in the prediction.
Useful for verifying factual grounding without an LLM judge call.
"""
reference = example.outputs["answer"].lower()
prediction = run.outputs["answer"].lower()
# Use long words from the reference as stand-in "key terms" (simplified)
# In production: use spacy noun phrases or a curated term list per example
key_terms = [w for w in reference.split() if len(w) > 6][:3]
hits = sum(1 for term in key_terms if term in prediction)
score = hits / len(key_terms) if key_terms else 0.0
return EvaluationResult(
key="key_term_coverage",
score=score,
comment=f"{hits}/{len(key_terms)} key terms present",
)
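The docstring above suggests a curated term list per example; that variant is simple enough to sketch as a plain function. The name `term_coverage` is ours — in practice you'd store the terms in each example's metadata:

```python
def term_coverage(prediction, key_terms):
    """Fraction of required terms (case-insensitive) found in the prediction."""
    if not key_terms:
        return 0.0
    pred = prediction.lower()
    hits = sum(1 for term in key_terms if term.lower() in pred)
    return hits / len(key_terms)
```

Curated terms avoid the main weakness of the long-word heuristic: words like "combines" or "documents" score as key terms even though they carry no factual weight.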
Pass both evaluators to a single evaluate() call:
results = evaluate(
qa_chain,
data="qa-benchmark-v1",
evaluators=[answer_relevance_evaluator, contains_key_term_evaluator],
experiment_prefix="gpt-4o-dual-eval",
)
Step 7: Run Evals in CI
Catch regressions before merge. Add this to your GitHub Actions workflow:
# .github/workflows/eval.yml
name: LLM Evaluation
on:
pull_request:
paths:
- "src/chains/**"
- "prompts/**"
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Install dependencies
run: pip install langsmith langchain-openai
- name: Run evaluation
env:
LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: python scripts/run_eval.py --fail-below 0.75
Your run_eval.py script:
import argparse
from langsmith.evaluation import evaluate
# Import your chain and evaluator from your own modules, e.g.:
# from myapp.chains import qa_chain
# from myapp.evals import answer_relevance_evaluator
parser = argparse.ArgumentParser()
parser.add_argument("--fail-below", type=float, default=0.75)
args = parser.parse_args()
results = evaluate(
qa_chain,
data="qa-benchmark-v1",
evaluators=[answer_relevance_evaluator],
experiment_prefix="ci-pr-eval",
)
mean_score = results.to_pandas()["feedback.answer_relevance"].mean()
print(f"Mean answer_relevance: {mean_score:.3f}")
if mean_score < args.fail_below:
print(f"FAIL: score {mean_score:.3f} below threshold {args.fail_below}")
exit(1)
print("PASS")
PRs that degrade your benchmark score will now block merge.
Verification
Open the LangSmith UI and navigate to Datasets & Testing → qa-benchmark-v1 → Experiments.
You should see:
- Each experiment listed with a timestamp and prefix
- Per-example scores visible in the comparison table
- Aggregate metrics (mean, min, max) per evaluator key
To compare two experiments programmatically:
# List experiments for a dataset
experiments = client.list_projects(reference_dataset_name="qa-benchmark-v1")
for exp in experiments:
print(exp.name, exp.extra.get("metadata", {}))
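Once you can enumerate experiments, the comparison you usually want is "did the candidate's mean score drop against the baseline?" A pure-Python sketch, assuming you've already extracted per-example score lists (e.g. from `results.to_pandas()`); the function name is ours:

```python
def score_delta(baseline_scores, candidate_scores):
    """Mean score change from baseline to candidate (negative = regression)."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("Both experiments need at least one score")
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return candidate_mean - baseline_mean

# score_delta([0.8, 0.9], [0.7, 0.8])  -> roughly -0.1, i.e. a regression
```

Comparing means per evaluator key mirrors what the LangSmith comparison view shows, but gives you a number you can gate on in scripts.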
What You Learned
- A LangSmith dataset is a versioned benchmark — build it once, run it on every change
- LLM judges work well for subjective quality; string evaluators handle deterministic checks
- experiment_prefix and metadata are how you compare runs over time in the UI
- CI integration turns eval from a manual step into a merge gate
When NOT to use this approach: For very long-form outputs (multi-page documents, code generation), LLM-as-judge costs add up fast. Consider heuristic evaluators or human review sampling instead of evaluating every CI run.
Tested on LangSmith SDK 0.2.x, LangChain 0.3.x, Python 3.12, Ubuntu 24.04