Problem: You Don't Know If Your LLM Got Worse
You tweak a prompt. Swap a model. Adjust a RAG retriever. Then what? Without a structured evaluation pipeline, you're flying blind — relying on vibes and spot checks to decide if the change was an improvement.
LangSmith's evaluation suite gives you a repeatable way to measure LLM output quality: define a dataset, run evaluators, and compare scores across experiments automatically.
You'll learn:
- How to create a LangSmith dataset from real production traces
- How to run built-in and custom evaluators against a target function
- How to compare experiments and catch regressions in CI
Time: 25 min | Difficulty: Intermediate
Why Ad-Hoc Testing Fails for LLMs
Unit tests work for deterministic code. LLMs are probabilistic — the same input can produce different outputs, and "correct" is often subjective. What you need instead:
- A fixed dataset of inputs (and optionally expected outputs)
- Evaluators that score each output on dimensions you care about (correctness, tone, groundedness)
- Experiment tracking so you can compare run A vs run B with actual numbers
LangSmith wraps all three into one workflow.
Prerequisites:
- LangSmith account (free tier works)
- langsmith Python SDK >= 0.1.0
- A LangChain or plain Python LLM function to evaluate
Solution
Step 1: Install the SDK and Authenticate
pip install langsmith langchain-openai openai
# Set these in your shell or .env
export LANGSMITH_API_KEY="ls__your_key_here"
export LANGSMITH_TRACING=true
export OPENAI_API_KEY="sk-your_key_here"
Verify the connection:
from langsmith import Client
client = Client()
print(list(client.list_projects()))  # list_projects() returns an iterator, so materialize it
Expected output: A list of Project objects (empty list is fine on a new account).
If it fails:
- AuthenticationError → Double-check LANGSMITH_API_KEY is exported in the current shell session
- Connection refused → Check firewall/proxy; LangSmith calls api.smith.langchain.com
Step 2: Create an Evaluation Dataset
A dataset is a collection of (input, reference_output) pairs. Reference outputs are optional in general, but correctness-based evaluators need them.
from langsmith import Client
client = Client()
# Define your examples — these are the inputs your LLM will be tested on
examples = [
{
"inputs": {"question": "What is the capital of France?"},
"outputs": {"answer": "Paris"},
},
{
"inputs": {"question": "Who wrote Hamlet?"},
"outputs": {"answer": "William Shakespeare"},
},
{
"inputs": {"question": "What does RAG stand for in AI?"},
"outputs": {"answer": "Retrieval-Augmented Generation"},
},
{
"inputs": {"question": "What year was Python first released?"},
"outputs": {"answer": "1991"},
},
]
dataset_name = "qa-factual-v1"
# Creates the dataset if it doesn't exist; safe to re-run
if not client.has_dataset(dataset_name=dataset_name):
dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
inputs=[e["inputs"] for e in examples],
outputs=[e["outputs"] for e in examples],
dataset_id=dataset.id,
)
print(f"Created dataset with {len(examples)} examples")
else:
print("Dataset already exists — skipping creation")
Tip: In production, populate datasets from real traces. LangSmith's UI lets you add any logged run directly to a dataset with one click — no manual data entry.
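If you'd rather script this, the same idea can be sketched with the SDK. The helper below is our own illustration, "my-prod-project" is a placeholder, and the list_runs filters shown are one reasonable choice, not the only one:

```python
def runs_to_examples(runs):
    """Convert logged runs (anything with 'inputs'/'outputs' dicts)
    into the parallel lists that client.create_examples expects."""
    inputs = [r["inputs"] for r in runs]
    outputs = [r["outputs"] for r in runs]
    return inputs, outputs

# Sketch using the client from Step 2 ("my-prod-project" is a placeholder):
# runs = client.list_runs(project_name="my-prod-project", is_root=True, error=False)
# inputs, outputs = runs_to_examples(
#     [{"inputs": r.inputs, "outputs": r.outputs} for r in runs]
# )
# client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
```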
Step 3: Define the Target Function
The target function is what you're evaluating. It takes a dict of inputs and returns a dict of outputs. This is where your LLM call lives.
from openai import OpenAI
openai_client = OpenAI()
def my_llm(inputs: dict) -> dict:
"""
Wraps an OpenAI call. LangSmith passes each dataset example's
'inputs' dict here and records the returned 'outputs' dict.
"""
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Answer the question concisely. One sentence maximum.",
},
{
"role": "user",
"content": inputs["question"],
},
],
temperature=0, # Deterministic outputs make evaluation more stable
)
return {"answer": response.choices[0].message.content.strip()}
Why temperature=0: Evaluation results are more reliable when outputs are consistent across runs. Use your production temperature in final benchmarks, not during iteration.
Step 4: Run an Evaluation with Built-In Evaluators
LangSmith ships several ready-to-use evaluators. LangChainStringEvaluator("qa") checks if the output answers the question correctly relative to the reference.
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_openai import ChatOpenAI

# "qa" evaluator uses an LLM judge to compare output vs reference answer.
# The judge must be a LangChain chat model; a raw OpenAI client will not work here.
qa_evaluator = LangChainStringEvaluator(
    "qa",
    config={"llm": ChatOpenAI(model="gpt-4o", temperature=0)},  # The LLM used as judge
)
results = evaluate(
my_llm, # Target function
data=dataset_name, # Dataset name or ID
evaluators=[qa_evaluator],
experiment_prefix="gpt-4o-mini-baseline",
metadata={"model": "gpt-4o-mini", "temperature": 0},
)
print(results.to_pandas())
Expected output:
input.question output.answer feedback.qa
0 What is the capital... Paris 1.0
1 Who wrote Hamlet? William Shak... 1.0
2 What does RAG stand... Retrieval-Aug.. 1.0
3 What year was Python... 1991 1.0
Score of 1.0 = correct, 0.0 = incorrect. The results also appear in the LangSmith UI under Experiments.
Step 5: Write a Custom Evaluator
Built-in evaluators cover correctness, but you often need custom checks — response length, JSON validity, tone, or domain-specific rules.
from langsmith.schemas import Run, Example
def brevity_evaluator(run: Run, example: Example) -> dict:
"""
Penalizes answers longer than 20 words.
Returns a score between 0.0 and 1.0.
"""
    output = (run.outputs or {}).get("answer", "")  # run.outputs can be None on failed runs
word_count = len(output.split())
# Score decays linearly after 20 words; floors at 0
score = max(0.0, 1.0 - max(0, word_count - 20) / 20)
return {
"key": "brevity",
"score": round(score, 2),
"comment": f"{word_count} words",
}
# Run again with both evaluators
results = evaluate(
my_llm,
data=dataset_name,
evaluators=[qa_evaluator, brevity_evaluator],
experiment_prefix="gpt-4o-mini-brevity-check",
metadata={"model": "gpt-4o-mini", "temperature": 0},
)
Custom evaluators receive the full Run object (which includes inputs, outputs, and latency) and the Example (inputs + reference outputs). You can make LLM calls inside them for LLM-as-judge patterns.
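As a sketch of that LLM-as-judge pattern (the judge prompt, the parse_verdict helper, and the judge_correct key are our own illustration, not part of the LangSmith API):

```python
def parse_verdict(text: str) -> float:
    """Map a judge reply to a score: 1.0 for CORRECT, else 0.0.
    Checks INCORRECT first, since the string 'INCORRECT' contains 'CORRECT'."""
    upper = text.strip().upper()
    if "INCORRECT" in upper:
        return 0.0
    return 1.0 if "CORRECT" in upper else 0.0

def judge_evaluator(run, example) -> dict:
    """LLM-as-judge: ask a model whether the answer matches the reference."""
    from openai import OpenAI  # imported lazily so parse_verdict stays dependency-free
    client = OpenAI()
    answer = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")
    question = example.inputs.get("question", "")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nReference answer: {reference}\n"
                f"Candidate answer: {answer}\n"
                "Reply with exactly one word: CORRECT or INCORRECT."
            ),
        }],
    )
    verdict = response.choices[0].message.content
    return {"key": "judge_correct", "score": parse_verdict(verdict)}
```

Pass judge_evaluator in the evaluators list exactly like brevity_evaluator above.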
Step 6: Compare Experiments
Now swap the model and run again — this is where the value compounds.
def my_llm_upgraded(inputs: dict) -> dict:
"""Same function, different model — compare against baseline."""
response = openai_client.chat.completions.create(
model="gpt-4o", # Upgraded from gpt-4o-mini
messages=[
{"role": "system", "content": "Answer the question concisely. One sentence maximum."},
{"role": "user", "content": inputs["question"]},
],
temperature=0,
)
return {"answer": response.choices[0].message.content.strip()}
results_v2 = evaluate(
my_llm_upgraded,
data=dataset_name,
evaluators=[qa_evaluator, brevity_evaluator],
experiment_prefix="gpt-4o-upgraded",
metadata={"model": "gpt-4o", "temperature": 0},
)
In the LangSmith UI, open Datasets & Testing → your dataset → Compare Experiments to see a side-by-side score table across all runs.
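You can also diff the numbers in code. A sketch with pandas (score_delta is a hypothetical helper; the feedback column names are the ones shown in Step 4's output):

```python
import pandas as pd

def score_delta(baseline: pd.DataFrame, candidate: pd.DataFrame, feedback_cols) -> dict:
    """Mean score change (candidate minus baseline) per feedback column.
    Positive values mean the candidate experiment improved that metric."""
    return {col: candidate[col].mean() - baseline[col].mean() for col in feedback_cols}

# Sketch against the two experiments above:
# delta = score_delta(results.to_pandas(), results_v2.to_pandas(),
#                     ["feedback.qa", "feedback.brevity"])
# print(delta)
```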
Step 7: Add Evaluation to CI
Run evaluations on every PR to catch regressions before they ship.
# eval_ci.py — run this in your CI pipeline
import sys
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_openai import ChatOpenAI
from openai import OpenAI

openai_client = OpenAI()

def my_llm(inputs):
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the question concisely."},
            {"role": "user", "content": inputs["question"]},
        ],
        temperature=0,
    )
    return {"answer": response.choices[0].message.content.strip()}

# The judge must be a LangChain chat model, not a raw OpenAI client
qa_evaluator = LangChainStringEvaluator("qa", config={"llm": ChatOpenAI(model="gpt-4o", temperature=0)})
results = evaluate(
my_llm,
data="qa-factual-v1",
evaluators=[qa_evaluator],
experiment_prefix="ci-run",
)
df = results.to_pandas()
avg_qa_score = df["feedback.qa"].mean()
print(f"Average QA score: {avg_qa_score:.2f}")
# Fail the CI pipeline if quality drops below threshold
PASS_THRESHOLD = 0.85
if avg_qa_score < PASS_THRESHOLD:
print(f"FAIL: Score {avg_qa_score:.2f} below threshold {PASS_THRESHOLD}")
sys.exit(1)
print("PASS: Quality threshold met")
sys.exit(0)
Add to .github/workflows/eval.yml:
name: LLM Evaluation
on:
pull_request:
paths:
- "src/llm/**"
- "prompts/**"
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
      - run: pip install langsmith langchain-openai openai
- run: python eval_ci.py
env:
LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The pipeline now fails automatically if your LLM changes drop average QA score below 0.85.
Verification
Run a full eval end-to-end and check the results:
df = results.to_pandas()
print(df[["input.question", "output.answer", "feedback.qa", "feedback.brevity"]])
print(f"\nMean QA score: {df['feedback.qa'].mean():.2f}")
print(f"Mean brevity score: {df['feedback.brevity'].mean():.2f}")
You should see: A DataFrame with one row per dataset example, scores populated in every feedback column, no NaN values.
Check the LangSmith UI at smith.langchain.com → your project → Experiments to see the run logged with all metadata and per-example scores.
What You Learned
- LangSmith evaluations need three things: a dataset, a target function, and evaluators
- Built-in evaluators like "qa" use an LLM-as-judge pattern — they cost tokens but scale to subjective quality checks
- Custom evaluators are plain Python functions — use them for deterministic checks (length, format, regex) with zero latency overhead
- Comparing experiments by model, prompt, or temperature is where the real value is — one eval run in isolation tells you little
- CI integration turns evaluation from a manual chore into a quality gate
Limitation: LLM-as-judge evaluators ("qa", "cot_qa", "criteria") add latency and cost proportional to your dataset size. For large datasets (1000+ examples), run them nightly rather than on every PR — use fast deterministic evaluators in the hot path instead.
Tested on LangSmith SDK 0.2.x, Python 3.12, OpenAI gpt-4o-mini and gpt-4o, Ubuntu 24.04