Problem: You Don't Know If Your LLM Got Worse
You tweak a prompt. Swap a model. Adjust a RAG retriever. Then what? Without a structured evaluation pipeline, you're flying blind — relying on vibes and spot checks to decide if the change was an improvement.
LangSmith's evaluation suite gives you a repeatable way to measure LLM output quality: define a dataset, run evaluators, and compare scores across experiments automatically.
You'll learn:
- How to create a LangSmith dataset from real production traces
- How to run built-in and custom evaluators against a target function
- How to compare experiments and catch regressions in CI
Time: 25 min | Difficulty: Intermediate
Why Ad-Hoc Testing Fails for LLMs
Unit tests work for deterministic code. LLMs are probabilistic — the same input can produce different outputs, and "correct" is often subjective. What you need instead:
- A fixed dataset of inputs (and optionally expected outputs)
- Evaluators that score each output on dimensions you care about (correctness, tone, groundedness)
- Experiment tracking so you can compare run A vs run B with actual numbers
LangSmith wraps all three into one workflow.
Prerequisites:
- LangSmith account (free tier works)
- langsmith Python SDK >= 0.1.0
- A LangChain or plain Python LLM function to evaluate
Solution
Step 1: Install the SDK and Authenticate
pip install langsmith langchain-openai openai
# Set these in your shell or .env
export LANGSMITH_API_KEY="ls__your_key_here"
export LANGSMITH_TRACING=true
export OPENAI_API_KEY="sk-your_key_here"
Verify the connection:
from langsmith import Client
client = Client()
print(list(client.list_projects()))  # list_projects() returns an iterator, so materialize it
Expected output: A list of Project objects (empty list is fine on a new account).
If it fails:
- AuthenticationError → Double-check LANGSMITH_API_KEY is exported in the current shell session
- Connection refused → Check firewall/proxy; LangSmith calls api.smith.langchain.com
Step 2: Create an Evaluation Dataset
A dataset is a collection of (input, reference_output) pairs. Reference outputs are optional in general, but correctness-based evaluators need them.
from langsmith import Client
client = Client()
# Define your examples — these are the inputs your LLM will be tested on
examples = [
{
"inputs": {"question": "What is the capital of France?"},
"outputs": {"answer": "Paris"},
},
{
"inputs": {"question": "Who wrote Hamlet?"},
"outputs": {"answer": "William Shakespeare"},
},
{
"inputs": {"question": "What does RAG stand for in AI?"},
"outputs": {"answer": "Retrieval-Augmented Generation"},
},
{
"inputs": {"question": "What year was Python first released?"},
"outputs": {"answer": "1991"},
},
]
dataset_name = "qa-factual-v1"
# Creates the dataset if it doesn't exist; safe to re-run
if not client.has_dataset(dataset_name=dataset_name):
dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
inputs=[e["inputs"] for e in examples],
outputs=[e["outputs"] for e in examples],
dataset_id=dataset.id,
)
print(f"Created dataset with {len(examples)} examples")
else:
print("Dataset already exists — skipping creation")
Tip: In production, populate datasets from real traces. LangSmith's UI lets you add any logged run directly to a dataset with one click — no manual data entry.
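If you'd rather script this, the same idea can be sketched with the SDK. The helper below is our own illustration, "my-prod-project" is a placeholder, and the list_runs filters shown are one reasonable choice, not the only one:

```python
def runs_to_examples(runs):
    """Convert logged runs (anything with 'inputs'/'outputs' dicts)
    into the parallel lists that client.create_examples expects."""
    inputs = [r["inputs"] for r in runs]
    outputs = [r["outputs"] for r in runs]
    return inputs, outputs

# Sketch using the client from Step 2 ("my-prod-project" is a placeholder):
# runs = client.list_runs(project_name="my-prod-project", is_root=True, error=False)
# inputs, outputs = runs_to_examples(
#     [{"inputs": r.inputs, "outputs": r.outputs} for r in runs]
# )
# client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
```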
Step 3: Define the Target Function
The target function is what you're evaluating. It takes a dict of inputs and returns a dict of outputs. This is where your LLM call lives.
from openai import OpenAI
openai_client = OpenAI()
def my_llm(inputs: dict) -> dict:
"""
Wraps an OpenAI call. LangSmith passes each dataset example's
'inputs' dict here and records the returned 'outputs' dict.
"""
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Answer the question concisely. One sentence maximum.",
},
{
"role": "user",
"content": inputs["question"],
},
],
temperature=0, # Deterministic outputs make evaluation more stable
)
return {"answer": response.choices[0].message.content.strip()}
Why temperature=0: Evaluation results are more reliable when outputs are consistent across runs. Use your production temperature in final benchmarks, not during iteration.
Step 4: Run an Evaluation with Built-In Evaluators
LangSmith ships several ready-to-use evaluators. LangChainStringEvaluator("qa") checks if the output answers the question correctly relative to the reference.
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_openai import ChatOpenAI

# "qa" evaluator uses an LLM judge to compare output vs reference answer.
# The judge must be a LangChain chat model; a raw OpenAI client will not work here.
qa_evaluator = LangChainStringEvaluator(
    "qa",
    config={"llm": ChatOpenAI(model="gpt-4o", temperature=0)},  # The LLM used as judge
)
results = evaluate(
my_llm, # Target function
data=dataset_name, # Dataset name or ID
evaluators=[qa_evaluator],
experiment_prefix="gpt-4o-mini-baseline",
metadata={"model": "gpt-4o-mini", "temperature": 0},
)
print(results.to_pandas())
Expected output:
input.question output.answer feedback.qa
0 What is the capital... Paris 1.0
1 Who wrote Hamlet? William Shak... 1.0
2 What does RAG stand... Retrieval-Aug.. 1.0
3 What year was Python... 1991 1.0
Score of 1.0 = correct, 0.0 = incorrect. The results also appear in the LangSmith UI under Experiments.
Step 5: Write a Custom Evaluator
Built-in evaluators cover correctness, but you often need custom checks — response length, JSON validity, tone, or domain-specific rules.
from langsmith.schemas import Run, Example
def brevity_evaluator(run: Run, example: Example) -> dict:
"""
Penalizes answers longer than 20 words.
Returns a score between 0.0 and 1.0.
"""
    output = (run.outputs or {}).get("answer", "")  # run.outputs can be None on failed runs
word_count = len(output.split())
# Score decays linearly after 20 words; floors at 0
score = max(0.0, 1.0 - max(0, word_count - 20) / 20)
return {
"key": "brevity",
"score": round(score, 2),
"comment": f"{word_count} words",
}
# Run again with both evaluators
results = evaluate(
my_llm,
data=dataset_name,
evaluators=[qa_evaluator, brevity_evaluator],
experiment_prefix="gpt-4o-mini-brevity-check",
metadata={"model": "gpt-4o-mini", "temperature": 0},
)
Custom evaluators receive the full Run object (which includes inputs, outputs, and latency) and the Example (inputs + reference outputs). You can make LLM calls inside them for LLM-as-judge patterns.
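As a sketch of that LLM-as-judge pattern (the judge prompt, the parse_verdict helper, and the judge_correct key are our own illustration, not part of the LangSmith API):

```python
def parse_verdict(text: str) -> float:
    """Map a judge reply to a score: 1.0 for CORRECT, else 0.0.
    Checks INCORRECT first, since the string 'INCORRECT' contains 'CORRECT'."""
    upper = text.strip().upper()
    if "INCORRECT" in upper:
        return 0.0
    return 1.0 if "CORRECT" in upper else 0.0

def judge_evaluator(run, example) -> dict:
    """LLM-as-judge: ask a model whether the answer matches the reference."""
    from openai import OpenAI  # imported lazily so parse_verdict stays dependency-free
    client = OpenAI()
    answer = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")
    question = example.inputs.get("question", "")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nReference answer: {reference}\n"
                f"Candidate answer: {answer}\n"
                "Reply with exactly one word: CORRECT or INCORRECT."
            ),
        }],
    )
    verdict = response.choices[0].message.content
    return {"key": "judge_correct", "score": parse_verdict(verdict)}
```

Pass judge_evaluator in the evaluators list exactly like brevity_evaluator above.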
Step 6: Compare Experiments
Now swap the model and run again — this is where the value compounds.
def my_llm_upgraded(inputs: dict) -> dict:
"""Same function, different model — compare against baseline."""
response = openai_client.chat.completions.create(
model="gpt-4o", # Upgraded from gpt-4o-mini
messages=[
{"role": "system", "content": "Answer the question concisely. One sentence maximum."},
{"role": "user", "content": inputs["question"]},
],
temperature=0,
)
return {"answer": response.choices[0].message.content.strip()}
results_v2 = evaluate(
my_llm_upgraded,
data=dataset_name,
evaluators=[qa_evaluator, brevity_evaluator],
experiment_prefix="gpt-4o-upgraded",
metadata={"model": "gpt-4o", "temperature": 0},
)
In the LangSmith UI, open Datasets & Testing → your dataset → Compare Experiments to see a side-by-side score table across all runs.
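You can also diff the numbers in code. A sketch with pandas (score_delta is a hypothetical helper; the feedback column names are the ones shown in Step 4's output):

```python
import pandas as pd

def score_delta(baseline: pd.DataFrame, candidate: pd.DataFrame, feedback_cols) -> dict:
    """Mean score change (candidate minus baseline) per feedback column.
    Positive values mean the candidate experiment improved that metric."""
    return {col: candidate[col].mean() - baseline[col].mean() for col in feedback_cols}

# Sketch against the two experiments above:
# delta = score_delta(results.to_pandas(), results_v2.to_pandas(),
#                     ["feedback.qa", "feedback.brevity"])
# print(delta)
```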
Step 7: Add Evaluation to CI
Run evaluations on every PR to catch regressions before they ship.
# eval_ci.py — run this in your CI pipeline
import sys
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_openai import ChatOpenAI
from openai import OpenAI

openai_client = OpenAI()

def my_llm(inputs):
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the question concisely."},
            {"role": "user", "content": inputs["question"]},
        ],
        temperature=0,
    )
    return {"answer": response.choices[0].message.content.strip()}

# The judge must be a LangChain chat model, not a raw OpenAI client
qa_evaluator = LangChainStringEvaluator("qa", config={"llm": ChatOpenAI(model="gpt-4o", temperature=0)})
results = evaluate(
my_llm,
data="qa-factual-v1",
evaluators=[qa_evaluator],
experiment_prefix="ci-run",
)
df = results.to_pandas()
avg_qa_score = df["feedback.qa"].mean()
print(f"Average QA score: {avg_qa_score:.2f}")
# Fail the CI pipeline if quality drops below threshold
PASS_THRESHOLD = 0.85
if avg_qa_score < PASS_THRESHOLD:
print(f"FAIL: Score {avg_qa_score:.2f} below threshold {PASS_THRESHOLD}")
sys.exit(1)
print("PASS: Quality threshold met")
sys.exit(0)
Add to .github/workflows/eval.yml:
name: LLM Evaluation
on:
pull_request:
paths:
- "src/llm/**"
- "prompts/**"
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
      - run: pip install langsmith langchain-openai openai
- run: python eval_ci.py
env:
LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The pipeline now fails automatically if your LLM changes drop average QA score below 0.85.
Verification
Run a full eval end-to-end and check the results:
df = results.to_pandas()
print(df[["input.question", "output.answer", "feedback.qa", "feedback.brevity"]])
print(f"\nMean QA score: {df['feedback.qa'].mean():.2f}")
print(f"Mean brevity score: {df['feedback.brevity'].mean():.2f}")
You should see: A DataFrame with one row per dataset example, scores populated in every feedback column, no NaN values.
Check the LangSmith UI at smith.langchain.com → your project → Experiments to see the run logged with all metadata and per-example scores.
What You Learned
- LangSmith evaluations need three things: a dataset, a target function, and evaluators
- Built-in evaluators like "qa" use an LLM-as-judge pattern — they cost tokens but scale to subjective quality checks
- Custom evaluators are plain Python functions — use them for deterministic checks (length, format, regex) with zero latency overhead
- Comparing experiments by model, prompt, or temperature is where the real value is — one eval run in isolation tells you little
- CI integration turns evaluation from a manual chore into a quality gate
Limitation: LLM-as-judge evaluators ("qa", "cot_qa", "criteria") add latency and cost proportional to your dataset size. For large datasets (1000+ examples), run them nightly rather than on every PR — use fast deterministic evaluators in the hot path instead.
Tested on LangSmith SDK 0.2.x, Python 3.12, OpenAI gpt-4o-mini and gpt-4o, Ubuntu 24.04