# What Are Annotation Queues and Why They Matter in 2026
LLM evaluation has two sides: automated metrics (correctness scores, latency, token cost) and human judgment (does this answer actually help the user?). Automated evals are fast and cheap. Human evals are slow and expensive — but they catch what metrics miss.
LangSmith annotation queues bridge that gap. They let you route specific traces — bad runs, edge cases, low-confidence outputs — to a human reviewer inside a structured interface. Reviewers score, label, and comment. You get feedback stored as structured data, tied directly to the trace that triggered it.
In 2026, as LLM apps move into regulated or high-stakes domains, "we ran evals" is not enough. You need a paper trail of human oversight. Annotation queues are how you build it.
## How Annotation Queues Work
The mental model is simple: a queue is a filtered inbox for traces.
```
Your LLM app
      │
      ▼
LangSmith Tracing (all runs logged)
      │
      ├─── Automated evals run on every trace
      │
      └─── Filter rule matches → trace added to queue
                 │
                 ▼
      Human reviewer opens queue
        Scores trace on rubric
        Submits feedback
                 │
                 ▼
      Feedback stored as structured data
      Available via SDK / export
```
Three components make this work:
1. The queue itself — a named workspace with a defined rubric (what reviewers score) and assigned reviewers.
2. Routing rules — conditions that add traces to the queue automatically. You can also add traces manually or via the SDK.
3. The annotation UI — a focused interface showing input, output, and the scoring form. Reviewers never need to touch the raw trace data.
## Step 1: Create an Annotation Queue
Navigate to your LangSmith project → Annotation Queues → New Queue.
Give it a name that reflects the scope: `production-rag-review`, `customer-facing-chat`, `code-gen-qa`.
### Define the Rubric
The rubric is the scoring schema reviewers fill out. LangSmith supports three field types:
| Field type | Use for | Example |
|---|---|---|
| categorical | Multi-class labels | Thumbs up / down, quality tier |
| continuous | Numeric scores | 1–5 helpfulness rating |
| freeform | Open text | "What was wrong with this response?" |
A minimal rubric for a RAG chatbot:
```json
{
  "fields": [
    {
      "name": "overall_quality",
      "type": "categorical",
      "options": ["good", "acceptable", "bad"],
      "required": true
    },
    {
      "name": "groundedness",
      "type": "categorical",
      "options": ["grounded", "hallucinated", "partially_grounded"],
      "required": true
    },
    {
      "name": "reviewer_note",
      "type": "freeform",
      "required": false
    }
  ]
}
```
Keep rubrics short. Five fields or fewer per queue. Long rubrics cause reviewer fatigue and inconsistent scores.
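The UI won't flag schema mistakes until you try to save, so a quick local sanity check can save a round trip. This is an illustrative validator for the JSON shape above — `validate_rubric` is a hypothetical helper, not part of the LangSmith SDK:

```python
ALLOWED_TYPES = {"categorical", "continuous", "freeform"}

def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems found in a rubric definition (empty list = OK)."""
    problems = []
    fields = rubric.get("fields", [])
    if len(fields) > 5:
        problems.append("more than 5 fields invites reviewer fatigue")
    for f in fields:
        if f.get("type") not in ALLOWED_TYPES:
            problems.append(f"{f.get('name')}: unknown type {f.get('type')!r}")
        if f.get("type") == "categorical" and not f.get("options"):
            problems.append(f"{f.get('name')}: categorical field needs options")
    return problems
```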
## Step 2: Route Traces to the Queue
### Option A — Manual Routing via the UI
Open any trace in LangSmith → click Add to Queue → select your queue. Use this for ad hoc reviews or during setup testing.
### Option B — Filter Rules (Recommended for Production)
In the queue settings, define a filter that automatically routes matching traces. Useful filters:

```
# Route all traces where automated eval scored < 0.5
metadata.eval_score < 0.5

# Route traces with user thumbs-down feedback
feedback.user_rating == "negative"

# Route traces that took more than 10 seconds
latency_ms > 10000

# Route a 5% random sample for ongoing quality monitoring
random_sample == 0.05
```
Combine filters with AND/OR logic. The most practical production pattern is the union of two rules: low eval score OR negative user signal. That way you catch both automated failures and real user complaints.
### Option C — SDK Routing
Add traces programmatically when your app detects uncertainty:

```python
from langsmith import Client

client = Client()

def route_to_review(run_id: str, queue_id: str, reason: str) -> None:
    # Add the run to the annotation queue (queues are addressed by ID in the SDK)
    client.add_runs_to_annotation_queue(
        queue_id=queue_id,
        run_ids=[run_id],
    )
    # Attach the routing reason as feedback so the reviewer sees why it was flagged
    client.create_feedback(
        run_id=run_id,
        key="routing_reason",
        comment=f"Auto-routed: {reason}",
    )

# In your LLM chain — route when confidence is low
if confidence_score < 0.6:
    route_to_review(
        run_id=str(current_run.id),
        queue_id=REVIEW_QUEUE_ID,  # the queue's UUID, from the queue settings page
        reason=f"Low confidence: {confidence_score:.2f}",
    )
```
Get the current run ID inside a traced function using LangSmith's run tree helper:

```python
from langsmith.run_helpers import get_current_run_tree

# Inside a @traceable function or a custom chain node
run_tree = get_current_run_tree()
if run_tree:
    run_id = str(run_tree.id)
```
## Step 3: Configure Reviewer Access
Queues support two access levels:
- Project member — can see all queues in the project
- Queue-specific — can only see assigned queues (use for external reviewers or contractors)
Invite reviewers at Project Settings → Members. Assign them to specific queues from the queue settings panel.
For external subject-matter experts who shouldn't see your full trace history, use queue-specific access. They get a focused annotation view with no access to raw project data.
## Step 4: The Reviewer Workflow
Reviewers navigate to Annotation Queues and open their assigned queue. Each item shows:
- The full input (user message, retrieved context, system prompt)
- The LLM output
- Any metadata you attached to the trace
- The scoring rubric on the right panel
Keyboard shortcuts speed up high-volume review:
| Key | Action |
|---|---|
| → | Next item |
| ← | Previous item |
| S | Submit current annotation |
| E | Expand trace details |
Aim for reviewers to complete 30–50 items per hour on a focused rubric. If they're slower, the rubric is too complex or the task context is unclear — add a reviewer guide as queue description text.
## Step 5: Access Feedback Data
Annotations are stored as structured feedback on the run. Pull them via the SDK for analysis or fine-tuning datasets:
```python
from langsmith import Client
from datetime import datetime, timedelta

client = Client()

# Fetch runs annotated on "overall_quality" in the last 7 days
# (filter uses LangSmith's run query language)
runs = client.list_runs(
    project_name="your-project",
    filter='eq(feedback_key, "overall_quality")',
    start_time=datetime.now() - timedelta(days=7),
)

feedback_rows = []
for run in runs:
    for fb in client.list_feedback(run_ids=[str(run.id)]):
        feedback_rows.append({
            "run_id": str(run.id),
            "input": run.inputs,
            "output": run.outputs,
            "score_key": fb.key,
            "score_value": fb.value,  # categorical/freeform values; numeric ratings land in fb.score
            "comment": fb.comment,
        })

# feedback_rows is now ready for analysis or dataset export
```
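Once `feedback_rows` is populated, a small aggregation shows the label distribution per rubric field. This is plain Python over the row shape above — no SDK involved, and `label_distribution` is an illustrative helper name:

```python
from collections import Counter, defaultdict

def label_distribution(rows: list[dict]) -> dict[str, Counter]:
    """Count score values per rubric key across annotation rows."""
    dist: dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        dist[row["score_key"]][row["score_value"]] += 1
    return dict(dist)
```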
### Export as a Fine-Tuning Dataset
LangSmith can turn annotated runs into a dataset for fine-tuning or few-shot examples:

```python
# Collect rows with a "good" overall_quality annotation
good_rows = [
    row for row in feedback_rows
    if row["score_key"] == "overall_quality" and row["score_value"] == "good"
]

dataset = client.create_dataset(
    dataset_name="rag-good-examples-2026-q1",
    description="Human-verified good RAG responses from production",
)

# Add each run's input/output pair as a dataset example
for row in good_rows:
    client.create_example(
        inputs=row["input"],
        outputs=row["output"],
        dataset_id=dataset.id,
    )
```
This dataset can feed directly into LangSmith's evaluation harness or be exported as JSONL for fine-tuning with Unsloth or Axolotl.
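The JSONL export can be done with the standard library alone. This sketch assumes the `feedback_rows` shape from Step 5; `rows_to_jsonl` is a hypothetical helper, and the `{"input": ..., "output": ...}` record format is a placeholder you'd adapt to your fine-tuning tool's expected schema:

```python
import json

def rows_to_jsonl(rows: list[dict], path: str, good_value: str = "good") -> int:
    """Write human-verified good examples as JSONL; returns the number written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            if row.get("score_value") != good_value:
                continue
            f.write(json.dumps({"input": row["input"], "output": row["output"]}) + "\n")
            written += 1
    return written
```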
## Production Considerations
Review latency vs. volume tradeoff. Don't route everything. A 100k-run/day app sending 10% to annotation = 10k items/day. At 40 items/reviewer/hour, you need 250 reviewer-hours per day to keep up. Route 0.5–1% for quality sampling, plus auto-triggered items from low scores. 500–1000 items/day is manageable for a small team.
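The capacity arithmetic above is worth wiring into a quick check whenever you change routing rates. A one-line helper (illustrative, not from any SDK):

```python
def reviewer_hours_per_day(
    runs_per_day: int, routed_fraction: float, items_per_hour: int
) -> float:
    """Daily reviewer-hours needed to keep up with a given routing rate."""
    return (runs_per_day * routed_fraction) / items_per_hour
```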
Rubric drift. Reviewers interpret labels differently over time. Run calibration sessions every 4–6 weeks: have all reviewers score the same 20 items independently, then compare. Disagreement > 20% means your rubric needs clarification.
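Calibration disagreement can be scored as the fraction of shared items on which reviewers did not all agree. A minimal sketch, assuming each reviewer labeled the same items in the same order (`disagreement_rate` is a hypothetical helper):

```python
def disagreement_rate(labels_by_reviewer: dict[str, list[str]]) -> float:
    """Fraction of items on which reviewers did not unanimously agree."""
    # Transpose reviewer->labels into per-item label tuples
    per_item = list(zip(*labels_by_reviewer.values()))
    if not per_item:
        return 0.0
    disagreements = sum(1 for labels in per_item if len(set(labels)) > 1)
    return disagreements / len(per_item)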
Feedback freshness. Annotation queues are most valuable when you close the loop quickly. If annotated data sits unused for months, you're collecting signal without acting on it. Set a monthly cycle: collect → analyze → update evals or retrain → repeat.
Queue backlog. A growing backlog signals either too much routing or too few reviewers. Monitor queue depth weekly. If backlog grows > 3 days of throughput, tighten your routing filters first — don't just add reviewers.
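The 3-day backlog threshold translates directly into a monitoring check; queue depth and throughput are whatever your metrics pipeline reports, and both function names here are illustrative:

```python
def backlog_days(queue_depth: int, daily_throughput: int) -> float:
    """How many days of work the current backlog represents."""
    if daily_throughput <= 0:
        return float("inf")
    return queue_depth / daily_throughput

def should_tighten_filters(
    queue_depth: int, daily_throughput: int, threshold_days: float = 3.0
) -> bool:
    """True when backlog exceeds the threshold: tighten routing before adding reviewers."""
    return backlog_days(queue_depth, daily_throughput) > threshold_days
```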
## Summary
- Annotation queues turn ad hoc human review into a structured, repeatable process
- Define a short rubric (≤5 fields) focused on what automated evals miss
- Route with filter rules: low eval score + negative user signal covers 80% of the value
- Use SDK routing for confidence-based flagging inside your chain
- Pull feedback as structured data for analysis, fine-tuning datasets, or eval benchmarks
- Close the loop monthly — annotation data is only useful if you act on it
Tested on LangSmith 0.2.x SDK, Python 3.12, LangChain 0.3.x