# What Are Annotation Queues and Why They Matter in 2026
LLM evaluation has two sides: automated metrics (correctness scores, latency, token cost) and human judgment (does this answer actually help the user?). Automated evals are fast and cheap. Human evals are slow and expensive — but they catch what metrics miss.
LangSmith annotation queues bridge that gap. They let you route specific traces — bad runs, edge cases, low-confidence outputs — to a human reviewer inside a structured interface. Reviewers score, label, and comment. You get feedback stored as structured data, tied directly to the trace that triggered it.
In 2026, as LLM apps move into regulated or high-stakes domains, "we ran evals" is not enough. You need a paper trail of human oversight. Annotation queues are how you build it.
## How Annotation Queues Work
The mental model is simple: a queue is a filtered inbox for traces.
```
Your LLM app
      │
      ▼
LangSmith Tracing (all runs logged)
      │
      ├─── Automated evals run on every trace
      │
      └─── Filter rule matches → trace added to queue
                 │
                 ▼
      Human reviewer opens queue
        Scores trace on rubric
        Submits feedback
                 │
                 ▼
      Feedback stored as structured data
      Available via SDK / export
```
Three components make this work:
1. The queue itself — a named workspace with a defined rubric (what reviewers score) and assigned reviewers.
2. Routing rules — conditions that add traces to the queue automatically. You can also add traces manually or via the SDK.
3. The annotation UI — a focused interface showing input, output, and the scoring form. Reviewers never need to touch the raw trace data.
## Step 1: Create an Annotation Queue
Navigate to your LangSmith project → Annotation Queues → New Queue.
Give it a name that reflects the scope: `production-rag-review`, `customer-facing-chat`, `code-gen-qa`.
### Define the Rubric
The rubric is the scoring schema reviewers fill out. LangSmith supports three field types:
| Field type | Use for | Example |
|---|---|---|
| categorical | Multi-class labels | Thumbs up / down, quality tier |
| continuous | Numeric scores | 1–5 helpfulness rating |
| freeform | Open text | "What was wrong with this response?" |
A minimal rubric for a RAG chatbot:
```json
{
  "fields": [
    {
      "name": "overall_quality",
      "type": "categorical",
      "options": ["good", "acceptable", "bad"],
      "required": true
    },
    {
      "name": "groundedness",
      "type": "categorical",
      "options": ["grounded", "hallucinated", "partially_grounded"],
      "required": true
    },
    {
      "name": "reviewer_note",
      "type": "freeform",
      "required": false
    }
  ]
}
```
Keep rubrics short. Five fields or fewer per queue. Long rubrics cause reviewer fatigue and inconsistent scores.
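The UI won't flag schema mistakes until you try to save, so a quick local sanity check can save a round trip. This is an illustrative validator for the JSON shape above — `validate_rubric` is a hypothetical helper, not part of the LangSmith SDK:

```python
ALLOWED_TYPES = {"categorical", "continuous", "freeform"}

def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems found in a rubric definition (empty list = OK)."""
    problems = []
    fields = rubric.get("fields", [])
    if len(fields) > 5:
        problems.append("more than 5 fields invites reviewer fatigue")
    for f in fields:
        if f.get("type") not in ALLOWED_TYPES:
            problems.append(f"{f.get('name')}: unknown type {f.get('type')!r}")
        if f.get("type") == "categorical" and not f.get("options"):
            problems.append(f"{f.get('name')}: categorical field needs options")
    return problems
```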
## Step 2: Route Traces to the Queue
### Option A — Manual Routing via the UI
Open any trace in LangSmith → click Add to Queue → select your queue. Use this for ad hoc reviews or during setup testing.
### Option B — Filter Rules (Recommended for Production)
In the queue settings, define a filter that automatically routes matching traces. Useful filters:

```
# Route all traces where automated eval scored < 0.5
metadata.eval_score < 0.5

# Route traces with user thumbs-down feedback
feedback.user_rating == "negative"

# Route traces that took more than 10 seconds
latency_ms > 10000

# Route a 5% random sample for ongoing quality monitoring
random_sample == 0.05
```
Combine filters with AND/OR logic. The most practical production pattern is the union of two rules: low eval score OR negative user signal. That way you catch both automated failures and real user complaints.
### Option C — SDK Routing
Add traces programmatically when your app detects uncertainty:

```python
from langsmith import Client

client = Client()

def route_to_review(run_id: str, queue_id: str, reason: str) -> None:
    # Add the run to the annotation queue (queues are addressed by ID in the SDK)
    client.add_runs_to_annotation_queue(
        queue_id=queue_id,
        run_ids=[run_id],
    )
    # Attach the routing reason as feedback so the reviewer sees why it was flagged
    client.create_feedback(
        run_id=run_id,
        key="routing_reason",
        comment=f"Auto-routed: {reason}",
    )

# In your LLM chain — route when confidence is low
if confidence_score < 0.6:
    route_to_review(
        run_id=str(current_run.id),
        queue_id=REVIEW_QUEUE_ID,  # the queue's UUID, from the queue settings page
        reason=f"Low confidence: {confidence_score:.2f}",
    )
```
Get the current run ID inside a traced function using LangSmith's run tree helper:

```python
from langsmith.run_helpers import get_current_run_tree

# Inside a @traceable function or a custom chain node
run_tree = get_current_run_tree()
if run_tree:
    run_id = str(run_tree.id)
```
## Step 3: Configure Reviewer Access
Queues support two access levels:
- Project member — can see all queues in the project
- Queue-specific — can only see assigned queues (use for external reviewers or contractors)
Invite reviewers at Project Settings → Members. Assign them to specific queues from the queue settings panel.
For external subject-matter experts who shouldn't see your full trace history, use queue-specific access. They get a focused annotation view with no access to raw project data.
## Step 4: The Reviewer Workflow
Reviewers navigate to Annotation Queues and open their assigned queue. Each item shows:
- The full input (user message, retrieved context, system prompt)
- The LLM output
- Any metadata you attached to the trace
- The scoring rubric on the right panel
Keyboard shortcuts speed up high-volume review:
| Key | Action |
|---|---|
| → | Next item |
| ← | Previous item |
| S | Submit current annotation |
| E | Expand trace details |
Aim for reviewers to complete 30–50 items per hour on a focused rubric. If they're slower, the rubric is too complex or the task context is unclear — add a reviewer guide as queue description text.
## Step 5: Access Feedback Data
Annotations are stored as structured feedback on the run. Pull them via the SDK for analysis or fine-tuning datasets:
```python
from langsmith import Client
from datetime import datetime, timedelta

client = Client()

# Fetch runs annotated on "overall_quality" in the last 7 days
# (filter uses LangSmith's run query language)
runs = client.list_runs(
    project_name="your-project",
    filter='eq(feedback_key, "overall_quality")',
    start_time=datetime.now() - timedelta(days=7),
)

feedback_rows = []
for run in runs:
    for fb in client.list_feedback(run_ids=[str(run.id)]):
        feedback_rows.append({
            "run_id": str(run.id),
            "input": run.inputs,
            "output": run.outputs,
            "score_key": fb.key,
            "score_value": fb.value,  # categorical/freeform values; numeric ratings land in fb.score
            "comment": fb.comment,
        })

# feedback_rows is now ready for analysis or dataset export
```
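Once `feedback_rows` is populated, a small aggregation shows the label distribution per rubric field. This is plain Python over the row shape above — no SDK involved, and `label_distribution` is an illustrative helper name:

```python
from collections import Counter, defaultdict

def label_distribution(rows: list[dict]) -> dict[str, Counter]:
    """Count score values per rubric key across annotation rows."""
    dist: dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        dist[row["score_key"]][row["score_value"]] += 1
    return dict(dist)
```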
### Export as a Fine-Tuning Dataset
LangSmith can turn annotated runs into a dataset for fine-tuning or few-shot examples:

```python
# Collect rows with a "good" overall_quality annotation
good_rows = [
    row for row in feedback_rows
    if row["score_key"] == "overall_quality" and row["score_value"] == "good"
]

dataset = client.create_dataset(
    dataset_name="rag-good-examples-2026-q1",
    description="Human-verified good RAG responses from production",
)

# Add each run's input/output pair as a dataset example
for row in good_rows:
    client.create_example(
        inputs=row["input"],
        outputs=row["output"],
        dataset_id=dataset.id,
    )
```
This dataset can feed directly into LangSmith's evaluation harness or be exported as JSONL for fine-tuning with Unsloth or Axolotl.
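The JSONL export can be done with the standard library alone. This sketch assumes the `feedback_rows` shape from Step 5; `rows_to_jsonl` is a hypothetical helper, and the `{"input": ..., "output": ...}` record format is a placeholder you'd adapt to your fine-tuning tool's expected schema:

```python
import json

def rows_to_jsonl(rows: list[dict], path: str, good_value: str = "good") -> int:
    """Write human-verified good examples as JSONL; returns the number written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            if row.get("score_value") != good_value:
                continue
            f.write(json.dumps({"input": row["input"], "output": row["output"]}) + "\n")
            written += 1
    return written
```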
## Production Considerations
Review latency vs. volume tradeoff. Don't route everything. A 100k-run/day app sending 10% to annotation = 10k items/day. At 40 items/reviewer/hour, you need 250 reviewer-hours per day to keep up. Route 0.5–1% for quality sampling, plus auto-triggered items from low scores. 500–1000 items/day is manageable for a small team.
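The capacity arithmetic above is worth wiring into a quick check whenever you change routing rates. A one-line helper (illustrative, not from any SDK):

```python
def reviewer_hours_per_day(
    runs_per_day: int, routed_fraction: float, items_per_hour: int
) -> float:
    """Daily reviewer-hours needed to keep up with a given routing rate."""
    return (runs_per_day * routed_fraction) / items_per_hour
```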
Rubric drift. Reviewers interpret labels differently over time. Run calibration sessions every 4–6 weeks: have all reviewers score the same 20 items independently, then compare. Disagreement > 20% means your rubric needs clarification.
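Calibration disagreement can be scored as the fraction of shared items on which reviewers did not all agree. A minimal sketch, assuming each reviewer labeled the same items in the same order (`disagreement_rate` is a hypothetical helper):

```python
def disagreement_rate(labels_by_reviewer: dict[str, list[str]]) -> float:
    """Fraction of items on which reviewers did not unanimously agree."""
    # Transpose reviewer->labels into per-item label tuples
    per_item = list(zip(*labels_by_reviewer.values()))
    if not per_item:
        return 0.0
    disagreements = sum(1 for labels in per_item if len(set(labels)) > 1)
    return disagreements / len(per_item)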
Feedback freshness. Annotation queues are most valuable when you close the loop quickly. If annotated data sits unused for months, you're collecting signal without acting on it. Set a monthly cycle: collect → analyze → update evals or retrain → repeat.
Queue backlog. A growing backlog signals either too much routing or too few reviewers. Monitor queue depth weekly. If backlog grows > 3 days of throughput, tighten your routing filters first — don't just add reviewers.
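The 3-day backlog threshold translates directly into a monitoring check; queue depth and throughput are whatever your metrics pipeline reports, and both function names here are illustrative:

```python
def backlog_days(queue_depth: int, daily_throughput: int) -> float:
    """How many days of work the current backlog represents."""
    if daily_throughput <= 0:
        return float("inf")
    return queue_depth / daily_throughput

def should_tighten_filters(
    queue_depth: int, daily_throughput: int, threshold_days: float = 3.0
) -> bool:
    """True when backlog exceeds the threshold: tighten routing before adding reviewers."""
    return backlog_days(queue_depth, daily_throughput) > threshold_days
```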
## Summary
- Annotation queues turn ad hoc human review into a structured, repeatable process
- Define a short rubric (≤5 fields) focused on what automated evals miss
- Route with filter rules: low eval score + negative user signal covers 80% of the value
- Use SDK routing for confidence-based flagging inside your chain
- Pull feedback as structured data for analysis, fine-tuning datasets, or eval benchmarks
- Close the loop monthly — annotation data is only useful if you act on it
Tested on LangSmith 0.2.x SDK, Python 3.12, LangChain 0.3.x