# LangSmith vs Langfuse vs Helicone: TL;DR

| | LangSmith | Langfuse | Helicone |
|---|---|---|---|
| Primary focus | Tracing + evals for LangChain apps | Full observability + evals, framework-agnostic | Cost tracking + request proxy |
| Self-host | ✅ Paid plan only | ✅ Free (Docker / Kubernetes) | ✅ Free (Docker) |
| Evals | ✅ Built-in + human annotation | ✅ LLM-as-judge + human scoring | ⚠️ Basic only |
| LangChain native | ✅ First-party | ✅ Via SDK | ✅ Via proxy |
| Free tier | 5k traces/mo | Unlimited (self-hosted) | 100k requests/mo |
| Cloud pricing | From $39/mo | From $59/mo | From $20/mo |
| Best for | LangChain / LangGraph teams | Any LLM stack, self-host priority | Cost-conscious teams, simple proxy setup |
**Choose LangSmith if:** you're building with LangChain or LangGraph and want first-party tracing with no extra setup.

**Choose Langfuse if:** you want full observability plus evals on any framework, and prefer owning your data via self-hosting.

**Choose Helicone if:** you want fast request-level monitoring and cost tracking with minimal integration work.
## What We're Comparing
Shipping an LLM app without observability is flying blind. You can't debug why a chain failed, catch prompt regressions, or track which model version is burning your budget. In 2026, three tools dominate this space for developers: LangSmith, Langfuse, and Helicone. They overlap but solve different problems.
## LangSmith Overview
LangSmith is LangChain's official observability platform. It auto-instruments any LangChain or LangGraph application with near-zero configuration — wrap your existing chain and traces appear in the dashboard immediately.
Beyond tracing, LangSmith has the most mature eval workflow of the three. You can run automated evals, set up human annotation queues, and A/B test prompt versions against labeled datasets. If your team already runs LangChain in production, this is the native choice.
**Pros:**
- Zero-config tracing for LangChain and LangGraph — no manual span creation
- Best-in-class eval tooling with dataset management and annotation UI
- Prompt versioning and playground baked into the same platform
- Strong LangGraph debugging: visualise graph execution step by step
**Cons:**
- Self-hosting requires the Enterprise plan (no public Docker image for the full stack)
- Limited value if you're not on LangChain — SDK-only integration is more effort than competitors
- Free tier caps at 5,000 traces per month, which production apps exhaust quickly
- Pricing scales steeply for high-volume tracing
## Langfuse Overview
Langfuse is the framework-agnostic alternative. It works via a lightweight SDK (Python, TypeScript) or OpenAI-compatible proxy, and integrates with LangChain, LlamaIndex, OpenAI, Anthropic, Vercel AI SDK, and others. The self-hosted version is fully open-source under MIT, runs on Docker Compose in under five minutes, and has no feature differences from the cloud version.
Langfuse matches LangSmith on evals — it supports LLM-as-judge scoring, human annotation, and custom numeric metrics attached to any trace. It also has a prompt management UI and a generous free tier on the self-hosted path.
**Pros:**
- Fully open-source, self-host for free with zero feature restrictions
- Works with any LLM provider or framework via SDK or proxy
- LLM-as-judge evals with customisable scoring rubrics
- Active development pace — the GitHub repo ships multiple releases per month
- Scores and feedback can be piped back into datasets for continuous eval loops
**Cons:**
- No first-party LangGraph visualisation (traces show as flat spans for graph steps)
- UI is less polished than LangSmith for annotation workflows
- Self-hosting means you manage uptime, backups, and upgrades
- The Python SDK adds ~10–20ms latency to traces if not using async flushing
## Helicone Overview
Helicone takes a fundamentally different approach: it sits as a proxy in front of your LLM API calls. You change one base URL and all requests are logged, no SDK required. This makes it the fastest to integrate by far — under two minutes for an OpenAI app.
The tradeoff is depth. Helicone excels at request-level logging, cost breakdowns, latency percentiles, and rate limiting. It's weaker on complex chain tracing and has basic eval support compared to LangSmith or Langfuse.
**Pros:**
- Two-line integration: change `base_url` and you're done
- Best cost visibility — per-model, per-user, per-session cost dashboards out of the box
- Built-in rate limiting and caching at the proxy layer
- Free tier is generous at 100k requests per month
- Works with any provider that has an OpenAI-compatible API
**Cons:**
- Proxy architecture adds one extra network hop (~5–15ms p99 latency increase)
- No native multi-step chain or agent tracing — each LLM call logs independently
- Eval capabilities are minimal; no annotation queues or dataset management
- Self-hosting is available but the proxy architecture complicates on-prem deployments
## Head-to-Head: Key Dimensions

### Tracing Depth
For a simple chatbot making single LLM calls, all three tools give you what you need: latency, tokens, cost, and input/output logging.
For agents and multi-step chains, the gap opens up. LangSmith traces a LangGraph execution as a tree — you see each node, its inputs and outputs, the edges taken, and total latency per step. Langfuse shows nested spans that require manual instrumentation for non-LangChain frameworks. Helicone shows individual LLM calls without the orchestration context.
```python
# LangSmith — zero config for LangChain
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
# That's it. All LangChain calls are traced automatically.
```
```python
# Langfuse — explicit span creation for non-LangChain code
from langfuse import Langfuse

langfuse = Langfuse()

# Create a trace, then attach one span per pipeline step
trace = langfuse.trace(name="rag-pipeline")

retrieval = trace.span(name="retrieve")
docs = retriever.invoke(query)
retrieval.end(output={"doc_count": len(docs)})

generation = trace.span(name="generate")
result = llm.invoke(prompt)
generation.end(output=result)
```
```python
# Helicone — proxy, no spans needed (but no chain context either)
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-key"},
)
```
### Evals
LangSmith and Langfuse are roughly equivalent on evals. Both support:
- LLM-as-judge scoring (define a rubric, model scores each trace)
- Human annotation with custom label schemas
- Dataset creation from production traces
- Score tracking over time to catch regressions
Helicone has basic thumbs-up/thumbs-down feedback logging but no dataset management or automated eval pipelines.
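To make the LLM-as-judge idea concrete, here is a minimal sketch of the kind of rubric prompt both platforms let you define. The rubric wording, the 0-to-1 scale, and the `render_judge_prompt` helper are illustrative, not defaults from either product:

```python
# Illustrative LLM-as-judge rubric: the platform fills this template for each
# trace and asks a judge model to return a numeric score.
JUDGE_PROMPT = """Rate how relevant the answer is to the question, from 0 to 1.

Question: {question}
Answer: {answer}

Respond with only a number between 0 and 1."""


def render_judge_prompt(question: str, answer: str) -> str:
    """Fill the rubric template for one trace before sending it to the judge model."""
    return JUDGE_PROMPT.format(question=question, answer=answer)


prompt = render_judge_prompt(
    "What does RAG stand for?",
    "Retrieval-augmented generation: retrieved documents are injected into the prompt.",
)
```

The judge model's numeric reply is then attached to the trace as a score, which is exactly what the snippets below do programmatically.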
```python
# Langfuse — attach a score to a trace programmatically
langfuse.score(
    trace_id=trace.id,
    name="answer-relevance",
    value=0.87,
    comment="Retrieved docs matched query well",
)
```
```python
# LangSmith — run an eval suite against a dataset
from langsmith.evaluation import evaluate

results = evaluate(
    target=my_chain.invoke,
    data="my-eval-dataset",
    evaluators=["qa", "context-qa"],
    experiment_prefix="v2-prompt",
)
```
### Self-Hosting

| | LangSmith | Langfuse | Helicone |
|---|---|---|---|
| Docker Compose | ❌ Enterprise only | ✅ Free | ✅ Free |
| Kubernetes Helm | ❌ Enterprise only | ✅ Official chart | ✅ Community chart |
| Data ownership | Cloud only (free/pro) | Full | Full |
| Setup time | N/A (cloud) | ~5 min | ~5 min |
```bash
# Langfuse self-host — full stack in one command
git clone https://github.com/langfuse/langfuse
cd langfuse
docker compose up -d
```

```bash
# Helicone self-host
git clone https://github.com/Helicone/helicone
cd helicone/docker
docker compose up -d
```
### Cost
At 500k traces per month (a modest production app), approximate monthly costs:
| | LangSmith | Langfuse Cloud | Helicone |
|---|---|---|---|
| Cloud | ~$200–$400 | ~$150–$250 | ~$50–$100 |
| Self-hosted | Not available | Infra cost only | Infra cost only |
Helicone wins on cloud cost because it logs at the request level, not the span level. A 10-step LangGraph run counts as 1 request in Helicone vs 10+ spans in LangSmith or Langfuse.
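The arithmetic is worth making explicit. With illustrative volumes (the run and step counts below are assumptions, not vendor figures), span-level pricing multiplies billable units by the number of LLM calls per run:

```python
# Back-of-envelope billable-unit math: a proxy bills per request, while a
# tracer bills per span, so multi-step runs multiply under span-level pricing.
def billable_units(runs_per_month: int, llm_calls_per_run: int, span_level: bool) -> int:
    return runs_per_month * (llm_calls_per_run if span_level else 1)


runs, steps = 50_000, 10  # assumed: 50k agent runs/month, 10 LLM calls each

requests = billable_units(runs, steps, span_level=False)  # proxy: 50_000
spans = billable_units(runs, steps, span_level=True)      # tracer: 500_000
```

At these assumed volumes, the same workload generates ten times the billable units under span-level pricing, which is why the cloud cost gap widens as agent complexity grows.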
### Developer Experience
LangSmith has the best out-of-box experience for LangChain users — two environment variables and you're tracing. The UI for exploring LangGraph runs is genuinely good.
Langfuse has the steepest learning curve of the three, especially when manually instrumenting non-LangChain code. But once set up, the dashboard is clean and the eval workflow is well thought out.
Helicone is the fastest to integrate from a cold start on any project. The dashboard surfaces cost breakdowns immediately, which most developers appreciate in the first session.
## Which Should You Use?
**Pick LangSmith when:**
- Your stack is LangChain or LangGraph — the zero-config tracing is a genuine time saver
- You need mature annotation and eval tooling with a polished UI
- You're on a team that values a managed cloud service over self-hosting
**Pick Langfuse when:**
- You're using any LLM framework other than LangChain (OpenAI SDK, Vercel AI SDK, LlamaIndex, custom)
- Data sovereignty matters — you need traces in your own infrastructure
- You want full eval capabilities without paying for cloud at scale
**Pick Helicone when:**
- You want observability running in under five minutes with no SDK changes
- Cost tracking and budget controls are the primary need
- Your app makes simple LLM calls without complex agent orchestration
**Use LangSmith + Langfuse together when:** you run LangChain in production but want self-hosted backup storage or cross-project eval datasets — Langfuse accepts LangChain callbacks alongside its own SDK.
## FAQ
**Q: Does Langfuse work with LangChain?**
Yes. Langfuse provides a CallbackHandler that plugs directly into any LangChain chain or LangGraph graph. You get full span tracing without changing your chain logic — just pass the callback in when invoking.
**Q: Can Helicone trace multi-step agents?**
Not natively. Helicone logs each LLM API call as a separate request. You can group related calls using `Helicone-Session-Id` headers to reconstruct a session, but you won't get the hierarchical span view that LangSmith or Langfuse provide for agentic workflows.
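A sketch of that session-grouping pattern. The `Helicone-Session-Id` and `Helicone-Session-Name` header names follow Helicone's session documentation; the helper function itself is ours:

```python
import uuid


def helicone_session_headers(api_key: str, session_id: str, session_name: str) -> dict:
    # All calls sent with the same Helicone-Session-Id are grouped into one
    # session in the dashboard; the name is a human-readable label for it.
    return {
        "Helicone-Auth": f"Bearer {api_key}",
        "Helicone-Session-Id": session_id,
        "Helicone-Session-Name": session_name,
    }


session_id = str(uuid.uuid4())
headers = helicone_session_headers("your-key", session_id, "support-chat")
# Pass `headers` as default_headers (or per-request extra_headers) on the
# OpenAI client pointed at the Helicone proxy.
```

Reusing the same `session_id` across every call in a conversation is what lets you reconstruct it later, even though each request is still logged independently.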
**Q: Is LangSmith open source?**
No. LangSmith is a proprietary cloud product from LangChain Inc. The LangChain Python and JS SDKs are open source, but the LangSmith backend and UI are not publicly available for self-hosting on free or pro plans.
**Q: Which tool is best for a team shipping a RAG chatbot in 2026?**
If you built the RAG pipeline with LangChain or LlamaIndex and don't need self-hosting, LangSmith is the fastest path. If you used the OpenAI SDK directly or want to own your data, Langfuse is the stronger choice. Add Helicone if your primary concern is cost tracking per user — the proxy layer integrates alongside either.