What Is MCP Sampling and Why It Changes Agent Architecture
Most MCP servers are passive — they expose tools and resources, then wait for a client to call them. Sampling flips this. With sampling, the server asks the client to run an LLM inference, then uses the result to continue its own logic.
This means a tool server can reason mid-execution without bundling its own model. The host client (Claude Desktop, Cursor, your custom app) provides the LLM. The server stays stateless and model-agnostic.
You'll learn:
- Exactly how the MCP sampling request/response cycle works
- When sampling is the right architecture versus tool chaining
- How to implement a sampling-capable server in TypeScript
- How to handle sampling in a custom MCP host client
Time: 20 min | Difficulty: Advanced
How MCP Sampling Works
In standard MCP flow, the client drives everything: it calls tools, reads resources, and runs prompts. Sampling introduces a reverse channel.
```text
Client (host LLM)
   │
   │ 1. User triggers agent task
   ▼
MCP Server (your tool)
   │
   │ 2. Server sends sampling/createMessage request
   ▼
Client (host LLM again)
   │
   │ 3. Client runs inference, returns completion
   ▼
MCP Server
   │
   │ 4. Server uses result to continue logic
   ▼
Client ◀── 5. Final tool response returned
```
The server never directly calls an LLM API. It delegates inference back to the host. This is intentional — it keeps the human in the loop and lets the client enforce its own model policies, rate limits, and safety filters.
The Sampling Request Object
When a server wants an LLM completion, it sends a sampling/createMessage request:
```json
{
  "method": "sampling/createMessage",
  "params": {
    "messages": [
      {
        "role": "user",
        "content": {
          "type": "text",
          "text": "Summarize this error log in one sentence: ..."
        }
      }
    ],
    "modelPreferences": {
      "hints": [{ "name": "claude-sonnet" }],
      "costPriority": 0.3,
      "speedPriority": 0.8,
      "intelligencePriority": 0.5
    },
    "systemPrompt": "You are a concise log analyzer.",
    "maxTokens": 200
  }
}
```
The client decides which model to actually use. `modelPreferences` are hints, not requirements. A client running on Claude Haiku might ignore a hint for Opus if cost or latency policies apply.
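How a host weighs these hints is up to the implementation. A minimal sketch of one possible scoring scheme — the candidate table, attribute scores, and weighting are illustrative assumptions, not behavior defined by the spec:

```typescript
// Illustrative candidate table: relative cost/speed/intelligence on a 0-1 scale.
// The names and scores here are assumptions for the sketch, not real benchmarks.
interface Candidate {
  name: string;
  cost: number;         // 1 = cheapest
  speed: number;        // 1 = fastest
  intelligence: number; // 1 = most capable
}

interface ModelPreferences {
  costPriority?: number;
  speedPriority?: number;
  intelligencePriority?: number;
}

function pickModel(prefs: ModelPreferences, candidates: Candidate[]): string {
  // Weight each candidate's attributes by the server's priorities; take the max.
  const score = (c: Candidate) =>
    (prefs.costPriority ?? 0) * c.cost +
    (prefs.speedPriority ?? 0) * c.speed +
    (prefs.intelligencePriority ?? 0) * c.intelligence;
  return candidates.reduce((best, c) => (score(c) > score(best) ? c : best)).name;
}
```

Under this scheme, a server asking for high `speedPriority` and `costPriority` gets routed to a fast, cheap model even if a hint names a larger one.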
The Sampling Response
```json
{
  "role": "assistant",
  "content": {
    "type": "text",
    "text": "The service crashed due to a null pointer in the auth middleware at line 47."
  },
  "model": "claude-haiku-4-5-20251001",
  "stopReason": "endTurn"
}
```
The server gets back the completion, the model that ran it, and why generation stopped.
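In TypeScript terms, that response can be modeled roughly as follows — a sketch of the shape above, not the SDK's exact exported type:

```typescript
// Rough shape of a sampling result, inferred from the JSON response above.
type StopReason = "endTurn" | "maxTokens" | "stopSequence" | string;

interface SamplingResult {
  role: "assistant";
  content:
    | { type: "text"; text: string }
    | { type: "image"; data: string; mimeType: string };
  model: string;
  stopReason?: StopReason;
}

// Narrow to text content before using the completion in server logic.
function textOf(result: SamplingResult): string {
  if (result.content.type !== "text") {
    throw new Error(`Expected text content, got ${result.content.type}`);
  }
  return result.content.text;
}
```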
When to Use Sampling vs Tool Chaining
Sampling is not always the right call. Here's how to choose:
| Scenario | Use sampling | Use tool chaining |
|---|---|---|
| Server needs to classify/interpret data mid-execution | ✅ | ❌ |
| Server needs structured JSON from unstructured text | ✅ | ❌ |
| Logic is fully deterministic (no LLM needed) | ❌ | ✅ |
| You control both client and server | ✅ | ✅ |
| Server is distributed to users with different clients | ✅ (model-agnostic) | ❌ |
| You need reproducible, testable output | ❌ | ✅ |
Use sampling when your server logic has a step that needs language understanding — classification, summarization, intent extraction — and you don't want to hardcode an LLM provider inside the server.
Avoid sampling when the logic is purely computational. Adding an LLM round-trip for something regex can solve is slower and less reliable.
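For instance, pulling a severity level out of a conventionally formatted log line needs no LLM at all. A deterministic sketch — the `[LEVEL] timestamp message` format is an assumption about your logs:

```typescript
// Extract a severity token like [ERROR] or [WARN] from a log line — no LLM needed.
// Assumes the common "[LEVEL] timestamp message" convention.
function logSeverity(line: string): "critical" | "warning" | "info" {
  const m = line.match(/^\[(ERROR|FATAL|WARN|WARNING|INFO|DEBUG)\]/i);
  const level = m?.[1]?.toUpperCase() ?? "INFO";
  if (level === "ERROR" || level === "FATAL") return "critical";
  if (level.startsWith("WARN")) return "warning";
  return "info";
}
```

Reserve the sampling round-trip for fields a regex cannot produce, like a one-sentence root-cause summary.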
Implementation: Sampling-Capable MCP Server
Install the MCP TypeScript SDK:
```bash
npm install @modelcontextprotocol/sdk
```
Step 1: Set Up the Server with Sampling Capability
```typescript
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "log-analyzer", version: "1.0.0" },
  {
    capabilities: {
      tools: {},
    },
  }
);
```
Note that sampling is a client capability, not a server one: a sampling-capable client declares `sampling: {}` in its capabilities during initialization. The server can inspect those declared capabilities (via `server.getClientCapabilities()` in the TypeScript SDK) and fail fast if the connected client doesn't support sampling, rather than erroring mid-execution when it issues a `sampling/createMessage` request.
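Before issuing a sampling request at runtime, the server can guard on the capabilities the connected client declared at initialization. A minimal sketch, assuming the capabilities object exposes a `sampling` key:

```typescript
// Shape of the relevant slice of client capabilities (an assumption for this sketch).
interface ClientCaps {
  sampling?: Record<string, unknown>;
}

// Throw early instead of failing mid-execution inside a tool handler.
function assertSamplingSupport(caps: ClientCaps | undefined): void {
  if (!caps?.sampling) {
    throw new Error("Connected client does not support sampling/createMessage");
  }
}
```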
Step 2: Define a Tool That Triggers Sampling
```typescript
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name !== "analyze-log") {
    throw new Error(`Unknown tool: ${request.params.name}`);
  }

  const rawLog = request.params.arguments?.log as string;
  if (!rawLog) throw new Error("Missing required argument: log");

  // Use sampling to get an LLM interpretation of the log
  const samplingResult = await server.createMessage({
    messages: [
      {
        role: "user",
        content: {
          type: "text",
          text: `Analyze this log entry and return a JSON object with fields:
- severity: "critical" | "warning" | "info"
- root_cause: string (one sentence)
- action_required: boolean

Log:
${rawLog}

Return only valid JSON. No markdown.`,
        },
      },
    ],
    modelPreferences: {
      // Prefer fast, cheap model — this is classification, not reasoning
      speedPriority: 0.9,
      costPriority: 0.8,
      intelligencePriority: 0.3,
    },
    systemPrompt: "You are a log analysis API. Return only valid JSON.",
    maxTokens: 300,
  });

  // Parse the LLM's structured response
  const content = samplingResult.content;
  if (content.type !== "text") {
    throw new Error("Expected text response from sampling");
  }

  let parsed: { severity: string; root_cause: string; action_required: boolean };
  try {
    parsed = JSON.parse(content.text);
  } catch {
    throw new Error(`Sampling returned invalid JSON: ${content.text}`);
  }

  return {
    content: [
      {
        type: "text",
        text: JSON.stringify(
          {
            analysis: parsed,
            model_used: samplingResult.model,
            raw_log_length: rawLog.length,
          },
          null,
          2
        ),
      },
    ],
  };
});
```
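Models sometimes wrap JSON in markdown fences despite a "no markdown" instruction. A small, hypothetical helper can make the parse step above more forgiving:

```typescript
// Strip optional ```json fences before parsing — a defensive wrapper around JSON.parse.
function parseLenientJson<T>(raw: string): T {
  const stripped = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "")
    .replace(/\s*```$/, "");
  return JSON.parse(stripped) as T;
}
```

Swapping `JSON.parse(content.text)` for `parseLenientJson(content.text)` keeps the tool working when the model is almost, but not quite, compliant.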
Step 3: Start the Server
```typescript
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("log-analyzer MCP server running");
}

main().catch(console.error);
```
Implementation: Handling Sampling in a Custom Host Client
If you're building a custom MCP host (not using Claude Desktop), you must implement the sampling handler. Without it, any server that calls sampling/createMessage will receive an error.
```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { CreateMessageRequestSchema } from "@modelcontextprotocol/sdk/types.js";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Declare the sampling capability so servers know they can request inference
const client = new Client(
  { name: "custom-host", version: "1.0.0" },
  { capabilities: { sampling: {} } }
);

// Register the sampling handler before connecting
client.setRequestHandler(CreateMessageRequestSchema, async (request) => {
  const { messages, systemPrompt, maxTokens, modelPreferences } = request.params;

  // Model selection: respect hints if you can, fall back to your default
  const hintedModel = modelPreferences?.hints?.[0]?.name;
  const model = resolveModel(hintedModel); // your own routing logic

  const response = await anthropic.messages.create({
    model,
    max_tokens: maxTokens ?? 1024,
    system: systemPrompt,
    messages: messages.map((m) => ({
      role: m.role,
      // MCP sampling messages carry typed content; handle text only here
      content: m.content.type === "text" ? m.content.text : "",
    })),
  });

  const textBlock = response.content.find((b) => b.type === "text");

  // Anthropic reports snake_case stop reasons; MCP expects camelCase
  const stopReasonMap: Record<string, string> = {
    end_turn: "endTurn",
    max_tokens: "maxTokens",
    stop_sequence: "stopSequence",
  };

  return {
    role: "assistant",
    content: {
      type: "text",
      text: textBlock?.text ?? "",
    },
    model: response.model,
    stopReason: stopReasonMap[response.stop_reason ?? ""] ?? "endTurn",
  };
});

function resolveModel(hint?: string): string {
  // Map hints to real model strings
  const modelMap: Record<string, string> = {
    "claude-sonnet": "claude-sonnet-4-6",
    "claude-haiku": "claude-haiku-4-5-20251001",
    "claude-opus": "claude-opus-4-6",
  };
  return hint && modelMap[hint] ? modelMap[hint] : "claude-haiku-4-5-20251001";
}
```
Key points:

- The handler must be registered before `client.connect()`
- You own model selection — treat `modelPreferences` as advisory
- Return a valid `CreateMessageResult` or the server's sampling call throws
Production Considerations
Latency Compounds
Each sampling call adds a full LLM round-trip. A server that calls sampling three times in sequence adds 3× the latency of a single inference. Design sampling calls to be single-shot where possible — use structured output prompts that return everything you need in one response.
Sampling Loops Are a Risk
A server could theoretically call sampling, interpret the result, call sampling again, and loop indefinitely. Most clients impose a per-request sampling limit. If you're building a host client, enforce a max sampling depth (5 is a reasonable default).
```typescript
// Track sampling calls for the session; a real host should reset this
// counter per tool invocation rather than keeping one module-level total
let samplingCallCount = 0;
const MAX_SAMPLING_CALLS = 5;

client.setRequestHandler(CreateMessageRequestSchema, async (request) => {
  samplingCallCount++;
  if (samplingCallCount > MAX_SAMPLING_CALLS) {
    throw new Error("Sampling depth limit exceeded");
  }
  // ... rest of handler
});
```
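A module-level counter is shared across every tool invocation on the connection, so one long-running session can exhaust the limit for later calls. A per-invocation budget object avoids that; a sketch, where the class and its API are illustrative rather than part of the SDK:

```typescript
// A small budget object, created fresh per tool invocation, so one runaway
// tool call cannot exhaust the sampling allowance of later ones.
class SamplingBudget {
  private used = 0;
  constructor(private readonly max = 5) {}

  // Call before each sampling request; throws once the budget is spent.
  spend(): void {
    if (++this.used > this.max) {
      throw new Error(`Sampling depth limit exceeded (max ${this.max})`);
    }
  }

  get remaining(): number {
    return Math.max(0, this.max - this.used);
  }
}
```

The host would construct one `SamplingBudget` when a tool call begins and call `spend()` inside the sampling handler for requests attributed to that invocation.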
Human-in-the-Loop
The MCP spec explicitly allows clients to surface sampling requests to users before executing them. Claude Desktop does this by default for servers that aren't trusted. If your server handles sensitive data, document what sampling prompts it sends — users should be able to audit what their host LLM is seeing.
Error Handling in the Server
Sampling can fail — the client might timeout, reject the model hint, or return malformed JSON if you asked for structured output. Always wrap createMessage calls:
```typescript
let samplingResult;
try {
  samplingResult = await server.createMessage({ /* ... */ });
} catch (err) {
  // Degrade gracefully — fall back to deterministic logic or return an error result
  const message = err instanceof Error ? err.message : String(err);
  return {
    content: [{ type: "text", text: `Analysis unavailable: ${message}` }],
    isError: true,
  };
}
```
Verification
Test your sampling server against Claude Desktop by adding it to claude_desktop_config.json:
```json
{
  "mcpServers": {
    "log-analyzer": {
      "command": "node",
      "args": ["/path/to/your/server/dist/index.js"]
    }
  }
}
```
Then ask Claude to use the tool:
```text
Analyze this log with the log-analyzer tool:
[ERROR] 2026-03-10T08:42:11Z auth-service: Cannot read property 'userId' of undefined at middleware/auth.js:47
```
You should see: Claude invoking analyze-log, the server issuing a sampling request back to Claude, and a structured JSON result returned to Claude's context.
Check that the sampling round-trip completes by logging in the server's tool handler, right after `createMessage` returns:

```typescript
console.error(`[sampling] model=${samplingResult.model} tokens_used=~${content.text.length / 4}`);
```
What You Learned
- Sampling inverts the normal MCP flow — the server requests inference from the client, not the other way around
- `modelPreferences` are hints only; the client decides the actual model
- Custom host clients must register a `CreateMessageRequestSchema` handler or sampling calls will fail
- Sampling adds latency — design for single-shot structured output, not iterative back-and-forth
- Production hosts should cap sampling depth to prevent runaway loops
When not to use sampling: If your server only needs structured data transformations, regex, or deterministic logic, skip sampling entirely. Reserve it for the steps that genuinely require language understanding.
Tested with @modelcontextprotocol/sdk 1.x, Claude Desktop 0.10+, and a custom Node.js host client on Ubuntu 24.04 and macOS Sequoia