MCP Sampling: How LLMs Call Other LLMs via Servers

MCP sampling lets AI servers request LLM completions through the host client. Learn how it works, when to use it, and how to implement it in 2026.

What Is MCP Sampling and Why It Changes Agent Architecture

Most MCP servers are passive — they expose tools and resources, then wait for a client to call them. Sampling flips this. With sampling, the server asks the client to run an LLM inference, then uses the result to continue its own logic.

This means a tool server can reason mid-execution without bundling its own model. The host client (Claude Desktop, Cursor, your custom app) provides the LLM. The server stays stateless and model-agnostic.

You'll learn:

  • Exactly how the MCP sampling request/response cycle works
  • When sampling is the right architecture versus tool chaining
  • How to implement a sampling-capable server in TypeScript
  • How to handle sampling in a custom MCP host client

Time: 20 min | Difficulty: Advanced


How MCP Sampling Works

In standard MCP flow, the client drives everything: it calls tools, reads resources, and runs prompts. Sampling introduces a reverse channel.

Client (host LLM)
    │
    │  1. User triggers agent task
    ▼
MCP Server (your tool)
    │
    │  2. Server sends sampling/createMessage request
    ▼
Client (host LLM again)
    │
    │  3. Client runs inference, returns completion
    ▼
MCP Server
    │
    │  4. Server uses result to continue logic
    ▼
Client  ◀── 5. Final tool response returned

The server never directly calls an LLM API. It delegates inference back to the host. This is intentional — it keeps the human in the loop and lets the client enforce its own model policies, rate limits, and safety filters.

The Sampling Request Object

When a server wants an LLM completion, it sends a sampling/createMessage request:

{
  "method": "sampling/createMessage",
  "params": {
    "messages": [
      {
        "role": "user",
        "content": {
          "type": "text",
          "text": "Summarize this error log in one sentence: ..."
        }
      }
    ],
    "modelPreferences": {
      "hints": [{ "name": "claude-sonnet" }],
      "costPriority": 0.3,
      "speedPriority": 0.8,
      "intelligencePriority": 0.5
    },
    "systemPrompt": "You are a concise log analyzer.",
    "maxTokens": 200
  }
}

The client decides which model to actually use. modelPreferences are hints, not requirements. A client running on Claude Haiku might ignore a hint for Opus if cost or latency policies apply.
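The spec doesn't prescribe how a client should weigh these priorities, so each host invents its own policy. One plausible approach is to score candidate models against the three weights — a sketch, where the model profiles are illustrative numbers, not published benchmarks:

```typescript
// Sketch of host-side model selection. modelPreferences are advisory, so this
// weighting scheme is one possible policy, not part of the MCP spec.
interface ModelProfile {
  name: string;
  cost: number;         // 1 = cheapest
  speed: number;        // 1 = fastest
  intelligence: number; // 1 = most capable
}

interface ModelPreferences {
  costPriority?: number;
  speedPriority?: number;
  intelligencePriority?: number;
}

const CANDIDATES: ModelProfile[] = [
  { name: "claude-haiku-4-5-20251001", cost: 0.9, speed: 0.9, intelligence: 0.4 },
  { name: "claude-sonnet-4-6", cost: 0.5, speed: 0.6, intelligence: 0.8 },
];

function pickModel(prefs: ModelPreferences): string {
  const score = (m: ModelProfile) =>
    (prefs.costPriority ?? 0) * m.cost +
    (prefs.speedPriority ?? 0) * m.speed +
    (prefs.intelligencePriority ?? 0) * m.intelligence;
  // Highest weighted score wins; ties keep the earlier candidate
  return CANDIDATES.reduce((best, m) => (score(m) > score(best) ? m : best)).name;
}
```

With the priorities from the request above (cost 0.3, speed 0.8, intelligence 0.5), this weighting favors the Haiku profile — consistent with the response example, where the client ran claude-haiku-4-5-20251001 despite the sonnet hint.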

The Sampling Response

{
  "role": "assistant",
  "content": {
    "type": "text",
    "text": "The service crashed due to a null pointer in the auth middleware at line 47."
  },
  "model": "claude-haiku-4-5-20251001",
  "stopReason": "endTurn"
}

The server gets back the completion, the model that ran it, and why generation stopped.
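For reference, here is the result shape as a simplified TypeScript type — a sketch inferred from the fields above (the SDK's exported CreateMessageResult type is authoritative). Checking stopReason lets the server detect completions that were cut off before parsing them:

```typescript
// Simplified sketch of the sampling result, based on the example above.
interface SamplingResult {
  role: "assistant";
  content: { type: "text"; text: string };
  model: string;
  stopReason?: "endTurn" | "maxTokens" | "stopSequence" | string;
}

// A maxTokens stop means the completion was truncated mid-thought — risky
// input for a JSON.parse step, so check before parsing structured output.
function wasTruncated(result: SamplingResult): boolean {
  return result.stopReason === "maxTokens";
}
```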


When to Use Sampling vs Tool Chaining

Sampling is not always the right call. Here's how to choose:

Scenario                                               | Sampling            | Tool chaining
Server needs to classify/interpret data mid-execution  | ✅                  |
Server needs structured JSON from unstructured text    | ✅                  |
Logic is fully deterministic (no LLM needed)           |                     | ✅
You control both client and server                     |                     | ✅
Server is distributed to users with different clients  | ✅ (model-agnostic) |
You need reproducible, testable output                 |                     | ✅

Use sampling when your server logic has a step that needs language understanding — classification, summarization, intent extraction — and you don't want to hardcode an LLM provider inside the server.

Avoid sampling when the logic is purely computational. Adding an LLM round-trip for something regex can solve is slower and less reliable.
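As a sketch of that principle, a server can handle well-formed inputs deterministically and reserve sampling for inputs that don't match. The pattern and severity mapping here are illustrative, not from the log-analyzer example below:

```typescript
// Sketch: classify the common case with a regex; return null to signal
// "unknown format — escalate to sampling" for everything else.
const SEVERITY_PATTERN = /\[(ERROR|WARN|INFO)\]/;
const SEVERITY_MAP: Record<string, string> = {
  ERROR: "critical",
  WARN: "warning",
  INFO: "info",
};

function classifyDeterministically(log: string): string | null {
  const match = SEVERITY_PATTERN.exec(log);
  if (!match) return null; // only here does an LLM round-trip earn its cost
  return SEVERITY_MAP[match[1]] ?? null;
}
```

The deterministic path is free, instant, and testable; sampling becomes the fallback rather than the default.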


Implementation: Sampling-Capable MCP Server

Install the MCP TypeScript SDK:

npm install @modelcontextprotocol/sdk

Step 1: Set Up the Server with Sampling Capability

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "log-analyzer", version: "1.0.0" },
  {
    capabilities: {
      tools: {}, // This server only exposes tools
    },
  }
);

Note that sampling is a client capability, not a server one: the client declares sampling support in its own capabilities during initialization, and the server simply issues sampling/createMessage requests over the established session. Before issuing one, check server.getClientCapabilities()?.sampling so a client that doesn't support sampling fails fast (or triggers a deterministic fallback) rather than failing mid-execution.
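Per the MCP spec, sampling support is advertised by the client during initialization, so the server-side check reduces to inspecting the negotiated capabilities object. A minimal sketch, with a simplified ClientCapabilities type (the SDK ships the real one):

```typescript
// Simplified shape of the client's declared capabilities — this sketch only
// models the sampling field; the SDK's ClientCapabilities type is authoritative.
interface ClientCapabilities {
  sampling?: object;
}

function clientSupportsSampling(caps: ClientCapabilities | undefined): boolean {
  // Presence of the (possibly empty) sampling object means the client will
  // accept sampling/createMessage requests.
  return caps?.sampling !== undefined;
}
```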

Step 2: Define a Tool That Triggers Sampling

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name !== "analyze-log") {
    throw new Error(`Unknown tool: ${request.params.name}`);
  }

  const rawLog = request.params.arguments?.log as string;
  if (!rawLog) throw new Error("Missing required argument: log");

  // Use sampling to get an LLM interpretation of the log
  const samplingResult = await server.createMessage({
    messages: [
      {
        role: "user",
        content: {
          type: "text",
          text: `Analyze this log entry and return a JSON object with fields:
- severity: "critical" | "warning" | "info"
- root_cause: string (one sentence)
- action_required: boolean

Log:
${rawLog}

Return only valid JSON. No markdown.`,
        },
      },
    ],
    modelPreferences: {
      // Prefer fast, cheap model — this is classification, not reasoning
      speedPriority: 0.9,
      costPriority: 0.8,
      intelligencePriority: 0.3,
    },
    systemPrompt: "You are a log analysis API. Return only valid JSON.",
    maxTokens: 300,
  });

  // Parse the LLM's structured response
  const content = samplingResult.content;
  if (content.type !== "text") {
    throw new Error("Expected text response from sampling");
  }

  let parsed: { severity: string; root_cause: string; action_required: boolean };
  try {
    parsed = JSON.parse(content.text);
  } catch {
    throw new Error(`Sampling returned invalid JSON: ${content.text}`);
  }

  return {
    content: [
      {
        type: "text",
        text: JSON.stringify({
          analysis: parsed,
          model_used: samplingResult.model,
          raw_log_length: rawLog.length,
        }, null, 2),
      },
    ],
  };
});
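The hard throw on invalid JSON is one policy; another is to tolerate cosmetic slips first. Models sometimes wrap their answer in a markdown fence despite a "no markdown" instruction, so a small extraction helper (hypothetical, not part of the SDK) keeps the parse step from failing the whole tool call over formatting:

```typescript
// Hypothetical helper: strip an optional ```json fence before parsing, so a
// cosmetic formatting slip doesn't abort an otherwise valid sampling result.
function extractJson(text: string): unknown {
  const trimmed = text.trim();
  const fenced = /^```(?:json)?\s*([\s\S]*?)\s*```$/.exec(trimmed);
  return JSON.parse(fenced ? fenced[1] : trimmed);
}
```

Substitute extractJson(content.text) for the bare JSON.parse call if you want this leniency; genuinely malformed JSON still throws and reaches the existing catch block.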

Step 3: Start the Server

async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("log-analyzer MCP server running");
}

main().catch(console.error);

Implementation: Handling Sampling in a Custom Host Client

If you're building a custom MCP host (not using Claude Desktop), you must implement the sampling handler. Without it, any server that calls sampling/createMessage will receive an error.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { CreateMessageRequestSchema } from "@modelcontextprotocol/sdk/types.js";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Declare sampling support so servers know they can issue createMessage requests
const client = new Client(
  { name: "custom-host", version: "1.0.0" },
  { capabilities: { sampling: {} } }
);

// Register the sampling handler before connecting
client.setRequestHandler(CreateMessageRequestSchema, async (request) => {
  const { messages, systemPrompt, maxTokens, modelPreferences } = request.params;

  // Model selection: respect hints if you can, fall back to your default
  const hintedModel = modelPreferences?.hints?.[0]?.name;
  const model = resolveModel(hintedModel); // your own routing logic

  const response = await anthropic.messages.create({
    model,
    max_tokens: maxTokens ?? 1024,
    system: systemPrompt,
    messages: messages.map((m) => ({
      role: m.role,
      content: typeof m.content === "string" ? m.content : m.content.text,
    })),
  });

  const textBlock = response.content.find((b) => b.type === "text");

  // Map Anthropic's snake_case stop reasons onto MCP's camelCase values
  const stopReasonMap: Record<string, string> = {
    end_turn: "endTurn",
    max_tokens: "maxTokens",
    stop_sequence: "stopSequence",
  };

  return {
    role: "assistant",
    content: {
      type: "text",
      text: textBlock?.text ?? "",
    },
    model: response.model,
    stopReason: stopReasonMap[response.stop_reason ?? ""] ?? "endTurn",
  };
});

function resolveModel(hint?: string): string {
  // Map hints to real model strings
  const modelMap: Record<string, string> = {
    "claude-sonnet": "claude-sonnet-4-6",
    "claude-haiku": "claude-haiku-4-5-20251001",
    "claude-opus": "claude-opus-4-6",
  };
  return hint && modelMap[hint] ? modelMap[hint] : "claude-haiku-4-5-20251001";
}

Key points:

  • The handler must be registered before client.connect()
  • You own model selection — treat modelPreferences as advisory
  • Return a valid CreateMessageResult or the server's sampling call throws

Production Considerations

Latency Compounds

Each sampling call adds a full LLM round-trip. A server that calls sampling three times in sequence adds 3× the latency of a single inference. Design sampling calls to be single-shot where possible — use structured output prompts that return everything you need in one response.

Sampling Loops Are a Risk

A server could theoretically call sampling, interpret the result, call sampling again, and loop indefinitely. Most clients impose a per-request sampling limit. If you're building a host client, enforce a max sampling depth (5 is a reasonable default).

// Simple global counter — reset it whenever a new top-level tool invocation
// starts, so the limit applies per invocation rather than per session
let samplingCallCount = 0;
const MAX_SAMPLING_CALLS = 5;

client.setRequestHandler(CreateMessageRequestSchema, async (request) => {
  samplingCallCount++;
  if (samplingCallCount > MAX_SAMPLING_CALLS) {
    throw new Error("Sampling depth limit exceeded");
  }
  // ... rest of handler
});

Human-in-the-Loop

The MCP spec explicitly allows clients to surface sampling requests to users before executing them. Claude Desktop does this by default for servers that aren't trusted. If your server handles sensitive data, document what sampling prompts it sends — users should be able to audit what their host LLM is seeing.
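A custom host can implement the same gate by fronting its sampling handler with an approval callback — a sketch, where approve is hypothetical (in a desktop host it might render a confirmation dialog; here it's just an async predicate):

```typescript
// Sketch: gate each sampling request behind user approval before running inference.
type Approver = (requestSummary: string) => Promise<boolean>;

async function gatedSampling<T>(
  requestSummary: string,
  approve: Approver,
  runInference: () => Promise<T>
): Promise<T> {
  // Surface a human-readable summary of what the server wants sampled
  if (!(await approve(requestSummary))) {
    throw new Error("Sampling request rejected by user");
  }
  return runInference();
}
```

The rejection propagates back to the server as a failed createMessage call, which is exactly the path the error-handling section below is designed to absorb.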

Error Handling in the Server

Sampling can fail — the client might timeout, reject the model hint, or return malformed JSON if you asked for structured output. Always wrap createMessage calls:

let samplingResult;
try {
  samplingResult = await server.createMessage({ ... });
} catch (err) {
  // Degrade gracefully — fall back to deterministic logic or return an error result
  const message = err instanceof Error ? err.message : String(err);
  return {
    content: [{ type: "text", text: `Analysis unavailable: ${message}` }],
    isError: true,
  };
}

Verification

Test your sampling server against Claude Desktop by adding it to claude_desktop_config.json:

{
  "mcpServers": {
    "log-analyzer": {
      "command": "node",
      "args": ["/path/to/your/server/dist/index.js"]
    }
  }
}

Then ask Claude to use the tool:

Analyze this log with the log-analyzer tool:
[ERROR] 2026-03-10T08:42:11Z auth-service: Cannot read property 'userId' of undefined at middleware/auth.js:47

You should see: Claude invoking analyze-log, the server issuing a sampling request back to Claude, and a structured JSON result returned to Claude's context.

Check that the sampling round-trip completes by logging from the tool handler, right after the createMessage call returns:

console.error(`[sampling] model=${samplingResult.model} tokens_used=~${content.text.length / 4}`);

What You Learned

  • Sampling inverts the normal MCP flow — the server requests inference from the client, not the other way around
  • modelPreferences are hints only; the client decides the actual model
  • Custom host clients must register a CreateMessageRequestSchema handler or sampling calls will fail
  • Sampling adds latency — design for single-shot structured output, not iterative back-and-forth
  • Production hosts should cap sampling depth to prevent runaway loops

When not to use sampling: If your server only needs structured data transformations, regex, or deterministic logic, skip sampling entirely. Reserve it for the steps that genuinely require language understanding.

Tested with @modelcontextprotocol/sdk 1.x, Claude Desktop 0.10+, and a custom Node.js host client on Ubuntu 24.04 and macOS Sequoia