Problem: DeepSeek R1 Thinks Differently — LangChain Needs Help
DeepSeek R1 wraps its reasoning in `<think>...</think>` tags before answering. Standard LangChain chains pass that raw output directly to the next step, which breaks parsers, tool calls, and retrieval chains.
You'll learn:
- How to call DeepSeek R1 via the OpenAI-compatible API in LangChain
- How to strip `<think>` blocks and surface clean final answers
- How to build a multi-step reasoning pipeline using LCEL
- How to preserve the reasoning trace for debugging and observability
Time: 20 min | Difficulty: Intermediate
Why R1's Output Breaks Standard Chains
DeepSeek R1 is a reasoning model. Before it commits to an answer, it generates an internal scratchpad:
```
<think>
The user wants to know the capital of France. Paris is the capital...
Let me confirm this is correct before answering.
</think>
The capital of France is Paris.
```
When you pipe this into a `StrOutputParser` or a downstream prompt, the `<think>` block pollutes the context. A summarization chain fed R1's raw output will summarize the reasoning trace, not the answer. A tool-calling chain will fail to parse the action JSON hidden after several paragraphs of thought.
The fix is a thin extraction layer between R1 and the rest of your chain.
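If all you need is the clean answer, that layer can be a single regex; a minimal sketch (`strip_think` is my name for it — Step 1 below builds a proper parser that also keeps the trace):

```python
import re

def strip_think(raw: str) -> str:
    """Drop the <think>...</think> scratchpad, keep only the final answer."""
    return re.sub(r"<think>.*?</think>", "", raw, count=1, flags=re.DOTALL).strip()

print(strip_think("<think>Paris is the capital...</think>\nThe capital of France is Paris."))
# → The capital of France is Paris.
```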
Setup
Install dependencies:
```bash
# uv (recommended) or pip
uv add langchain langchain-openai python-dotenv

# Or with pip (ideally inside a virtual environment)
pip install langchain langchain-openai python-dotenv
```
Set your API key. DeepSeek uses an OpenAI-compatible endpoint:
```bash
# .env
DEEPSEEK_API_KEY=sk-your-key-here
```
Verify the API is reachable:
```python
import os

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

load_dotenv()

llm = ChatOpenAI(
    model="deepseek-reasoner",
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com/v1",
    temperature=0,  # accepted but ignored by deepseek-reasoner (see Production Considerations)
)

response = llm.invoke("What is 17 * 23?")
print(response.content)
```
Expected output: a `<think>` block followed by `391`.
Solution
Step 1: Build the R1 Think-Strip Parser
Create a custom output parser that separates the reasoning trace from the final answer:
```python
import re
from dataclasses import dataclass

from langchain_core.output_parsers import BaseOutputParser


@dataclass
class R1Output:
    reasoning: str  # content inside <think>...</think>
    answer: str     # everything after the closing </think>


class R1OutputParser(BaseOutputParser[R1Output]):
    """Splits DeepSeek R1 output into a (reasoning, answer) pair."""

    def parse(self, text: str) -> R1Output:
        # Extract the think block, if present
        think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
        if think_match:
            reasoning = think_match.group(1).strip()
            # The answer is everything after </think>, stripped of whitespace
            answer = text[think_match.end():].strip()
        else:
            # Model responded without reasoning (e.g., a very short factual query)
            reasoning = ""
            answer = text.strip()
        return R1Output(reasoning=reasoning, answer=answer)

    @property
    def _type(self) -> str:
        return "r1_output_parser"
```
Test the parser standalone:
```python
raw = """<think>
Let me think through this step by step.
17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391
</think>
17 multiplied by 23 equals **391**."""

parser = R1OutputParser()
result = parser.parse(raw)
print(result.answer)     # "17 multiplied by 23 equals **391**."
print(result.reasoning)  # Full scratchpad
```
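One edge case the fallback branch doesn't cover: if generation is cut off before `</think>` closes (timeouts, `max_tokens`), the partial scratchpad would be returned as the answer and leak into downstream prompts. A hedged sketch of a hardened variant (`parse_r1_safe` is my name, not a LangChain API; it returns a plain tuple):

```python
import re

def parse_r1_safe(text: str) -> tuple[str, str]:
    """Variant of R1OutputParser.parse that won't leak a truncated scratchpad.

    Returns (reasoning, answer). If generation was cut off before </think>
    closed, the partial scratchpad is treated as reasoning and the answer
    stays empty.
    """
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if m:
        return m.group(1).strip(), text[m.end():].strip()
    if text.lstrip().startswith("<think>"):
        # Truncated generation: everything after <think> is scratchpad
        return text.split("<think>", 1)[1].strip(), ""
    return "", text.strip()

print(parse_r1_safe("<think>half-finished thought"))  # → ('half-finished thought', '')
```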
Step 2: Wire R1 into an LCEL Chain
Build a basic reasoning chain using LangChain Expression Language:
```python
import os

from dotenv import load_dotenv
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

load_dotenv()

llm = ChatOpenAI(
    model="deepseek-reasoner",
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com/v1",
    temperature=0,
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a precise analytical assistant. Reason carefully before answering."),
    ("human", "{question}"),
])

r1_parser = R1OutputParser()

# Chain: prompt → R1 → parse into R1Output
reasoning_chain = prompt | llm | (lambda msg: r1_parser.parse(msg.content))

result = reasoning_chain.invoke({
    "question": "A train travels 120 km in 90 minutes. What is its speed in km/h?"
})
print("Answer:", result.answer)
print("Reasoning trace:", result.reasoning[:200], "...")
```
Expected output:

```
Answer: The train's speed is 80 km/h.
Reasoning trace: Let me convert the time first. 90 minutes = 1.5 hours.
Speed = Distance / Time = 120 / 1.5 = 80 km/h...
```
Step 3: Build a Multi-Step Reasoning Pipeline
The real power comes from chaining R1's clean answer into downstream steps. Here's a research-and-summarize pipeline:
```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI as OpenAIChat

# Step 1: R1 reasons through the problem and returns a structured analysis
analysis_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert analyst.
Break down the problem, identify key factors, and provide a thorough analysis."""),
    ("human", "Analyze this problem: {problem}"),
])

# Step 2: A faster model synthesizes R1's analysis into an action plan.
# Using gpt-4o-mini here (reads OPENAI_API_KEY from the environment) —
# no deep reasoning is needed for synthesis.
synthesis_llm = OpenAIChat(
    model="gpt-4o-mini",
    temperature=0.3,
)

synthesis_prompt = ChatPromptTemplate.from_messages([
    ("system", "Turn this analysis into a concise 3-step action plan. Be direct."),
    ("human", "Analysis:\n{analysis}"),
])

# Extract just the answer text from R1Output for the next step
extract_answer = RunnableLambda(lambda r1out: {"analysis": r1out.answer})

pipeline = (
    analysis_prompt
    | llm
    | (lambda msg: r1_parser.parse(msg.content))
    | extract_answer
    | synthesis_prompt
    | synthesis_llm
    | StrOutputParser()
)

result = pipeline.invoke({
    "problem": "Our API response times increased 40% after migrating to a new database cluster."
})
print(result)
```
This pattern — R1 for deep reasoning, a smaller model for structured output — cuts cost while keeping the analytical quality of R1.
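The saving is easy to quantify. The helper below takes token counts and per-1K rates as inputs — nothing here is a quoted price; plug in current values from each provider's pricing page:

```python
def two_step_cost(step1_tokens: int, step2_tokens: int,
                  r1_rate: float, small_rate: float) -> dict:
    """Compare running both pipeline steps on R1 vs. R1 + a cheaper model.

    Rates are per-1K-token prices you supply; token counts are your
    measured step sizes. Purely illustrative arithmetic.
    """
    all_r1 = (step1_tokens + step2_tokens) / 1000 * r1_rate
    mixed = step1_tokens / 1000 * r1_rate + step2_tokens / 1000 * small_rate
    return {"all_r1": all_r1, "mixed": mixed, "saved": all_r1 - mixed}
```

For example, with 1,000 tokens per step and the small model at a tenth of R1's rate, the mixed pipeline costs roughly half as much as running R1 for both steps.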
Step 4: Preserve Reasoning Traces with RunnablePassthrough
When debugging or building observability into your pipeline, you want to keep both the reasoning trace and the final answer accessible:
```python
from langchain_core.runnables import RunnableLambda, RunnablePassthrough


def parse_r1_message(msg):
    """Parse AIMessage content into R1Output."""
    return r1_parser.parse(msg.content)


# Passthrough preserves the original input alongside R1's output
traced_chain = (
    RunnablePassthrough.assign(
        r1_result=(analysis_prompt | llm | RunnableLambda(parse_r1_message))
    )
    | RunnableLambda(lambda x: {
        "problem": x["problem"],
        "answer": x["r1_result"].answer,
        "reasoning_tokens": len(x["r1_result"].reasoning.split()),
    })
)

output = traced_chain.invoke({"problem": "Why might a Redis cache hit rate drop from 90% to 60%?"})
print(f"Answer: {output['answer']}")
print(f"Reasoning used ~{output['reasoning_tokens']} tokens")
```
This gives you a cost signal — R1 bills reasoning tokens separately. Logging `reasoning_tokens` per request lets you track where the model is spending compute. Note that the word count above is only a rough proxy for billed tokens.
Step 5: Add Streaming Support
R1 reasoning traces can be long. Stream the response to avoid timeout issues in web apps:
```python
import asyncio


async def stream_r1_reasoning(question: str):
    """Stream R1 output, tagging reasoning vs. answer chunks."""
    messages = [
        {"role": "system", "content": "Reason carefully before answering."},
        {"role": "user", "content": question},
    ]
    full_response = ""
    in_think_block = False
    async for chunk in llm.astream(messages):
        token = chunk.content
        full_response += token
        # Detect think-block boundaries for UI rendering.
        # Note: this assumes each tag arrives within a single chunk.
        if "<think>" in token:
            in_think_block = True
            print("[REASONING] ", end="", flush=True)
        elif "</think>" in token:
            in_think_block = False
            print("\n[ANSWER] ", end="", flush=True)
        else:
            print(token, end="", flush=True)
    return r1_parser.parse(full_response)


# Run it
result = asyncio.run(stream_r1_reasoning("Explain why gradient descent can get stuck in local minima."))
```
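The per-chunk tag check above assumes `<think>` and `</think>` each arrive inside a single chunk; with small streaming chunks a tag can be split across two. A buffer-based splitter handles that. This is a standalone sketch in plain Python (no LangChain APIs) — inside the `astream` loop you would call `splitter.feed(chunk.content)`:

```python
class ThinkSplitter:
    """Incrementally route streamed text into reasoning vs. answer.

    Holds back a small tail of the buffer so <think> / </think> tags split
    across chunk boundaries are still detected.
    """

    def __init__(self):
        self.buf = ""
        self.in_think = False
        self.reasoning = ""
        self.answer = ""

    def _emit(self, text: str) -> None:
        if self.in_think:
            self.reasoning += text
        else:
            self.answer += text

    def feed(self, chunk: str) -> None:
        self.buf += chunk
        while True:
            tag = "</think>" if self.in_think else "<think>"
            i = self.buf.find(tag)
            if i == -1:
                # Keep a tail that could be the start of a split tag
                keep = len(tag) - 1
                safe = self.buf[:-keep] if len(self.buf) > keep else ""
                self.buf = self.buf[len(safe):]
                self._emit(safe)
                return
            self._emit(self.buf[:i])
            self.buf = self.buf[i + len(tag):]
            self.in_think = not self.in_think

    def close(self) -> None:
        """Flush any held-back tail once the stream ends."""
        self._emit(self.buf)
        self.buf = ""


s = ThinkSplitter()
for piece in ["<thi", "nk>reason</th", "ink>answer"]:  # tags split across chunks
    s.feed(piece)
s.close()
print(s.reasoning, "|", s.answer)  # → reason | answer
```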
Verification
Run this end-to-end test to confirm the full pipeline works:
```python
def test_reasoning_pipeline():
    test_cases = [
        {
            "question": "If 5 workers complete a job in 8 days, how many days for 10 workers?",
            "expected_contains": "4",
        },
        {
            "question": "What is the time complexity of binary search?",
            "expected_contains": "O(log n)",
        },
    ]
    chain = prompt | llm | (lambda msg: r1_parser.parse(msg.content))
    for case in test_cases:
        result = chain.invoke({"question": case["question"]})
        assert case["expected_contains"] in result.answer, (
            f"Expected '{case['expected_contains']}' in answer.\nGot: {result.answer}"
        )
        assert len(result.reasoning) > 0, "R1 should produce reasoning for non-trivial questions"
        print(f"✅ PASS: {case['question'][:50]}...")


test_reasoning_pipeline()
```
You should see:
```
✅ PASS: If 5 workers complete a job in 8 days, how ma...
✅ PASS: What is the time complexity of binary search?...
```
Production Considerations
Reasoning token cost adds up fast. DeepSeek R1 bills reasoning tokens and output tokens separately. A complex problem can generate 2,000–10,000 reasoning tokens. Cache prompts where possible and avoid using R1 for queries that don't benefit from deep reasoning (lookup tasks, simple formatting, classification).
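One way to enforce that split is a router in front of the model call. The keyword heuristic below is purely illustrative — replace it with whatever signal (query length, a small classifier, user intent) fits your traffic — and `pick_model` is my name, not a LangChain helper:

```python
def pick_model(query: str) -> str:
    """Route cheap queries to deepseek-chat, hard ones to deepseek-reasoner.

    The heuristic (reasoning verbs, long queries) is a placeholder; tune it
    against your own query mix before trusting it.
    """
    reasoning_markers = ("why", "analyze", "compare", "prove", "debug", "plan")
    q = query.lower()
    if len(q.split()) > 40 or any(m in q for m in reasoning_markers):
        return "deepseek-reasoner"
    return "deepseek-chat"

print(pick_model("What is the capital of France?"))                 # → deepseek-chat
print(pick_model("Why did API latency regress after the migration?"))  # → deepseek-reasoner
```

In LCEL you could wrap this in a `RunnableLambda` that selects between two pre-built chains, one per model.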
Don't rely on temperature. The DeepSeek API docs list temperature among the sampling parameters that deepseek-reasoner does not support: setting it is accepted but has no effect on this model. Passing temperature=0 as in the code above is harmless, but don't expect it to make reasoning traces deterministic — they can vary between runs.
Don't use system prompts to restrict reasoning length. Instructions like "think briefly" or "be concise" applied to the system prompt degrade answer quality. R1 needs to reason to its natural stopping point. Apply length constraints to the final answer only, in the human turn.
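Concretely, that means the cap lives in the human message, not the system prompt. A small helper (`constrained_question` is my name for it) building OpenAI-style message dicts, which `llm.invoke(...)` accepts the same way the streaming example does:

```python
SYSTEM = "You are a precise analytical assistant. Reason carefully before answering."

def constrained_question(question: str, limit: str = "three sentences") -> list[dict]:
    """Build messages with the length cap on the final answer only.

    The system prompt never mentions brevity, so R1 can reason to its
    natural stopping point; only the final answer is constrained.
    """
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"{question}\n\nKeep the final answer under {limit}."},
    ]
```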
Plan for throttling. DeepSeek does not publish fixed per-tier rate limits, but responses can slow down or fail when the service is under load. For production use, implement exponential backoff:
```python
# Requires: uv add tenacity  (or: pip install tenacity)
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def invoke_with_retry(chain, input_dict):
    return chain.invoke(input_dict)
```
What You Learned
- DeepSeek R1 wraps reasoning in `<think>` tags — always parse before passing output downstream
- `R1OutputParser` cleanly separates the reasoning trace from the final answer
- LCEL chains compose R1 with cheaper models: use R1 for analysis, GPT-4o-mini for structured output
- `RunnablePassthrough.assign` lets you preserve traces alongside clean answers for observability
- Sampling parameters like temperature have no effect on deepseek-reasoner, and reasoning token costs need monitoring in production
When NOT to use this approach: For simple retrieval tasks, keyword classification, or short-form generation, R1 is overkill. The reasoning overhead adds latency (5–30 s depending on problem complexity) and cost. Use `deepseek-chat` for those cases.
Tested on LangChain 0.3.x, langchain-openai 0.2.x, DeepSeek R1 API (deepseek-reasoner), Python 3.12