DeepSeek R1 Chain-of-Thought: How the Reasoning Works

How DeepSeek R1 generates chain-of-thought reasoning tokens, why it outperforms on math and code, and how to use the thinking trace in your app.

What Is DeepSeek R1 and Why Reasoning Matters

DeepSeek R1 is a 671B-parameter mixture-of-experts model (roughly 37B parameters active per token) trained to reason before it answers. Unlike a standard LLM that maps input → output in one pass, R1 emits a private chain-of-thought — a scratchpad of intermediate reasoning steps — before producing a final response. That scratchpad is what makes it competitive with OpenAI's o1 on math and coding benchmarks.

Understanding how the reasoning works lets you use R1 correctly: know when to expose the thinking trace, when to suppress it, and why the model behaves differently from GPT-4o or Claude on hard problems.


How Chain-of-Thought Is Trained Into R1

DeepSeek R1 didn't learn reasoning purely from human demonstrations. The team used a two-phase process.

Phase 1: Pure Reinforcement Learning (R1-Zero)

The first version, R1-Zero, was trained with Group Relative Policy Optimization (GRPO) using only outcome-based rewards — no step-by-step reasoning labels. The model received:

  • +1 for a correct final answer (verified against a ground truth checker)
  • 0 for an incorrect answer

No human ever labeled how to reason. The model discovered that generating intermediate steps before committing to an answer improved its reward signal. Chain-of-thought emerged from the reward structure, not from imitation.

This is significant. It means reasoning is a learned policy, not a retrieval pattern. The model figured out that "thinking out loud" was instrumentally useful for getting correct answers.
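
The outcome-reward setup can be sketched in a few lines. This is a simplified illustration of the GRPO idea, not DeepSeek's training code — the function names are mine, and the real R1-Zero reward also included a format term for the thinking tags:

```python
import statistics

def outcome_reward(model_answer: str, ground_truth: str) -> float:
    """+1 for a correct final answer, 0 otherwise -- no credit for steps."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each sampled completion is scored
    relative to its own group, so GRPO needs no separate value network."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]

# Four completions sampled for one prompt; two reached the right answer.
rewards = [outcome_reward(a, "610") for a in ["610", "987", "610", "377"]]
advantages = grpo_advantages(rewards)  # correct answers get positive advantage
```

Completions that reach the right answer get a positive advantage, pushing the policy toward whatever intermediate text preceded them — which is how step-by-step reasoning gets reinforced without ever being labeled.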

Phase 2: Supervised Fine-Tuning + RLHF (R1)

R1-Zero had strong reasoning but produced awkward output — mixed languages, inconsistent formatting, sometimes incoherent final answers. The production R1 model adds further training stages on top:

  1. Cold-start SFT: A small set of high-quality (reasoning trace, answer) pairs primes the model to produce well-formatted thinking tokens in a consistent style before large-scale RL resumes.
  2. Preference alignment: A final RL stage combines the rule-based reasoning rewards with human preference data, training the model to be helpful, harmless, and readable — without degrading the reasoning ability built in phase 1.

The result is a model that reasons reliably and outputs coherently.


The Thinking Token Format

When you call R1 through DeepSeek's API, the reasoning trace appears between <think> and </think> tags, before the final answer.

<think>
The user wants the 15th Fibonacci number.
F(1)=1, F(2)=1, F(3)=2, ...
Let me compute step by step.
F(10)=55, F(11)=89, F(12)=144, F(13)=233, F(14)=377, F(15)=610.
</think>

The 15th Fibonacci number is **610**.

The content inside <think> is the model talking to itself. It:

  • Plans approaches before committing
  • Backtracks when an approach fails ("wait, that's wrong because...")
  • Verifies intermediate results
  • Explores edge cases before writing the final answer

This is not a post-hoc explanation. It is generated before the final answer token, and the final answer conditions on everything in the thinking block.
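
If you work with raw completion text — for example from a local distilled model, which emits the tags literally — the trace can be split off with a small helper. A minimal sketch; the function name is mine, and it assumes at most one leading <think> block:

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Split a raw completion into (thinking trace, final answer).
    Assumes at most one <think>...</think> block before the answer."""
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not match:
        return "", raw.strip()  # no trace: everything is the answer
    thinking = match.group(1).strip()
    answer = raw[match.end():].strip()
    return thinking, answer

raw = "<think>\nF(15) = 610.\n</think>\n\nThe 15th Fibonacci number is **610**."
thinking, answer = split_reasoning(raw)
```

The DeepSeek API does this split for you (see reasoning_content below); the helper matters only for raw local output.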


What's Inside a Real Reasoning Trace

A non-trivial reasoning trace follows a recognizable structure, though it's not enforced — it emerges from training.

Component A: Problem decomposition
"The user is asking for X. To solve X I need Y and Z."

Component B: Approach selection
"I could use method 1, but method 1 fails when... method 2 handles that."

Component C: Step execution
"Step 1: ... result is A. Step 2: using A, compute B. ..."

Component D: Self-verification
"Check: does B satisfy the original constraint? Yes/No → correct/backtrack."

Component E: Answer synthesis
"Therefore the answer is..."

On easy prompts, the trace is short and skips components. On hard math or multi-step code problems, traces run to thousands of tokens and include multiple backtrack loops.
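
One rough way to quantify those backtrack loops is to scan a trace for the phrases that typically open them. The marker list below is a hypothetical heuristic of mine, not anything DeepSeek defines:

```python
# Hypothetical heuristic: phrases that typically open a backtrack step.
BACKTRACK_MARKERS = ("wait,", "actually,", "that's wrong", "let me reconsider")

def count_backtracks(trace: str) -> int:
    """Count backtracking markers in a reasoning trace (case-insensitive)."""
    lowered = trace.lower()
    return sum(lowered.count(marker) for marker in BACKTRACK_MARKERS)

trace = (
    "I'll try substitution. Wait, that's wrong because x can be zero. "
    "Let me reconsider and use elimination instead."
)
```

Counting markers like this is a cheap proxy for problem difficulty: traces with several backtracks usually correspond to prompts near the edge of the model's ability.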


Why This Outperforms Standard Prompting

Standard LLM (no explicit reasoning)

Token stream: [prompt tokens] → [answer tokens]

The model has one forward pass to integrate all the information and produce the answer. On problems that require sequential logical steps, the intermediate state must be encoded implicitly in the residual stream — there's no working memory.

R1 with chain-of-thought

Token stream: [prompt tokens] → [<think> reasoning tokens </think>] → [answer tokens]

The reasoning tokens act as external working memory. Each reasoning token is in the context window and can be attended to by later tokens. The model can reference a result it computed 200 tokens ago without relying on the residual stream to carry it.

This is why R1 scores significantly higher on:

  • AIME 2024 (math competition): 79.8% vs GPT-4o's 9.3%
  • Codeforces (competitive programming): 2029 Elo rating
  • MATH-500: 97.3% pass@1

The gain comes from the architecture of the token stream, not from a larger model. The 7B distilled version of R1 outperforms GPT-4o on several benchmarks for the same reason.


Using R1's Reasoning in Your Application

Option 1: Full trace (debugging, educational tools)

Expose the <think> block to users when the reasoning process itself is the product — tutoring apps, math solvers, code explainers.

import openai  # DeepSeek API is OpenAI-compatible

client = openai.OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1 model endpoint
    messages=[{"role": "user", "content": "Solve: 3x + 7 = 22"}]
)

# reasoning_content contains the <think> block
reasoning = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content

print("Reasoning:", reasoning)
print("Answer:", answer)

Option 2: Suppress the trace (production APIs, chatbots)

When you only want the final answer, strip or ignore the thinking content. The final answer in content is already conditioned on the full trace — you get the reasoning benefit without exposing it.

# Just use content, ignore reasoning_content
answer = response.choices[0].message.content

Do not inject the reasoning trace back into the next message in a multi-turn conversation. The model was not trained to treat its own previous thinking as conversation history, and DeepSeek's API returns an error if reasoning_content appears in the input messages — at best it adds tokens without improving quality.
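
The safe multi-turn pattern is to append only content to the history. In this sketch, SimpleNamespace stands in for the API's message object so the example runs offline; with a live client you would pass response.choices[0].message:

```python
from types import SimpleNamespace

def append_assistant_turn(messages: list[dict], message) -> list[dict]:
    """Append the assistant's reply to history, keeping only the final
    answer. reasoning_content is deliberately dropped."""
    messages.append({"role": "assistant", "content": message.content})
    return messages

# Stand-in for response.choices[0].message from a live API call.
reply = SimpleNamespace(content="x = 5", reasoning_content="3x = 15, so x = 5.")

history = [{"role": "user", "content": "Solve: 3x + 7 = 22"}]
history = append_assistant_turn(history, reply)
history.append({"role": "user", "content": "Now solve 2x - 4 = 10."})
```

The model re-reasons from scratch on each turn; the previous answer is all the context it needs.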

Option 3: Structured output after reasoning

R1 can reason through a problem and then output structured JSON. Force JSON only in the answer phase, not during thinking.

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[
        {
            "role": "user",
            "content": (
                "Analyze this bug report and classify it. "
                "Respond ONLY with JSON: {\"severity\": \"low|medium|high\", "
                "\"category\": \"...\", \"root_cause\": \"...\"}\n\n"
                "Bug: Application crashes when uploading files > 10MB on Safari."
            )
        }
    ]
)

import json
result = json.loads(response.choices[0].message.content)

R1 will reason through the classification in <think>, then emit clean JSON in content.
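
In practice, distilled and local variants sometimes wrap the object in markdown fences or add a stray sentence around it. A defensive parse — my own helper, not part of the API — is cheap insurance:

```python
import json
import re

def parse_json_answer(content: str) -> dict:
    """Parse a JSON answer, tolerating ```json fences or stray prose
    around the object. Raises ValueError if no JSON object is found."""
    match = re.search(r"\{.*\}", content, flags=re.DOTALL)
    if not match:
        raise ValueError("no JSON object in model output")
    return json.loads(match.group(0))

fenced = '```json\n{"severity": "high", "category": "upload"}\n```'
result = parse_json_answer(fenced)
```

The greedy regex grabs from the first brace to the last, which handles nested objects; it assumes a single JSON object per response.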


When Not to Use R1

R1's reasoning tokens cost latency and money. It's not the right tool for every task.

Avoid R1 for:

  • Simple retrieval or summarization — no reasoning needed, you're paying for thinking tokens that add nothing
  • Real-time chat where response latency matters — the first answer token is delayed by the full thinking pass
  • Creative writing — the reward model wasn't optimized for prose quality; Claude or GPT-4o produce better creative output
  • Tasks requiring very long context recall — the thinking trace consumes context window budget

Use R1 for:

  • Multi-step math, proofs, or verification
  • Code generation where correctness matters more than speed
  • Complex reasoning chains: legal analysis, diagnostic workflows, planning
  • Any task where you'd otherwise prompt with "think step by step" — R1 does this natively
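
One practical consequence: route by task. The sketch below is a hypothetical keyword heuristic — deepseek-chat is DeepSeek's non-reasoning endpoint — and a real router would use a classifier, but the shape is the same:

```python
# Hypothetical routing heuristic: send only reasoning-heavy prompts to R1.
REASONING_HINTS = ("prove", "solve", "debug", "step by step", "derive")

def pick_model(prompt: str) -> str:
    """Choose between the reasoning and non-reasoning endpoints."""
    lowered = prompt.lower()
    if any(hint in lowered for hint in REASONING_HINTS):
        return "deepseek-reasoner"  # R1: slower, bills thinking tokens
    return "deepseek-chat"          # cheaper non-reasoning model
```

Even a crude router like this keeps summarization and chit-chat off the expensive reasoning path.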

The Distilled Models

DeepSeek released six distilled versions of R1, trained by fine-tuning open-source base models on R1's reasoning traces:

Model                            Base       Parameters  Best use
DeepSeek-R1-Distill-Qwen-1.5B    Qwen 2.5   1.5B        Edge, mobile
DeepSeek-R1-Distill-Qwen-7B      Qwen 2.5   7B          Local inference, Ollama
DeepSeek-R1-Distill-Qwen-14B     Qwen 2.5   14B         Mid-range GPU
DeepSeek-R1-Distill-Qwen-32B     Qwen 2.5   32B         High-end GPU
DeepSeek-R1-Distill-Llama-8B     Llama 3.1  8B          Llama ecosystem
DeepSeek-R1-Distill-Llama-70B    Llama 3.3  70B         Multi-GPU server

The 7B distilled model runs on a single 8GB VRAM GPU via Ollama:

ollama pull deepseek-r1:7b
ollama run deepseek-r1:7b "What is the time complexity of merge sort and why?"

The <think> blocks appear in the terminal output. Performance on reasoning benchmarks is closer to the full 671B model than the parameter count suggests — fine-tuning on R1's traces transfers the reasoning behavior itself, not just surface style.
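
The same local model is reachable over Ollama's HTTP API, which listens on port 11434 by default. A minimal sketch using only the standard library; the payload builder is split out so the request shape is visible:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "deepseek-r1:7b") -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(prompt: str, host: str = "http://localhost:11434") -> str:
    """Return the full completion, including any <think> block the
    distilled model emits. Requires a running Ollama server."""
    data = json.dumps(build_payload(prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With stream set to False the server returns one JSON object whose response field holds the whole completion, <think> tags included — pair it with a tag-splitting helper if you want the trace and answer separately.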


Production Considerations

Token cost: Thinking tokens count toward your usage bill on the DeepSeek API. A hard math problem can generate 2,000–8,000 reasoning tokens before the answer. Budget accordingly.
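
A back-of-the-envelope budgeting helper. The price constant is a placeholder — check DeepSeek's current pricing page for real numbers — but the key point it encodes is from the paragraph above: thinking tokens bill as output tokens.

```python
# Placeholder price -- check DeepSeek's pricing page for current numbers.
PRICE_PER_1M_OUTPUT_USD = 2.19

def worst_case_cost(requests: int, reasoning_tokens: int,
                    answer_tokens: int) -> float:
    """Upper-bound spend: thinking tokens bill as output tokens,
    so budget for reasoning + answer on every request."""
    total_tokens = requests * (reasoning_tokens + answer_tokens)
    return total_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT_USD

# 1,000 hard prompts at ~8,000 thinking + 500 answer tokens each
budget = worst_case_cost(1_000, 8_000, 500)
```

For hard math prompts the reasoning term dominates the answer term by an order of magnitude, which is why R1 bills so differently from a non-reasoning model at the same per-token price.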

Latency: Time-to-first-token is higher than non-reasoning models because the model must complete the thinking pass before the answer begins streaming. Plan for 5–15 seconds on complex prompts.

Temperature: The hosted deepseek-reasoner endpoint does not support sampling parameters such as temperature. For the distilled models, DeepSeek recommends 0.5–0.7, with 0.6 as a good default: higher temperatures make the thinking trace wander and degrade answer quality, while greedy decoding can trap the trace in repetitive loops.

Context window: DeepSeek-R1 supports 128K context, but thinking tokens consume that budget. In long multi-turn conversations, the accumulated reasoning from previous turns can crowd out your actual content.


Summary

  • R1 generates chain-of-thought through reinforcement learning on outcome rewards — reasoning emerged without step-by-step human labels
  • The <think> block is external working memory: reasoning tokens are in context and attended to by the final answer
  • This architecture explains benchmark gains on math and code — sequential problems benefit from explicit intermediate steps
  • Use reasoning_content when the trace is the product; use only content when you want the answer
  • Don't feed thinking traces back into conversation history
  • The 7B distilled model runs locally via Ollama and outperforms much larger non-reasoning models on structured tasks

Tested against DeepSeek-R1 API v1, DeepSeek-R1-Distill-Qwen-7B via Ollama 0.5.4, Ubuntu 24.04