Problem: DeepSeek R1 Returns Reasoning Tokens You Don't Know How to Handle
DeepSeek R1 isn't a standard chat API. It returns a reasoning_content field alongside the regular content field — and if you send requests without accounting for that, you'll get malformed responses, hit token limits unexpectedly, or silently discard the reasoning trace your prompt was designed to produce.
You'll learn:
- How to authenticate and send your first DeepSeek R1 API request in Python
- How to extract and use reasoning_content vs content correctly
- How to stream responses and handle rate limits without dropping requests
Time: 20 min | Difficulty: Intermediate
Why DeepSeek R1 Is Different from Standard LLM APIs
Most LLM APIs return one field: the model's reply. DeepSeek R1 is a reasoning model — it thinks before it answers. That thinking is exposed as reasoning_content in the response object, separate from the final content.
If you build a client expecting only content, you'll miss the reasoning entirely. Worse, if you pass reasoning_content back in the next turn's message history, the API will error — it's output-only.
Two fields to know:
- message.reasoning_content — the chain-of-thought trace (can be thousands of tokens)
- message.content — the final answer to return to users
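The split can be sketched as a tiny helper before any client code exists. The getattr guard is a defensive assumption for responses that omit the field entirely (for example, from deepseek-chat):

```python
def split_message(message) -> dict:
    """Separate R1's thinking from its final answer."""
    return {
        # getattr guards against message objects with no reasoning_content at all
        "reasoning": getattr(message, "reasoning_content", None),
        "answer": message.content,
    }
```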
Solution
Step 1: Install the OpenAI SDK and Set Up Your Key
DeepSeek's API is OpenAI-compatible. Use the openai Python package with a custom base_url.
# uv is faster than pip — use it if you have it
uv add openai python-dotenv
# or pip
pip install openai python-dotenv
Create a .env file:
DEEPSEEK_API_KEY=sk-your-key-here
Get your key from platform.deepseek.com.
Step 2: Initialize the Client
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(
api_key=os.environ["DEEPSEEK_API_KEY"],
base_url="https://api.deepseek.com", # Not api.openai.com
)
The base_url override is all that separates this from a standard OpenAI client. Every other method call is identical.
Step 3: Send a Request and Extract Both Fields
def ask_r1(prompt: str) -> dict:
response = client.chat.completions.create(
model="deepseek-reasoner", # R1 model identifier
messages=[{"role": "user", "content": prompt}],
)
message = response.choices[0].message
return {
# Chain-of-thought — useful for debugging, logging, or showing "thinking"
"reasoning": message.reasoning_content,
# Final answer — what you show to users
"answer": message.content,
}
result = ask_r1("What is the time complexity of merge sort, and why?")
print("=== Reasoning ===")
print(result["reasoning"])
print("\n=== Answer ===")
print(result["answer"])
Expected output:
=== Reasoning ===
Let me think through merge sort step by step. Merge sort divides the array...
[several paragraphs of reasoning]
=== Answer ===
Merge sort has O(n log n) time complexity in all cases...
If it fails:
- AuthenticationError → Double-check DEEPSEEK_API_KEY in .env and that load_dotenv() runs before the client is created
- model not found → Use "deepseek-reasoner" exactly — not "deepseek-r1" or "r1"
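A quick preflight check catches the most common of these failures before any request is sent. This is a sketch; it assumes DeepSeek keys use the sk- prefix shown in the .env example above:

```python
import os

def preflight() -> None:
    """Fail fast on a missing or malformed API key."""
    key = os.environ.get("DEEPSEEK_API_KEY", "")
    if not key:
        raise RuntimeError("DEEPSEEK_API_KEY is unset: did load_dotenv() run first?")
    if not key.startswith("sk-"):
        raise RuntimeError("DEEPSEEK_API_KEY looks malformed (expected an sk- prefix)")
```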
Step 4: Stream Responses for Long Reasoning Traces
R1's reasoning can run 2,000–8,000 tokens before producing an answer. Without streaming, your user waits in silence for 30+ seconds. Stream it.
def ask_r1_stream(prompt: str) -> dict:
stream = client.chat.completions.create(
model="deepseek-reasoner",
messages=[{"role": "user", "content": prompt}],
stream=True,
)
reasoning_buffer = []
answer_buffer = []
in_reasoning = True
for chunk in stream:
delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):
# Still in the thinking phase
reasoning_buffer.append(delta.reasoning_content)
print(delta.reasoning_content, end="", flush=True)
elif delta.content:
# Reasoning finished; answer is streaming now
if in_reasoning:
in_reasoning = False
print("\n\n--- Answer ---\n")
answer_buffer.append(delta.content)
print(delta.content, end="", flush=True)
print() # Newline after stream ends
return {
"reasoning": "".join(reasoning_buffer),
"answer": "".join(answer_buffer),
}
The transition from reasoning_content chunks to content chunks marks when R1 stops thinking and starts answering. Track that boundary to split your UI into "thinking..." and final response states.
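One way to hand that boundary to a UI layer is to normalize the stream into tagged events. This is a sketch that works on any iterable of chunk-shaped objects; the first "answer" event is the phase transition:

```python
def r1_events(stream):
    """Yield ("reasoning", text) events, then ("answer", text) events.

    The first "answer" event marks the end of the thinking phase.
    """
    for chunk in stream:
        delta = chunk.choices[0].delta
        # getattr: chunks from non-reasoner models have no reasoning_content
        if getattr(delta, "reasoning_content", None):
            yield ("reasoning", delta.reasoning_content)
        elif delta.content:
            yield ("answer", delta.content)
```

A consumer can switch from a "thinking..." state to the final-response state the moment the event tag changes.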
Step 5: Build Correct Multi-Turn Message History
This is where most R1 integrations break. You cannot include reasoning_content in the conversation history you send back to the API.
# ✅ Correct — only include role/content in history
def build_history(turns: list[dict]) -> list[dict]:
"""
turns = [{"user": "...", "answer": "..."}, ...]
reasoning_content is NOT sent back — API rejects it
"""
messages = []
for turn in turns:
messages.append({"role": "user", "content": turn["user"]})
messages.append({"role": "assistant", "content": turn["answer"]})
return messages
conversation = []
def chat(user_message: str) -> str:
history = build_history(conversation)
history.append({"role": "user", "content": user_message})
response = client.chat.completions.create(
model="deepseek-reasoner",
messages=history,
)
message = response.choices[0].message
result = {
"user": user_message,
"reasoning": message.reasoning_content,
"answer": message.content,
}
conversation.append(result)
return result["answer"]
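If you store raw message dicts rather than the turns structure above, strip the extra field before resending. sanitize_history here is a hypothetical helper, not part of the SDK:

```python
def sanitize_history(messages: list[dict]) -> list[dict]:
    """Keep only the keys the API accepts in inbound history."""
    # reasoning_content (or any other extra key) must not be sent back
    return [{"role": m["role"], "content": m["content"]} for m in messages]
```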
Step 6: Handle Rate Limits with Exponential Backoff
DeepSeek throttles requests under load, and a 429 can arrive on any tier, especially during peak hours. Don't let it crash your app.
import time
import random
from openai import RateLimitError
def ask_r1_with_retry(prompt: str, max_retries: int = 4) -> dict:
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model="deepseek-reasoner",
messages=[{"role": "user", "content": prompt}],
)
message = response.choices[0].message
return {"reasoning": message.reasoning_content, "answer": message.content}
except RateLimitError:
if attempt == max_retries - 1:
raise # Out of retries, let it bubble up
# Exponential backoff with jitter: 2s, 4s, 8s + random 0–1s
wait = (2 ** attempt) + random.random()
print(f"Rate limited. Retrying in {wait:.1f}s...")
time.sleep(wait)
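The same pattern generalizes to any call with a decorator. with_backoff is a name invented here; it keeps the 2s/4s/8s schedule while staying independent of the SDK, so you can apply it as @with_backoff(RateLimitError) to any function wrapping the create call:

```python
import functools
import random
import time

def with_backoff(exc_type, max_retries: int = 4, base: float = 2.0):
    """Retry on exc_type with exponential backoff plus proportional jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except exc_type:
                    if attempt == max_retries - 1:
                        raise  # out of retries, let it bubble up
                    # 2s, 4s, 8s with base=2.0, plus up to `base` seconds of jitter
                    time.sleep(base * (2 ** attempt) + random.random() * base)
        return wrapper
    return decorator
```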
Verification
Run this end-to-end test:
if __name__ == "__main__":
result = ask_r1_with_retry("Solve: if 2x + 3 = 11, what is x?")
assert result["answer"], "answer field is empty"
assert result["reasoning"], "reasoning field is empty — check model name"
assert "4" in result["answer"], "unexpected answer content"
print("✅ DeepSeek R1 client working correctly")
    print(f"Reasoning length: ~{len(result['reasoning'].split())} words")
    print(f"Answer: {result['answer']}")
You should see:
✅ DeepSeek R1 client working correctly
Reasoning length: ~180 words
Answer: x = 4
What You Learned
- deepseek-reasoner returns two separate fields: reasoning_content (thinking) and content (answer) — handle both explicitly
- Never include reasoning_content in conversation history — send only content back
- Streaming is essential for R1; reasoning traces routinely exceed 2,000 tokens before the answer begins
- Exponential backoff with jitter is the correct pattern for 429s on shared API infrastructure
Limitation: deepseek-reasoner does not support function calling or tool use as of March 2026. For tool-use workflows, use deepseek-chat (V3) instead and call R1 separately for reasoning-heavy subtasks.
Tested on DeepSeek API v1, openai SDK 1.68.0, Python 3.12, macOS and Ubuntu 24.04