Problem: Your LLM Request Fails With a Context Length Error
LLM token counting with Tiktoken is the fastest way to measure prompt size before it hits the API — and before you pay for a rejected request.
You're sending a prompt. The API returns context_length_exceeded or silently truncates your input. You don't know how many tokens you actually used, and you can't reproduce the error locally.
You'll learn:
- How to count tokens accurately for GPT-4o, Claude 3.5, and Gemini 1.5 models
- Which tokenizer each model uses and why it matters
- How to build a pre-flight token check into your Python application
Time: 15 min | Difficulty: Intermediate
Why Token Counts Vary Between Models
Every LLM uses a different tokenizer — a ruleset that splits text into numeric IDs. Feed the same sentence to GPT-4o and Claude and you'll get different token counts. Build your app around the wrong number and you'll hit limits you didn't anticipate.
*Figure: How raw text moves through a tokenizer before reaching the model's context window*
Symptoms:
- `openai.BadRequestError: This model's maximum context length is 128000 tokens` on long documents
- Silent truncation mid-conversation when using message history
- Billing surprises from under-estimated prompt sizes
The root cause is almost always the same: you estimated token count by dividing character count by 4 (a common heuristic) instead of running the actual tokenizer.
Model Token Limits Cheat Sheet
| Model | Tokenizer | Context Window | Max Output |
|---|---|---|---|
| gpt-4o | o200k_base | 128,000 | 16,384 |
| gpt-4o-mini | o200k_base | 128,000 | 16,384 |
| gpt-3.5-turbo | cl100k_base | 16,385 | 4,096 |
| claude-3-5-sonnet-20241022 | Anthropic BPE | 200,000 | 8,192 |
| claude-3-5-haiku-20241022 | Anthropic BPE | 200,000 | 8,192 |
| gemini-1.5-pro | SentencePiece | 2,000,000 | 8,192 |
| gemini-1.5-flash | SentencePiece | 1,000,000 | 8,192 |
| llama-3.1-8b | tiktoken-style BPE | 128,000 | 4,096 |
Tiktoken covers all OpenAI models natively. For Claude and Gemini, you use their SDKs — covered in the steps below.
Solution
Step 1: Install Tiktoken and the Anthropic SDK
```bash
# Pre-built wheels cover most platforms; a Rust compiler is only needed
# if pip falls back to building tiktoken from source
pip install tiktoken anthropic google-generativeai
```
Expected output:
Successfully installed tiktoken-0.7.0
If it fails:
`error: could not find Rust compiler` → run `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`, then retry the install
Step 2: Count Tokens for OpenAI Models
```python
import tiktoken

def count_tokens_openai(text: str, model: str = "gpt-4o") -> int:
    # encoding_for_model returns the same tokenizer the API uses for this model
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

sample = "Explain the transformer architecture in plain English."
print(count_tokens_openai(sample, "gpt-4o"))         # counted with o200k_base
print(count_tokens_openai(sample, "gpt-3.5-turbo"))  # counted with cl100k_base — may differ
```
`encoding_for_model` automatically maps model names to their underlying encoding. GPT-3.5-Turbo and GPT-4 use cl100k_base, while GPT-4o and GPT-4o-mini use the newer o200k_base, which has a larger vocabulary and handles non-Latin scripts and emoji more efficiently. Counts for the same input can therefore differ between the two model families.
```python
# Use get_encoding directly when a model name isn't in tiktoken's registry yet
enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode(sample)
print(len(tokens))   # token count
print(tokens[:5])    # first 5 token IDs
```
Step 3: Count Tokens for Chat Messages (With Overhead)
Chat models add formatting tokens for each message — typically 3–4 tokens per turn for the role and separator. Ignoring this overhead causes off-by-~10 errors on long conversations.
```python
import tiktoken

def count_chat_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    # 3 tokens per message: <|start|>{role}\n{content}<|end|>
    tokens_per_message = 3
    tokens_per_name = 1  # if a name field is present
    total = 0
    for msg in messages:
        total += tokens_per_message
        for key, value in msg.items():
            total += len(enc.encode(value))
            if key == "name":
                total += tokens_per_name
    total += 3  # reply priming tokens added by the API
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(count_chat_tokens(messages))  # total including per-message overhead
```
Step 4: Count Tokens for Claude
Anthropic's tokenizer is not Tiktoken-compatible. The official Python SDK exposes a count_tokens method that calls Anthropic's token-counting endpoint; the call is free of charge but still requires a valid API key.
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

def count_tokens_claude(messages: list[dict], model: str = "claude-3-5-sonnet-20241022") -> int:
    # count_tokens returns a response object whose .input_tokens field is the integer
    response = client.messages.count_tokens(
        model=model,
        messages=messages,
    )
    return response.input_tokens

messages = [{"role": "user", "content": "Explain transformers in plain English."}]
print(count_tokens_claude(messages))  # typically around 9–11; differs slightly from GPT-4o
```
Token counting calls are free and are rate-limited separately from message creation. Anthropic charges $0.003 per 1,000 input tokens for Claude 3.5 Sonnet (as of March 2026). Accurate counts let you estimate cost before each call.
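With an accurate count in hand, cost estimation is simple arithmetic. A minimal sketch — the per-model price table is something you maintain yourself, and the figures below are illustrative, so confirm them against the provider's current pricing page:

```python
# Input price in dollars per million tokens (illustrative — verify before relying on these)
PRICE_PER_MTOK = {
    "claude-3-5-sonnet-20241022": 3.00,  # $0.003 per 1K tokens, as quoted above
    "gpt-4o": 2.50,                      # assumption for illustration
}

def estimate_input_cost(token_count: int, model: str) -> float:
    """Dollar cost of sending token_count input tokens to model."""
    return token_count / 1_000_000 * PRICE_PER_MTOK[model]

print(f"${estimate_input_cost(150_000, 'claude-3-5-sonnet-20241022'):.4f}")  # $0.4500
```

Running the pre-flight count first means this estimate is exact rather than a chars-divided-by-4 guess.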
Step 5: Build a Pre-Flight Guard
Wrap token counting into a guard that raises before the API call — not after.
```python
import tiktoken
import anthropic

MODEL_LIMITS = {
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gpt-3.5-turbo": 16_385,
    "claude-3-5-sonnet-20241022": 200_000,
    "claude-3-5-haiku-20241022": 200_000,
}

OPENAI_MODELS = {"gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"}

def preflight_token_check(
    messages: list[dict],
    model: str,
    reserved_output: int = 2_000,  # reserve headroom for the reply
) -> int:
    limit = MODEL_LIMITS.get(model)
    if limit is None:
        raise ValueError(f"Unknown model: {model}. Add it to MODEL_LIMITS.")
    if model in OPENAI_MODELS:
        enc = tiktoken.encoding_for_model(model)
        # approximate: 3 overhead tokens per message; role and reply-priming tokens add a few more
        token_count = sum(len(enc.encode(m["content"])) for m in messages) + 3 * len(messages)
    else:
        client = anthropic.Anthropic()
        token_count = client.messages.count_tokens(model=model, messages=messages).input_tokens
    effective_limit = limit - reserved_output
    if token_count > effective_limit:
        raise ValueError(
            f"Prompt too large: {token_count} tokens exceeds {effective_limit} "
            f"({limit} limit − {reserved_output} reserved). Trim input before calling."
        )
    return token_count

# Usage
messages = [{"role": "user", "content": "Summarize the following document: " + "x" * 5000}]
try:
    count = preflight_token_check(messages, "gpt-4o")
    print(f"Safe to send — {count} tokens")
except ValueError as e:
    print(f"Blocked: {e}")
Expected output: the padded example comes to roughly 1,265 tokens, far below the 126,000-token effective limit, so the check passes:
Safe to send — 1265 tokens
A prompt that exceeded the effective limit would instead be caught:
Blocked: Prompt too large: ... tokens exceeds 126000 (128000 limit − 2000 reserved). Trim input before calling.
Verification
Run a quick sanity check against the OpenAI API to confirm your local count matches what the server bills:
```python
from openai import OpenAI
import tiktoken

client = OpenAI()  # reads OPENAI_API_KEY from env

text = "The transformer model uses self-attention to process sequences in parallel."
enc = tiktoken.encoding_for_model("gpt-4o")
# +7 = 3 message overhead + 1 role token + 3 reply-priming tokens (single user message)
local_count = len(enc.encode(text)) + 7

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": text}],
    max_tokens=1,  # minimize output cost during verification
)
api_count = response.usage.prompt_tokens
print(f"Local: {local_count} | API: {api_count} | Match: {local_count == api_count}")
```
You should see `Match: True` with identical counts on both sides. If the counts diverge by a few tokens, the per-message overhead convention has likely changed for that model — revisit Step 3 and adjust the overhead constants.
What You Learned
- Tiktoken's `encoding_for_model` gives exact token counts for OpenAI models; use `get_encoding("o200k_base")` for newer models not yet in the registry.
- Chat message overhead adds 3–4 tokens per turn — skipping this causes subtle count errors in multi-turn conversations.
- Claude requires Anthropic's SDK `count_tokens` endpoint; the call is free, with its own rate limits separate from message creation.
- The `reserved_output` parameter in a pre-flight guard prevents errors from models that share the context window between input and output (most do).
- Tiktoken is not useful for Gemini — use the `google-generativeai` SDK's `count_tokens` method, which calls Google's SentencePiece tokenizer.
Tested on Tiktoken 0.7.0, Python 3.12, openai 1.30, anthropic 0.26, macOS Sequoia & Ubuntu 24.04
FAQ
Q: Does Tiktoken work offline?
A: Yes for encoding and decoding. The vocabulary files download once and are cached locally (set the TIKTOKEN_CACHE_DIR environment variable to control where). The Anthropic count_tokens call requires network access to Anthropic's API endpoint.
Q: What is the difference between cl100k_base and o200k_base?
A: cl100k_base is the tokenizer for GPT-3.5-Turbo and GPT-4. o200k_base is newer and is used by GPT-4o and GPT-4o-mini — it has a 200,000-token vocabulary (vs 100,000) and handles non-Latin scripts and emojis more efficiently, producing lower counts for multilingual text.
Q: How do I count tokens for a PDF or image in a vision request?
A: Images use a fixed tile-based formula, not text tokenization. A 512×512 image costs 85 tokens in low-detail mode; high-detail mode uses 170 tokens per 512px tile plus 85 base tokens. PDFs must be extracted to text first — use pypdf or pdfplumber, then count the resulting string.
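The tile formula above can be expressed directly in code. A simplified sketch of the high-detail calculation — it ignores the resize preprocessing the API applies to very large images, so treat it as an upper-bound estimate rather than the exact billing formula:

```python
import math

def image_tokens_high_detail(width: int, height: int) -> int:
    """High-detail image cost: 170 tokens per 512px tile plus 85 base tokens."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(image_tokens_high_detail(512, 512))    # 1 tile  -> 255
print(image_tokens_high_detail(1024, 1024))  # 4 tiles -> 765
```

Low-detail mode is a flat 85 tokens regardless of dimensions, so no function is needed for that case.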
Q: Can I use Tiktoken for Llama 3 models?
A: Llama 3.x ships its own tiktoken-style BPE tokenizer with a 128K vocabulary, so tiktoken.get_encoding("o200k_base") is only a rough approximation — counts can differ by a few tokens per message due to Llama's custom special tokens (<|begin_of_text|>, <|eot_id|>). For exact counts, load the official tokenizer, e.g. via Hugging Face's transformers.AutoTokenizer.