Problem: Your LLM Request Fails With a Context Length Error
LLM token counting with Tiktoken is the fastest way to measure prompt size before it hits the API — and before you pay for a rejected request.
You're sending a prompt. The API returns context_length_exceeded or silently truncates your input. You don't know how many tokens you actually used, and you can't reproduce the error locally.
You'll learn:
- How to count tokens accurately for GPT-4o, Claude 3.5, and Gemini 1.5 models
- Which tokenizer each model uses and why it matters
- How to build a pre-flight token check into your Python application
Time: 15 min | Difficulty: Intermediate
Why Token Counts Vary Between Models
Every LLM uses a different tokenizer — a ruleset that splits text into numeric IDs. Feed the same sentence to GPT-4o and Claude and you'll get different token counts. Build your app around the wrong number and you'll hit limits you didn't anticipate.
*Figure: How raw text moves through a tokenizer before reaching the model's context window*
Symptoms:
- `openai.BadRequestError: This model's maximum context length is 128000 tokens` on long documents
- Silent truncation mid-conversation when using message history
- Billing surprises from under-estimated prompt sizes
The root cause is almost always the same: you estimated token count by dividing character count by 4 (a common heuristic) instead of running the actual tokenizer.
Model Token Limits Cheat Sheet
| Model | Tokenizer | Context Window | Max Output |
|---|---|---|---|
| gpt-4o | o200k_base | 128,000 | 16,384 |
| gpt-4o-mini | o200k_base | 128,000 | 16,384 |
| gpt-3.5-turbo | cl100k_base | 16,385 | 4,096 |
| claude-3-5-sonnet-20241022 | Anthropic BPE | 200,000 | 8,192 |
| claude-3-5-haiku-20241022 | Anthropic BPE | 200,000 | 8,192 |
| gemini-1.5-pro | SentencePiece | 2,000,000 | 8,192 |
| gemini-1.5-flash | SentencePiece | 1,000,000 | 8,192 |
| llama-3.1-8b | tiktoken-style BPE | 128,000 | 4,096 |
Tiktoken covers all OpenAI models natively. For Claude and Gemini, you use their SDKs — covered in the steps below.
Solution
Step 1: Install Tiktoken and the Anthropic SDK
```bash
# Pre-built wheels cover most platforms; a Rust compiler is only needed
# if pip falls back to building tiktoken from source
pip install tiktoken anthropic google-generativeai
```
Expected output:
Successfully installed tiktoken-0.7.0
If it fails:
`error: could not find Rust compiler` → run `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`, then retry the install
Step 2: Count Tokens for OpenAI Models
```python
import tiktoken

def count_tokens_openai(text: str, model: str = "gpt-4o") -> int:
    # encoding_for_model returns the same tokenizer the API uses for this model
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

sample = "Explain the transformer architecture in plain English."
print(count_tokens_openai(sample, "gpt-4o"))         # counted with o200k_base
print(count_tokens_openai(sample, "gpt-3.5-turbo"))  # counted with cl100k_base — may differ
```
`encoding_for_model` automatically maps model names to their underlying encoding. GPT-3.5-Turbo and GPT-4 use cl100k_base, while GPT-4o and GPT-4o-mini use the newer o200k_base, which has a larger vocabulary and handles non-Latin scripts and emoji more efficiently. Counts for the same input can therefore differ between the two model families.
```python
# Use get_encoding directly when a model name isn't in tiktoken's registry yet
enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode(sample)
print(len(tokens))   # token count
print(tokens[:5])    # first 5 token IDs
```
Step 3: Count Tokens for Chat Messages (With Overhead)
Chat models add formatting tokens for each message — typically 3–4 tokens per turn for the role and separator. Ignoring this overhead causes off-by-~10 errors on long conversations.
```python
import tiktoken

def count_chat_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    # 3 tokens per message: <|start|>{role}\n{content}<|end|>
    tokens_per_message = 3
    tokens_per_name = 1  # if a name field is present
    total = 0
    for msg in messages:
        total += tokens_per_message
        for key, value in msg.items():
            total += len(enc.encode(value))
            if key == "name":
                total += tokens_per_name
    total += 3  # reply priming tokens added by the API
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(count_chat_tokens(messages))  # total including per-message overhead
```
Step 4: Count Tokens for Claude
Anthropic's tokenizer is not Tiktoken-compatible. The official Python SDK exposes a count_tokens method that calls Anthropic's token-counting endpoint; the call is free of charge but still requires a valid API key.
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

def count_tokens_claude(messages: list[dict], model: str = "claude-3-5-sonnet-20241022") -> int:
    # count_tokens returns a response object whose .input_tokens field is the integer
    response = client.messages.count_tokens(
        model=model,
        messages=messages,
    )
    return response.input_tokens

messages = [{"role": "user", "content": "Explain transformers in plain English."}]
print(count_tokens_claude(messages))  # typically around 9–11; differs slightly from GPT-4o
```
Token counting calls are free and are rate-limited separately from message creation. Anthropic charges $0.003 per 1,000 input tokens for Claude 3.5 Sonnet (as of March 2026). Accurate counts let you estimate cost before each call.
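With an accurate count in hand, cost estimation is simple arithmetic. A minimal sketch — the per-model price table is something you maintain yourself, and the figures below are illustrative, so confirm them against the provider's current pricing page:

```python
# Input price in dollars per million tokens (illustrative — verify before relying on these)
PRICE_PER_MTOK = {
    "claude-3-5-sonnet-20241022": 3.00,  # $0.003 per 1K tokens, as quoted above
    "gpt-4o": 2.50,                      # assumption for illustration
}

def estimate_input_cost(token_count: int, model: str) -> float:
    """Dollar cost of sending token_count input tokens to model."""
    return token_count / 1_000_000 * PRICE_PER_MTOK[model]

print(f"${estimate_input_cost(150_000, 'claude-3-5-sonnet-20241022'):.4f}")  # $0.4500
```

Running the pre-flight count first means this estimate is exact rather than a chars-divided-by-4 guess.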
Step 5: Build a Pre-Flight Guard
Wrap token counting into a guard that raises before the API call — not after.
```python
import tiktoken
import anthropic

MODEL_LIMITS = {
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gpt-3.5-turbo": 16_385,
    "claude-3-5-sonnet-20241022": 200_000,
    "claude-3-5-haiku-20241022": 200_000,
}

OPENAI_MODELS = {"gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"}

def preflight_token_check(
    messages: list[dict],
    model: str,
    reserved_output: int = 2_000,  # reserve headroom for the reply
) -> int:
    limit = MODEL_LIMITS.get(model)
    if limit is None:
        raise ValueError(f"Unknown model: {model}. Add it to MODEL_LIMITS.")
    if model in OPENAI_MODELS:
        enc = tiktoken.encoding_for_model(model)
        # approximate: 3 overhead tokens per message; role and reply-priming tokens add a few more
        token_count = sum(len(enc.encode(m["content"])) for m in messages) + 3 * len(messages)
    else:
        client = anthropic.Anthropic()
        token_count = client.messages.count_tokens(model=model, messages=messages).input_tokens
    effective_limit = limit - reserved_output
    if token_count > effective_limit:
        raise ValueError(
            f"Prompt too large: {token_count} tokens exceeds {effective_limit} "
            f"({limit} limit − {reserved_output} reserved). Trim input before calling."
        )
    return token_count

# Usage
messages = [{"role": "user", "content": "Summarize the following document: " + "x" * 5000}]
try:
    count = preflight_token_check(messages, "gpt-4o")
    print(f"Safe to send — {count} tokens")
except ValueError as e:
    print(f"Blocked: {e}")
Expected output: the padded example comes to roughly 1,265 tokens, far below the 126,000-token effective limit, so the check passes:
Safe to send — 1265 tokens
A prompt that exceeded the effective limit would instead be caught:
Blocked: Prompt too large: ... tokens exceeds 126000 (128000 limit − 2000 reserved). Trim input before calling.
Verification
Run a quick sanity check against the OpenAI API to confirm your local count matches what the server bills:
```python
from openai import OpenAI
import tiktoken

client = OpenAI()  # reads OPENAI_API_KEY from env

text = "The transformer model uses self-attention to process sequences in parallel."
enc = tiktoken.encoding_for_model("gpt-4o")
# +7 = 3 message overhead + 1 role token + 3 reply-priming tokens (single user message)
local_count = len(enc.encode(text)) + 7

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": text}],
    max_tokens=1,  # minimize output cost during verification
)
api_count = response.usage.prompt_tokens
print(f"Local: {local_count} | API: {api_count} | Match: {local_count == api_count}")
```
You should see `Match: True` with identical counts on both sides. If the counts diverge by a few tokens, the per-message overhead convention has likely changed for that model — revisit Step 3 and adjust the overhead constants.
What You Learned
- Tiktoken's `encoding_for_model` gives exact token counts for OpenAI models; use `get_encoding("o200k_base")` for newer models not yet in the registry.
- Chat message overhead adds 3–4 tokens per turn — skipping this causes subtle count errors in multi-turn conversations.
- Claude requires Anthropic's SDK `count_tokens` endpoint; the call is free, with its own rate limits separate from message creation.
- The `reserved_output` parameter in a pre-flight guard prevents errors from models that share the context window between input and output (most do).
- Tiktoken is not useful for Gemini — use the `google-generativeai` SDK's `count_tokens` method, which calls Google's SentencePiece tokenizer.
Tested on Tiktoken 0.7.0, Python 3.12, openai 1.30, anthropic 0.26, macOS Sequoia & Ubuntu 24.04
FAQ
Q: Does Tiktoken work offline?
A: Yes for encoding and decoding. The vocabulary files download once and are cached locally (set the TIKTOKEN_CACHE_DIR environment variable to control where). The Anthropic count_tokens call requires network access to Anthropic's API endpoint.
Q: What is the difference between cl100k_base and o200k_base?
A: cl100k_base is the tokenizer for GPT-3.5-Turbo and GPT-4. o200k_base is newer and is used by GPT-4o and GPT-4o-mini — it has a 200,000-token vocabulary (vs 100,000) and handles non-Latin scripts and emojis more efficiently, producing lower counts for multilingual text.
Q: How do I count tokens for a PDF or image in a vision request?
A: Images use a fixed tile-based formula, not text tokenization. A 512×512 image costs 85 tokens in low-detail mode; high-detail mode uses 170 tokens per 512px tile plus 85 base tokens. PDFs must be extracted to text first — use pypdf or pdfplumber, then count the resulting string.
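The tile formula above can be expressed directly in code. A simplified sketch of the high-detail calculation — it ignores the resize preprocessing the API applies to very large images, so treat it as an upper-bound estimate rather than the exact billing formula:

```python
import math

def image_tokens_high_detail(width: int, height: int) -> int:
    """High-detail image cost: 170 tokens per 512px tile plus 85 base tokens."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(image_tokens_high_detail(512, 512))    # 1 tile  -> 255
print(image_tokens_high_detail(1024, 1024))  # 4 tiles -> 765
```

Low-detail mode is a flat 85 tokens regardless of dimensions, so no function is needed for that case.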
Q: Can I use Tiktoken for Llama 3 models?
A: Llama 3.x ships its own tiktoken-style BPE tokenizer with a 128K vocabulary, so tiktoken.get_encoding("o200k_base") is only a rough approximation — counts can differ by a few tokens per message due to Llama's custom special tokens (<|begin_of_text|>, <|eot_id|>). For exact counts, load the official tokenizer, e.g. via Hugging Face's transformers.AutoTokenizer.