Your document ingestion system processes 100 docs/min fine. At 10,000 docs/min, your LLM worker crashes and you lose data. Kafka makes it elastic.
That crash isn't a bug; it's a design flaw. You’ve bolted a slow, expensive, stateful LLM API call onto a fast, stateless ingestion service. The result is backpressure that either melts your service or forces you to drop messages. The solution isn't a bigger server; it's a decoupled, parallel, and resilient pipeline. This is where Apache Kafka stops being a "cool messaging system" and starts being the backbone that lets your AI features scale without setting your pager on fire.
Think of Kafka not as a queue, but as a central nervous system for your documents. This architecture is why Kafka processes 7 trillion messages per day at LinkedIn and why 80% of Fortune 100 companies rely on it for real-time data pipelines. We’re going to build a pipeline that ingests a PDF, splits it, sends chunks to an LLM for enrichment, generates vector embeddings, and stores them—without ever losing a document or overloading your AI budget.
Pipeline Architecture: From Firehose to Vector Store
A naive pipeline is a straight line: Ingest → Parse → LLM → Embed → Store. When the LLM step (taking 2+ seconds per chunk) blocks, the whole line stops. Our Kafka pipeline looks more like a factory with parallel workstations connected by conveyor belts (topics).
Raw Document Topic → Parser Consumer Group → Document Chunk Topic → LLM Enrichment Consumer Group → Enriched Chunk Topic → Embedding Consumer Group → Vector Store Sink
Each arrow is a Kafka topic. Each "Consumer Group" is a scalable pool of workers. The LLM step, your bottleneck, can now be scaled independently. If your embedding service has a hiccup, the enriched chunks pile up safely in their topic (retention.ms=604800000—a week's buffer—is your friend) instead of blowing up the LLM workers.
Topic Design: Your Partition Strategy is Your Parallelism Strategy
This is the most critical decision. The partition count of a topic is the maximum parallelism for any consumer group reading it. Want to run 50 LLM workers? Your document-chunks topic needs at least 50 partitions. More is fine; consumers will idle.
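Topic provisioning can be done programmatically rather than by hand. A sketch using confluent-kafka's `AdminClient`; the topic name, partition count, and week-long retention come from this article, while `replication_factor=3` and the broker addresses are assumptions you'd adjust for your cluster:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'kafka-1:9092,kafka-2:9092'})

# 50 partitions = headroom for 50 parallel LLM workers;
# retention.ms = a week of buffer if a downstream stage stalls.
topic = NewTopic(
    'document-chunks',
    num_partitions=50,
    replication_factor=3,
    config={'retention.ms': '604800000'}
)

futures = admin.create_topics([topic])
for name, future in futures.items():
    try:
        future.result()  # Raises on failure (e.g., topic already exists)
        print(f"Created {name}")
    except Exception as e:
        print(f"Failed to create {name}: {e}")
```

`create_topics` is asynchronous and returns one future per topic, which is why the result loop is there.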
But partitioning randomly is a disaster for LLM context. You need related chunks to be processed in order. The key is your message key. For document chunks, the key should be the document_id. Kafka guarantees order within a partition. All chunks for Document A go to Partition X and are processed in order. Chunks for Document B go elsewhere, enabling parallel processing across documents.
Here’s how you produce with a proper key using confluent-kafka, the production-grade library:
```python
from confluent_kafka import Producer
import json

producer_conf = {
    'bootstrap.servers': 'kafka-1:9092,kafka-2:9092',
    'retries': 10,           # Crucial for resilience
    'retry.backoff.ms': 500,
    'batch.size': 65536,     # Tune for throughput
    'linger.ms': 5
}
producer = Producer(producer_conf)

def produce_document_chunk(document_id, chunk_num, chunk_text):
    topic = 'document-chunks'
    key = document_id.encode('utf-8')  # Key = document_id for ordering
    value = json.dumps({'chunk_num': chunk_num, 'text': chunk_text})
    producer.produce(topic=topic, key=key, value=value)
    producer.poll(0)  # Serve delivery callbacks without blocking

# Flush once at shutdown, not per message: a per-message flush()
# defeats the batch.size/linger.ms batching you just configured.
producer.flush()
```
Real Error & Fix: If you see `LEADER_NOT_AVAILABLE` right after creating a topic, your code isn't broken. The cluster needs a moment for leader election. The fix: in your producer config, set `retries=10` and `retry.backoff.ms=500`. It will gracefully retry.
Consumer Group for LLM Workers: Throttling is a Feature
Now, the LLM consumer group reads from document-chunks. You have 50 partitions and 50 workers. Perfect scaling, right? Until you hit your OpenAI or Anthropic rate limit (Requests-Per-Minute, RPM) and get a wall of 429 errors.
The solution is to use the consumer itself as a governor. The Java client has a `max.poll.records` setting for exactly this; confluent-kafka (librdkafka) doesn't support it, but it doesn't need to: its `poll()` returns at most one message per call, so each worker processes chunks strictly one at a time. If your LLM call takes 2 seconds, each worker handles at most 30 chunks/minute. 50 workers = 1,500 chunks/minute. Stay under your RPM limit, and add an explicit sleep per loop if you need a bigger safety margin.
```python
from confluent_kafka import Consumer, KafkaError, TopicPartition
import json
import openai
import time

consumer_conf = {
    'bootstrap.servers': 'kafka-1:9092',
    'group.id': 'llm-enrichment-group',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,  # We'll commit manually, after success
}
consumer = Consumer(consumer_conf)
consumer.subscribe(['document-chunks'])

while True:
    msg = consumer.poll(1.0)  # Returns at most one message per call
    if msg is None:
        continue
    if msg.error():
        if msg.error().code() == KafkaError._PARTITION_EOF:
            continue
        print(msg.error())
        break
    chunk_data = json.loads(msg.value())
    try:
        # Your expensive, rate-limited LLM call
        enrichment = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": f"Summarize: {chunk_data['text']}"}]
        )
        # Produce to the next topic (enriched-chunks)
        produce_to_next_topic(msg.key(), enrichment.choices[0].message.content)
        consumer.commit(message=msg)  # Manual commit on success
    except openai.RateLimitError:
        print("Rate limit hit. Pausing partition for 60s.")
        # pause()/resume()/seek() take TopicPartition objects, not bare ints
        tp = TopicPartition(msg.topic(), msg.partition())
        consumer.pause([tp])
        time.sleep(60)
        consumer.resume([tp])
        # Rewind so the same message is re-processed
        consumer.seek(TopicPartition(msg.topic(), msg.partition(), msg.offset()))
```
Dead Letter Topic: The Pipeline's Immune System
Some documents will fail. Malformed PDFs, unsupported languages, LLM timeouts. If you throw an exception and crash, you stop processing. If you skip the message, you lose data. The pattern: catch, route, continue.
For every main topic (document-chunks), create a corresponding document-chunks-dlt (Dead Letter Topic). On any non-transient failure (not a rate limit), publish the failed message—key, value, and error context—to the DLT.
```python
except Exception as e:
    # Non-transient failure: route to the Dead Letter Topic
    dlt_value = {
        'original_key': msg.key().decode('utf-8') if msg.key() else None,
        'original_value': msg.value().decode('utf-8'),
        'error': str(e),
        'topic': msg.topic(),
        'partition': msg.partition(),
        'offset': msg.offset()
    }
    producer.produce(topic='document-chunks-dlt', value=json.dumps(dlt_value))
    producer.poll(0)
    # Still commit the original offset: we've handled it
    consumer.commit(message=msg)
```
Your pipeline keeps running. Later, a separate process can inspect the DLT, fix issues (e.g., re-run OCR), and re-inject messages to the main topic.
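A minimal replay helper for that re-injection step, assuming the DLT payload shape from the snippet above (`rebuild_record` is a name introduced here, not part of any library):

```python
import json

def rebuild_record(dlt_value):
    """Turn a DLT payload (shaped like the dict produced above) back into
    a (key, value) pair ready for re-injection into the main topic."""
    payload = json.loads(dlt_value)
    key = payload['original_key']
    key = key.encode('utf-8') if key is not None else None
    value = payload['original_value'].encode('utf-8')
    return key, value

# Replay loop sketch, reusing a Consumer subscribed to 'document-chunks-dlt'
# and a Producer configured like the earlier snippets:
#
#   msg = dlt_consumer.poll(1.0)
#   if msg is not None and not msg.error():
#       key, value = rebuild_record(msg.value())
#       producer.produce('document-chunks', key=key, value=value)
#       dlt_consumer.commit(message=msg)
```

Because the key is preserved, replayed chunks land on the same partition as the rest of their document, so ordering guarantees survive the round trip.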
Backpressure Handling: Don't Let the Tail Wag the Dog
What if the embedding store (e.g., Pinecone, Weaviate) gets slow? The enriched-chunks topic will start to lag. Without controls, your LLM workers will keep producing, filling the topic and increasing latency.
This is backpressure. Kafka gives you two primary levers:
- Paced consumption: Slow the embedding consumer's own intake—process fewer messages per poll loop, or sleep between batches. (The Java client exposes `max.poll.records` for exactly this; with confluent-kafka, you pace the loop yourself.)
- Consumer `pause()`/`resume()`: The LLM consumer group can monitor the lag of the next stage. If the `enriched-chunks` lag exceeds a threshold (e.g., 10,000 messages), the LLM workers should `pause()` their partitions, temporarily stopping production.
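The lag check itself is simple arithmetic once you have the offsets. In production you'd read the high watermarks with `Consumer.get_watermark_offsets()` and the downstream group's positions with `Consumer.committed()`; the helpers below are a pure sketch of the decision logic (function names are mine):

```python
def total_lag(high_watermarks, committed_offsets):
    """Sum of (latest offset - committed offset) across partitions.
    Both arguments map partition number -> offset."""
    lag = 0
    for partition, high in high_watermarks.items():
        committed = committed_offsets.get(partition, 0)
        lag += max(0, high - committed)
    return lag

def should_pause(high_watermarks, committed_offsets, threshold=10_000):
    """True if upstream workers should pause() to let the next stage catch up."""
    return total_lag(high_watermarks, committed_offsets) > threshold
```

Run this check on a timer in the LLM workers; pair it with a matching `should_resume` at a lower threshold so you don't flap between paused and running.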
Real Error & Fix: If consumer lag grows indefinitely, the problem is usually a mismatch between fetch rate and processing speed. The fix: measure your per-message processing time; if you process slower than you fetch, you will fall behind no matter what. Also, ensure your partition count is sufficient—if you have 1 partition and 10 consumers, 9 are idle.
Exactly-Once Semantics: When You Need It and When You Don't
Kafka's exactly-once semantics (EOS) are powerful but costly. They use transactional producers and a read_committed isolation level to guarantee a message is processed "once and only once." Do you need it?
Yes, if: You are doing financial ledger updates or deduplicating in a multi-step pipeline where duplicate embedding writes would corrupt your vector store. No, if: You are doing idempotent operations (like overwriting a document summary) or can tolerate at-least-once delivery with deduplication at the sink.
EOS adds latency and complexity. For our pipeline, idempotent producers (`enable.idempotence=true`) are often enough: they prevent duplicates from producer retries. Combine this with manual consumer commits after successful processing, and you get a robust at-least-once system. Add an idempotency key (like `document_id + chunk_num`) at your vector store sink for final deduplication.
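That sink-side deduplication can be as simple as a deterministic ID. A sketch against an in-memory stand-in for the vector store; a real sink (Pinecone, pgvector) would upsert by this same ID:

```python
def chunk_id(document_id, chunk_num):
    """Deterministic idempotency key: re-delivery overwrites, never duplicates."""
    return f"{document_id}:{chunk_num}"

class VectorStoreSink:
    """In-memory stand-in; a real sink would upsert by ID
    (e.g., Pinecone upsert, or INSERT ... ON CONFLICT with pgvector)."""
    def __init__(self):
        self.records = {}

    def upsert(self, document_id, chunk_num, embedding):
        self.records[chunk_id(document_id, chunk_num)] = embedding

sink = VectorStoreSink()
sink.upsert('doc-42', 0, [0.1, 0.2])
sink.upsert('doc-42', 0, [0.1, 0.2])  # At-least-once re-delivery: no duplicate
print(len(sink.records))  # 1
```

With this in place, an occasional duplicate from a producer retry or a consumer rewind costs you one redundant write, not a corrupted index.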
Monitoring: Lag is Your Canary in the Coal Mine
You don't monitor Kafka by checking if it's "up." You monitor consumer lag—the difference between the latest offset in a topic and the last committed offset by the consumer group. Lag is the backlog. Lag > 0 is normal. Lag growing exponentially is a five-alarm fire.
Set an alert: "Alert if lag > 10,000 for more than 5 minutes." This is your circuit breaker. It means your pipeline is falling behind and you need to either scale consumers or diagnose a downstream failure.
Here’s a quick comparison of why you’d choose Kafka over a traditional message broker for this AI workload, based on the benchmark data:
| Metric | Kafka (3.7, KRaft) | RabbitMQ (High-throughput config) | Implication for AI Pipeline |
|---|---|---|---|
| P99 Latency at 100K msgs/s | 8 ms | 340 ms | Predictable throughput for real-time enrichment. |
| Sustained Throughput | 2-5 GB/s (cluster) | Degrades under sustained load | Handles document burst from enterprise crawlers. |
| Message Size Efficiency | Avro: 40% smaller than JSON | Typically JSON/Protocol Buffers | Lower cost, faster network transfer for large chunks. |
| Rebalance Time (50 consumers) | 2-8 seconds | N/A (different model) | Faster scaling of LLM worker pools up/down. |
Real Error & Fix: If you get org.apache.kafka.common.errors.RecordTooLargeException, your document chunk is bigger than the broker's limit. The fix: increase message.max.bytes on the broker and max.request.size on the producer to match. But first, ask if you should chunk your documents smaller.
Next Steps: From Pipeline to Platform
You now have a resilient, scalable pipeline. Where next?
- Schema Registry: Move from JSON to Avro (40% smaller, 3x faster deserialization). Define schemas for your document chunks, enriched data, and embeddings. This ensures compatibility as your data model evolves.
- Kafka Connect: Use the 200+ production-ready connectors to replace custom ingestion/egress code. Pull documents directly from Google Drive (Source Connector) and sink embeddings to PostgreSQL with pgvector (Sink Connector).
- ksqlDB for Lightweight Enrichment: Before the LLM, use ksqlDB to filter out empty chunks, detect language, or redact PII using simple SQL-like rules. Offload your stateless logic.
- Infrastructure as Code: Use Strimzi to manage your Kafka cluster on Kubernetes. Define topics, users, and connectors via YAML.
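To make the Schema Registry step concrete, here's what a chunk schema might look like. The schema itself is a sketch (field names match this article's payloads); in confluent-kafka you'd hand it to `AvroSerializer` via a `SchemaRegistryClient`, as the commented wiring shows:

```python
import json

# Sketch of an Avro schema for the document-chunks topic
chunk_schema = json.dumps({
    "type": "record",
    "name": "DocumentChunk",
    "namespace": "pipeline.documents",
    "fields": [
        {"name": "document_id", "type": "string"},
        {"name": "chunk_num", "type": "int"},
        {"name": "text", "type": "string"},
    ]
})

# Wiring sketch (requires confluent-kafka[avro] and a running registry):
#
#   from confluent_kafka.schema_registry import SchemaRegistryClient
#   from confluent_kafka.schema_registry.avro import AvroSerializer
#
#   registry = SchemaRegistryClient({'url': 'http://schema-registry:8081'})
#   serializer = AvroSerializer(registry, chunk_schema)
```

The payoff: a producer that tries to drop or retype a field fails schema compatibility checks at publish time, instead of silently breaking every consumer downstream.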
Your pipeline is no longer a chain of fragile microservices. It's a set of scalable, observable, and resilient stages connected by Kafka. Your LLM workers can crash, your vector store can have an outage, and your ingestion rate can spike 100x. The system adapts. Now go configure your retention.ms and sleep through the night.