Your LLM document processing endpoint times out for large files and your users see 504 errors. A job queue makes it async and reliable. You’ve got the GPU horsepower, but your synchronous API is a brittle choke point. Every time a user uploads a 200-page PDF for summarization, your Flask or Express app spins up, blocks the request thread for 45 seconds, and prays the client doesn’t disconnect. This isn't a scaling problem; it's an architectural dead end.
Redis is used by 30% of professional developers—the #2 most popular NoSQL database—for good reason (Stack Overflow 2025). It’s not just a cache. Its atomic data structures are the perfect foundation for a job queue that can handle the messy, long-running, and failure-prone nature of AI workloads. Let’s build one that doesn’t drop jobs on the floor.
Why Your AI API Needs a Queue Yesterday
Your shiny synchronous endpoint is a liability. It couples user request latency to the most unpredictable part of your stack: the LLM provider’s API. A network blip, a temporary rate limit, or a slow model response turns directly into a user-visible failure. A job queue inserts a buffer—a shock absorber—between the HTTP request that enqueues work and the worker process that executes it.
The pattern is simple:
- HTTP Request (Fast): Validate input, create a job in Redis, immediately return a `202 Accepted` with a job ID.
- Queue (Durable): Redis holds the job state durably. Jobs survive application restarts.
- Worker (Async): Separate processes pull jobs, call the LLM API, handle retries, and update status.
- Client Polling/Webhook (Async): The client polls for completion via the job ID or listens for a webhook.
This decoupling means your web server stays responsive, your users get instant feedback, and your background processing can be robust, with retries, priorities, and monitoring. You stop treating AI calls as function calls and start treating them as asynchronous, fallible operations that need management.
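The client-facing half of the pattern is a status endpoint keyed by job ID. The mapping from BullMQ job states to an API response can be a tiny pure function; a sketch, where the response shape (`status`/`result` fields) is an illustrative assumption, not a standard:

```javascript
// Map a BullMQ job state to a client-facing polling response.
// State names ('completed', 'failed', 'active', 'waiting', 'delayed')
// are BullMQ's; the response shape is illustrative.
function jobStatusResponse(state, returnvalue) {
  switch (state) {
    case 'completed':
      return { status: 'done', result: returnvalue };
    case 'failed':
      return { status: 'failed' };
    case 'active':
      return { status: 'processing' };
    default: // 'waiting', 'delayed', etc.
      return { status: 'queued' };
  }
}

console.log(jobStatusResponse('completed', { summary: 'ok' }));
// → { status: 'done', result: { summary: 'ok' } }
console.log(jobStatusResponse('waiting'));
```

In a real route handler you would fetch the job with `queue.getJob(jobId)`, read its state via `job.getState()`, and feed both into a helper like this.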
BullMQ and Redis: The Engine Room
BullMQ is a Node.js library that builds a robust job queue on top of Redis primitives. It doesn't just `LPUSH`/`BRPOP` a single list. Under the hood, it uses Redis Sorted Sets for delayed and prioritized jobs, Lists for waiting jobs, and Hashes to store job state. This is key: BullMQ picks the right data structure for each sub-problem, giving you reliability without having to reinvent the wheel.
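To build intuition for why a sorted set fits prioritized jobs, here is a toy model in plain JavaScript (no Redis): every job gets a score, and the worker always pops the lowest score first, which is the semantics `ZADD`/`ZPOPMIN` give BullMQ. The class and method names are illustrative.

```javascript
// Toy model of insert-with-score + pop-lowest-score ordering.
// Real Redis sorted sets do this in O(log N) via a skiplist;
// this naive version sorts on every pop, for clarity only.
class ToyPriorityQueue {
  constructor() { this.members = []; }
  zadd(score, member) {
    this.members.push({ score, member });
  }
  zpopmin() {
    this.members.sort((a, b) => a.score - b.score);
    return this.members.shift()?.member;
  }
}

const q = new ToyPriorityQueue();
q.zadd(100, 'batch-reindex');  // low priority
q.zadd(1, 'user-summary');     // high priority (lower score wins)
q.zadd(50, 'nightly-report');

console.log(q.zpopmin()); // → 'user-summary'
console.log(q.zpopmin()); // → 'nightly-report'
```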
Here’s the architecture:
- Queue: The manager. Created in your API server. It knows how to add jobs.

```javascript
// In your API server (e.g., Express route handler)
import { Queue } from 'bullmq';
import IORedis from 'ioredis';

const connection = new IORedis({ host: 'localhost', port: 6379 }); // Use ioredis

const documentQueue = new Queue('documentProcessing', { connection });

// When a user uploads a document
app.post('/summarize', async (req, res) => {
  const { documentId, priority } = req.body;
  const job = await documentQueue.add(
    'process-pdf',
    { documentId },
    {
      jobId: `doc_${documentId}`, // Prevent duplicates
      priority: priority || 0, // Lower number = processed sooner in BullMQ
    }
  );
  res.status(202).json({ jobId: job.id });
});
```

- Worker: The doer. Runs in a separate process (or server). It fetches jobs and executes them.

```javascript
// In your worker process (e.g., worker.js)
import { Worker } from 'bullmq';
import IORedis from 'ioredis';
import { callLLMAPI } from './llm-service.js';

// BullMQ workers require maxRetriesPerRequest: null on their connection
const connection = new IORedis({ host: 'localhost', port: 6379, maxRetriesPerRequest: null });

const worker = new Worker(
  'documentProcessing',
  async job => {
    console.log(`Processing job ${job.id}: ${job.data.documentId}`);
    const summary = await callLLMAPI(job.data.documentId);
    return { summary }; // Result stored in Redis
  },
  { connection, concurrency: 5 } // Critical: control parallel execution
);

worker.on('completed', job => {
  console.log(`Job ${job.id} completed! Result:`, job.returnvalue);
  // Here you could trigger a webhook
});
```

- Job Lifecycle: `waiting` -> `active` -> `completed`/`failed`. BullMQ manages this state in Redis hashes, giving you a full audit trail.
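The lifecycle can be made explicit as a small state table. A simplified sketch: it models only the common path (BullMQ also has `delayed` and `paused` states, omitted here), and `canTransition` is a hypothetical helper, not a BullMQ API.

```javascript
// Common-path BullMQ job transitions (simplified; 'delayed'/'paused' omitted).
const TRANSITIONS = {
  waiting: ['active'],
  active: ['completed', 'failed'],
  completed: [],
  failed: ['waiting'], // a retry puts the job back in the queue
};

function canTransition(from, to) {
  return (TRANSITIONS[from] || []).includes(to);
}

console.log(canTransition('waiting', 'active'));   // → true
console.log(canTransition('completed', 'active')); // → false
```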
Designing Jobs for the Real World
Throwing a raw prompt into the queue is asking for trouble. Your job schema is a contract. Enforce it.
- Input Validation: Validate before the queue. The API should reject malformed requests. The job payload should be the minimal, validated data needed for processing (e.g., `documentId`, not the 50MB file content).
- Priority: Use it. User-facing requests get priority `1`. Internal batch jobs get priority `100` (lower priority, since BullMQ treats smaller numbers as more urgent). BullMQ's sorted set (`ZADD`) makes this efficient: `O(log N)` per job. A `ZADD` into a sorted set of 1M members takes ~1.2ms.
- Deduplication: The `jobId` option is your friend. Use a natural key like `doc_${documentId}`. If the user double-clicks, the second `Queue.add()` will be a no-op for an existing ID, preventing duplicate processing.
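The first bullet can be a tiny guard in the API handler that both validates and shrinks the payload. A sketch, where the function name, field names, and the default priority of 100 are illustrative assumptions:

```javascript
// Validate the request body and reduce it to a minimal job payload.
// Throws on malformed input so nothing bad ever reaches the queue.
function toJobPayload(body) {
  if (!body || typeof body.documentId !== 'string' || body.documentId.length === 0) {
    throw new Error('documentId is required');
  }
  // Default to low priority (100) unless the caller passes an integer.
  const priority = Number.isInteger(body.priority) ? body.priority : 100;
  return {
    name: 'process-pdf',
    data: { documentId: body.documentId }, // minimal payload: an ID, not file bytes
    opts: { jobId: `doc_${body.documentId}`, priority },
  };
}

console.log(toJobPayload({ documentId: 'abc', priority: 1 }).opts);
// → { jobId: 'doc_abc', priority: 1 }
```

The route handler then becomes `const { name, data, opts } = toJobPayload(req.body); await documentQueue.add(name, data, opts);` with a `try/catch` returning `400` on validation failure.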
Retry Strategy: Exponential Backoff is Non-Negotiable
LLM APIs rate limit you. Networks fail. The naive "retry immediately" approach will hammer the API and fail repeatedly. You need exponential backoff.
BullMQ has this built-in. Configure it in the job options or the worker. For a rate limit error (e.g., HTTP 429), you want the job to retry with increasing delays.
```javascript
// In your Queue.add() options
const job = await documentQueue.add('process-pdf', data, {
  attempts: 5, // Try up to 5 times
  backoff: {
    type: 'exponential', // Retry n waits roughly delay * 2^(n-1) ms: 2s, 4s, 8s, ...
    delay: 2000, // Start with 2 seconds
  },
});
```

```javascript
// In your worker, you can also decide to fail faster on certain errors
import { Worker, UnrecoverableError } from 'bullmq';

const worker = new Worker('documentProcessing', async job => {
  try {
    return await callLLMAPI(job.data);
  } catch (error) {
    if (error.statusCode === 429) {
      // Exponential backoff will handle this via 'attempts'
      throw error;
    }
    if (error.statusCode === 400) {
      // Bad request, no point retrying: skip remaining attempts
      throw new UnrecoverableError('Invalid input');
    }
    throw error;
  }
}, { connection });
```
This pattern ensures your system gracefully handles temporary outages without creating a retry storm.
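It is worth sanity-checking the schedule this configuration produces. Assuming the documented exponential strategy of roughly `delay * 2^(n-1)` ms before retry n (exact rounding may differ in the library), a quick calculation:

```javascript
// Delays before each retry under exponential backoff: delay * 2^(n-1) ms.
// The first attempt runs immediately, so 'attempts: 5' means 4 retries.
function backoffSchedule(attempts, delayMs) {
  const delays = [];
  for (let n = 1; n < attempts; n++) {
    delays.push(delayMs * 2 ** (n - 1));
  }
  return delays;
}

console.log(backoffSchedule(5, 2000));
// → [ 2000, 4000, 8000, 16000 ] (about 30s of total backoff before final failure)
```

That 30-second horizon is a good fit for transient rate limits; for multi-hour provider outages, the dead letter queue below takes over.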
The Dead Letter Queue: Where Failed Jobs Go for Inspection
Some jobs will fail all their retries. Maybe the input data is permanently corrupted, or an external service is down for hours. You must not silently discard them. Enter the Dead Letter Queue (DLQ).
A DLQ is just another BullMQ queue that holds these permanently failed jobs for manual review or automated alerting.
```javascript
// Create a DLQ
const deadLetterQueue = new Queue('documentProcessing:dead', { connection });

// In your worker setup, listen for the 'failed' event
worker.on('failed', async (job, err) => {
  if (job && job.attemptsMade >= job.opts.attempts) {
    // Final failure: no retries left
    console.error(`Job ${job.id} failed permanently:`, err.message);
    // Move it to the DLQ
    await deadLetterQueue.add('dead-letter', {
      originalJobId: job.id,
      data: job.data,
      error: err.message,
      failedAt: new Date().toISOString(),
    });
    // Maybe send a PagerDuty/Slack alert here
  }
});
```
You can then use a dashboard like Bull Board (or RedisInsight) to inspect DLQ jobs and decide whether to requeue them with fixes or delete them.
Worker Concurrency: Don’t Trigger Rate Limits Yourself
The concurrency setting on your worker is a critical dial. Set it too high, and you’ll overwhelm your own infrastructure or hit provider rate limits instantly. Set it too low, and you’re leaving throughput on the table.
You need to calculate it:
- LLM Provider Limits: If OpenAI allows 10,000 TPM (tokens per minute) and your average job uses 1,000 tokens, you can theoretically process 10 jobs per minute per API key.
- Average Job Duration: If one job takes ~30 seconds, a single worker process can do 2 jobs/minute.
- The Math: To achieve 10 jobs/minute, you need a concurrency of `10 / 2 = 5`.
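That arithmetic is worth encoding so it gets recomputed whenever your limits or job profile change. A sketch with the numbers from the example above (the function name and parameters are illustrative):

```javascript
// Derive a worker concurrency setting from provider limits and job duration.
function suggestedConcurrency({ tokensPerMinuteLimit, avgTokensPerJob, avgJobSeconds }) {
  // How many jobs/minute the API allows at all
  const maxJobsPerMinute = Math.floor(tokensPerMinuteLimit / avgTokensPerJob);
  // How many jobs/minute one concurrent slot can complete
  const jobsPerMinutePerSlot = 60 / avgJobSeconds;
  return Math.max(1, Math.floor(maxJobsPerMinute / jobsPerMinutePerSlot));
}

console.log(suggestedConcurrency({
  tokensPerMinuteLimit: 10000, // provider TPM limit
  avgTokensPerJob: 1000,
  avgJobSeconds: 30,
})); // → 5
```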
Benchmark: BullMQ itself can process on the order of 50,000 jobs/min on a single Redis instance with 8 workers. Your bottleneck will almost always be the external API, not Redis or BullMQ.
| Concurrency | Avg Job Duration | Theoretical Max Jobs/Min | Risk |
|---|---|---|---|
| 1 | 30s | 2 | Low, but slow. |
| 5 | 30s | 10 | Matches our example API limit. Optimal. |
| 20 | 30s | 40 | Will immediately hit rate limits and cause retries. Dangerous. |
Start low, measure your actual throughput and error rates, and increase concurrency cautiously.
Monitoring: Bull Board and Prometheus
If you can’t see it, it’s broken. The open-source Bull Board dashboard gives you a real-time view of all your queues, jobs, and workers. It’s essential for development and debugging.
For production, you need metrics. Instrument your workers to emit metrics for Prometheus:
- Job duration histogram
- Jobs completed/failed counters (by job name)
- Queue length gauge
```javascript
import client from 'prom-client';

const jobDuration = new client.Histogram({
  name: 'bullmq_job_duration_seconds',
  help: 'Duration of BullMQ jobs in seconds',
  labelNames: ['queue', 'job_name'],
  buckets: [0.1, 1, 5, 10, 30, 60],
});

worker.on('completed', (job) => {
  // BullMQ stamps processedOn/finishedOn on every job, so no manual timing needed
  const duration = (job.finishedOn - job.processedOn) / 1000;
  jobDuration.observe({ queue: job.queueName, job_name: job.name }, duration);
});
```
This lets you set alerts on queue backlog growth or a spike in the failure rate.
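As an example of what such an alert could look like, here is a Prometheus alerting rule sketch. It assumes you also export `bullmq_jobs_completed_total` and `bullmq_jobs_failed_total` counters (as the bullet list suggests); the threshold and durations are placeholders to tune.

```yaml
groups:
  - name: bullmq
    rules:
      - alert: BullMQHighFailureRate
        # Fires if more than 10% of finished jobs failed over the last 5 minutes.
        expr: |
          rate(bullmq_jobs_failed_total[5m])
            / (rate(bullmq_jobs_failed_total[5m]) + rate(bullmq_jobs_completed_total[5m]))
            > 0.1
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "BullMQ job failure rate above 10%"
```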
Real Errors You Will Hit (and How to Fix Them)
`WRONGTYPE Operation against a key holding the wrong kind of value`
- Cause: BullMQ expects certain keys to be specific data types (e.g., a list). If you previously used the same key name for a string (like a cache), Redis will block the operation.
- Fix: Clean up during development with `redis-cli --scan --pattern 'bull:yourqueuename:*' | xargs redis-cli del` (plain `DEL` does not accept glob patterns), or rigorously namespace your keys. BullMQ’s keys are predictable (`bull:yourqueuename:wait`, `bull:yourqueuename:meta`).

`OOM command not allowed when used memory > 'maxmemory'`
- Cause: Your Redis instance is full. By default, Redis uses `noeviction`, which fails writes when full. This will stop your queue dead.
- Fix: For a queue, `noeviction` is correct; you don’t want jobs evicted from memory. You must provision enough memory and monitor `used_memory`. Set `maxmemory` in `redis.conf` to a safe value (e.g., 70% of system RAM) and ensure you have alerting.
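Those fixes boil down to a few lines of `redis.conf` for a dedicated queue instance. A sketch; the memory cap is illustrative, not a recommendation for your hardware:

```
# Dedicated queue instance: cap memory below system RAM and never evict jobs.
maxmemory 4gb
maxmemory-policy noeviction

# Durability: append-only file, fsync once per second (at most ~1s of loss on crash).
appendonly yes
appendfsync everysec
```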
Next Steps: From Working Queue to Production System
You now have a robust queue. To harden it:
- Use a Managed Connection: Use `ioredis` with automatic reconnection and Redis Sentinel/Cluster support. A blocking `BLPOP` (or BullMQ equivalent) with a timeout longer than your client's TCP timeout can cause silent drops. Configure timeouts carefully.
- Separate Redis Instances: Don’t run your queue on the same Redis instance as your application cache. A cache stampede causing memory pressure (`OOM`) will take your queue down. Use a dedicated Redis instance or database index.
- Plan for Redis Persistence: Understand the trade-offs. RDB (snapshots) is faster for restarting a large queue. AOF (append-only file) is more durable, ensuring no jobs are lost on a crash. For a critical job queue, use `appendfsync everysec` at a minimum.
- Scale Horizontally: Add more worker processes, even on different machines. They all connect to the same Redis instance. For massive scale, look at Redis Cluster (automatic sharding). If you see `CLUSTERDOWN Hash slot not served`, use `redis-cli --cluster fix <host>:<port>` to recover from a node failure.
Your job queue is now a reliable, observable, and scalable system. The 504 errors are gone. Your GPU can focus on what it does best—crunching matrices—while Redis and BullMQ handle the messy business of coordination and resilience. Move fast, but don’t break things silently.