# FastAPI Background Tasks vs Celery: TL;DR
| | FastAPI Background Tasks | Celery |
|---|---|---|
| Setup time | ~5 min | ~30 min |
| Broker required | ❌ | ✅ Redis / RabbitMQ |
| Task retries | Manual | Built-in |
| Task queue visibility | ❌ | ✅ Flower, custom |
| Worker scaling | Same process | Independent workers |
| Persistent task state | ❌ | ✅ |
| Best for AI use | Short inference calls (<5s) | Long GPU jobs, RAG pipelines |
Choose FastAPI Background Tasks if: you're firing short post-request tasks — sending a webhook, logging an inference call, or warming a cache — and don't want broker infrastructure.
Choose Celery if: your AI workload runs longer than a few seconds, needs retries on failure, or you want to scale GPU workers independently from your API.
## What We're Comparing
In 2026, most AI backends are FastAPI services. The moment you need to run something after the HTTP response — an LLM call, an embedding job, a RAG pipeline — you hit a fork: use FastAPI's built-in BackgroundTasks or reach for Celery. The wrong choice either over-engineers a simple problem or quietly drops tasks under load.
## FastAPI Background Tasks Overview
FastAPI's BackgroundTasks runs a function in the same process after the response is sent. No broker, no worker, no config.
```python
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()

def run_embedding(text: str):
    # Runs after response is returned to the client
    vector = embed_model.encode(text)
    db.store(vector)

@app.post("/ingest")
async def ingest(text: str, background_tasks: BackgroundTasks):
    background_tasks.add_task(run_embedding, text)
    return {"status": "accepted"}
```
The task runs in the same event loop thread (for async def) or in a thread pool (for regular def). Either way, it shares memory with the API process.
Pros:
- Zero infrastructure — no Redis, no broker, no worker process
- Shares in-memory state with the API (model loaded once, reused)
- Dead simple to reason about; easy to test
Cons:
- If the API process crashes mid-task, the task is gone — no persistence
- No retry logic; failed tasks fail silently unless you add it yourself
- Scales only as far as the API process scales — one bottleneck for everything
- Long-running GPU tasks block the thread pool and degrade API latency
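The missing retry logic can be approximated in-process. A minimal sketch of a hypothetical `with_retries` decorator (the name and defaults are ours, not FastAPI's) that wraps any function before handing it to `add_task`:

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)

def with_retries(max_retries: int = 3, delay: float = 2.0):
    """Retry a background task function up to max_retries times."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    # Log with the traceback so failures aren't silent
                    logger.exception("task %s failed (attempt %d/%d)",
                                     fn.__name__, attempt, max_retries)
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

@with_retries(max_retries=3, delay=2.0)
def run_embedding(text: str):
    vector = embed_model.encode(text)  # embed_model assumed loaded at startup
    db.store(vector)
```

Note the limits: retries still die with the process, and `time.sleep` between attempts occupies a thread-pool slot for the duration.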
## Celery Overview
Celery is a distributed task queue. Your FastAPI app enqueues a task; a separate worker process picks it up from a broker (Redis or RabbitMQ) and executes it. Results can be stored in a backend (Redis, PostgreSQL).
```python
# tasks.py
from celery import Celery

celery_app = Celery(
    "worker",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@celery_app.task(bind=True, max_retries=3, default_retry_delay=10)
def run_llm_pipeline(self, prompt: str, user_id: str):
    try:
        result = llm_chain.invoke(prompt)
        db.store(user_id, result)
    except Exception as exc:
        raise self.retry(exc=exc)
```

```python
# main.py
from fastapi import FastAPI
from tasks import run_llm_pipeline

app = FastAPI()

@app.post("/generate")
async def generate(prompt: str, user_id: str):
    task = run_llm_pipeline.delay(prompt, user_id)
    return {"task_id": task.id}

@app.get("/result/{task_id}")
async def result(task_id: str):
    task = run_llm_pipeline.AsyncResult(task_id)
    return {"status": task.status, "result": task.result}
```
Pros:
- Tasks persist in the broker — API crash doesn't lose work
- Built-in retries, exponential backoff, dead-letter queues
- Workers scale independently: 1 API pod, 8 GPU workers is a valid topology
- Visibility via Flower or custom dashboards
- Priority queues: route fast embedding tasks and slow 70B generation to different workers
Cons:
- Requires Redis or RabbitMQ — more infrastructure, more failure modes
- Serialization overhead: all task arguments pass through the broker (don't pass model objects)
- Debugging distributed tasks is harder than debugging in-process code
- Cold-start latency: worker must load the model before processing the first task
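The cold-start cost can at least be paid once per worker process rather than once per task. One sketch (`load_once` is our own hypothetical helper; sentence-transformers assumed installed) memoizes the loader so the model loads on first use and is reused afterwards:

```python
import functools

def load_once(loader):
    """Run an expensive loader at most once per process; later calls hit the cache."""
    return functools.lru_cache(maxsize=1)(loader)

@load_once
def get_embedder():
    # Import inside the function so processes that never embed don't pay for it
    from sentence_transformers import SentenceTransformer  # assumed installed
    return SentenceTransformer("all-MiniLM-L6-v2")

# Inside a task body: vec = get_embedder().encode(text)
# To warm the cache at worker boot instead of on the first task, call
# get_embedder() from a celery.signals.worker_process_init handler.
```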
## Head-to-Head: AI Workload Scenarios

### Short Inference: Embedding on Ingest
You receive a document, want to embed it and store it, then return a 200 immediately.
FastAPI BackgroundTasks wins here. The embedding model is already loaded in the API process. No serialization, no broker round-trip. P99 task latency is under 500ms for most embedding models on CPU.
```python
# Model loaded once at startup — BackgroundTasks reuses it for free
@app.on_event("startup")
async def load_models():
    app.state.embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_and_store(text: str):
    vec = app.state.embedder.encode(text)
    vector_db.upsert(vec)

@app.post("/ingest")
async def ingest(text: str, bg: BackgroundTasks):
    bg.add_task(embed_and_store, text)
    return {"ok": True}
```
With Celery, the worker also loads the model — but you pay for broker serialization and worker dispatch on every request. For high-throughput embedding, that overhead adds up.
### Long Generation: 70B LLM Inference
A user submits a prompt that takes 20–60 seconds to generate. You can't hold the HTTP connection open that long in most production setups.
Celery wins here. You return a task_id immediately, the worker runs inference on a GPU, and the client polls for results.
```python
@celery_app.task(bind=True, queue="gpu", max_retries=2)
def generate_response(self, prompt: str, user_id: str):
    # This runs on a dedicated GPU worker — API is unaffected
    output = ollama.generate(model="llama3.3:70b-instruct-q4_0", prompt=prompt)
    # Cache the response text (Redis can't store a dict without serialization)
    cache.set(f"result:{user_id}", output["response"], ex=3600)
    return output["response"]
```
With FastAPI BackgroundTasks, a slow LLM call runs in the thread pool. Under concurrent load, you'll exhaust the thread pool and start degrading API response times — or worse, hitting timeouts while the task is mid-generation.
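On the client side, the task_id-plus-polling pattern needs a loop with a deadline. A generic sketch (`poll_until` is our own helper; the HTTP client is left out):

```python
import time

def poll_until(fetch, is_done, timeout: float = 120.0, interval: float = 2.0):
    """Call fetch() every `interval` seconds until is_done(result) is true,
    or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if is_done(result):
            return result
        time.sleep(interval)
    raise TimeoutError("task did not finish before the deadline")

# Hypothetical usage against the /result endpoint above:
# done = poll_until(
#     fetch=lambda: httpx.get(f"{API}/result/{task_id}").json(),
#     is_done=lambda r: r["status"] in ("SUCCESS", "FAILURE"),
# )
```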
### RAG Pipeline with Retrieval + Reranking
A typical RAG flow: retrieve from vector DB → rerank → generate → return. This might take 5–15 seconds and involves three external calls, each of which can fail independently.
Celery wins here, specifically because of retries. If the vector DB times out on step 1, Celery retries the whole task. BackgroundTasks has no retry mechanism — you'd need to build it yourself.
```python
@celery_app.task(bind=True, max_retries=3, default_retry_delay=5)
def rag_pipeline(self, query: str, user_id: str):
    try:
        docs = vector_db.similarity_search(query, k=10)   # Can fail
        ranked = reranker.rerank(query, docs)[:3]         # Can fail
        answer = llm.invoke(build_prompt(query, ranked))  # Can fail
        db.store_answer(user_id, answer)
    except Exception as exc:
        raise self.retry(exc=exc)
```
### Burst Traffic: 100 Requests in 10 Seconds
Both approaches handle bursts differently. BackgroundTasks queues work in the thread pool — if 100 tasks arrive faster than they complete, thread pool exhaustion degrades everything in the process.
Celery queues tasks in Redis. Workers drain the queue at their own pace. Your API stays responsive. You can add workers dynamically (Kubernetes HPA on queue depth) without touching API replicas.
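With the Redis broker, pending tasks for a queue sit by default in a Redis list named after the queue, so queue depth is just `LLEN gpu`. A sketch of a scaling policy you might feed to an autoscaler (the helper name and thresholds are ours, not Celery's):

```python
import math

def desired_workers(queue_depth: int, tasks_per_worker: int = 10,
                    min_workers: int = 1, max_workers: int = 8) -> int:
    """Map queue depth to a worker replica count, clamped to [min, max]."""
    return max(min_workers,
               min(max_workers, math.ceil(queue_depth / tasks_per_worker)))

# Feeding it from Redis (sketch):
# depth = redis_client.llen("gpu")   # Celery's default Redis queue is a list
# scale_deployment("worker-gpu", desired_workers(depth))
```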
## Production Setup: Celery with FastAPI
Here's a minimal production-ready configuration for AI workloads.
docker-compose.yml:

```yaml
services:
  api:
    build: .
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
    environment:
      - REDIS_URL=redis://redis:6379/0
  worker-cpu:
    build: .
    command: celery -A tasks worker -Q cpu --concurrency 8 --loglevel info
    environment:
      - REDIS_URL=redis://redis:6379/0
  worker-gpu:
    build: .
    command: celery -A tasks worker -Q gpu --concurrency 1 --loglevel info
    environment:
      - REDIS_URL=redis://redis:6379/0
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  redis:
    image: redis:7-alpine
    # noeviction: under memory pressure, reject writes rather than silently
    # evicting queued tasks (Celery's docs recommend this for Redis brokers)
    command: redis-server --maxmemory 2gb --maxmemory-policy noeviction
  flower:
    image: mher/flower:2.0
    command: celery --broker=redis://redis:6379/0 flower
    ports:
      - "5555:5555"
```
tasks.py with routing:

```python
from celery import Celery

celery_app = Celery("ai_worker")
celery_app.config_from_object({
    "broker_url": "redis://redis:6379/0",
    "result_backend": "redis://redis:6379/1",
    "task_routes": {
        "tasks.embed_text": {"queue": "cpu"},
        "tasks.generate_response": {"queue": "gpu"},
        "tasks.rag_pipeline": {"queue": "gpu"},
    },
    # Prevent large model outputs from clogging the broker
    "result_expires": 3600,
    "task_serializer": "json",
    "result_serializer": "json",
})

@celery_app.task(queue="gpu", bind=True, max_retries=2, default_retry_delay=15)
def generate_response(self, prompt: str, task_id: str):
    try:
        result = ollama_client.generate(model="llama3.3:70b-instruct-q4_0", prompt=prompt)
        return result["response"]
    except Exception as exc:
        raise self.retry(exc=exc)

@celery_app.task(queue="cpu", bind=True, max_retries=3)
def embed_text(self, text: str, doc_id: str):
    try:
        vec = embedder.encode(text).tolist()
        vector_db.upsert(doc_id, vec)
    except Exception as exc:
        raise self.retry(exc=exc)
```
## When BackgroundTasks Is Enough
Don't over-engineer. FastAPI BackgroundTasks is the right call when:
- The task completes in under 3 seconds reliably
- Losing the task on crash is acceptable (audit logs, cache warming, analytics pings)
- You're in early development and want to ship fast
- You're running on a single server where broker overhead isn't worth it
A common pattern: start with BackgroundTasks, then migrate hot paths to Celery once you hit real load or need retries. The migration is straightforward — move the function body to a Celery task, change background_tasks.add_task(fn, args) to fn.delay(args).
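If you expect that migration, one way to keep call sites stable is a tiny dispatch shim (entirely hypothetical, ours): flip a flag to route a task to Celery instead of BackgroundTasks without touching the endpoints.

```python
from typing import Any

def dispatch(task: Any, *args: Any, use_celery: bool = False, bg=None) -> None:
    """Route work to Celery (task must expose .delay) or to FastAPI
    BackgroundTasks (bg must expose .add_task)."""
    if use_celery:
        task.delay(*args)          # Celery task object
    else:
        bg.add_task(task, *args)   # plain function + BackgroundTasks instance
```

Endpoints call `dispatch(embed_and_store, text, bg=bg)` today and pass `use_celery=True` once the Celery task exists; nothing else changes.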
## Which Should You Use?
Pick FastAPI BackgroundTasks when:
- Task duration is under 3–5 seconds
- You're building a prototype or low-traffic internal tool
- The task is stateless and losing it on crash is fine
- You're embedding short texts or warming caches post-request
Pick Celery when:
- You're running LLM inference longer than 5 seconds
- You need retries — RAG pipelines, external API calls, anything that can fail
- You want to scale GPU workers independently from API replicas
- You need task visibility: which tasks failed, how long they took, queue depth
Use both when: you want fast in-process post-processing (cache warm, analytics) on the API side, plus durable GPU jobs on the Celery side. There's no rule against running both in the same service.
## FAQ
Q: Can I use FastAPI BackgroundTasks with async LLM calls?
A: Yes, if your LLM client is async (e.g., AsyncOpenAI, async Ollama client). Define the task as async def and FastAPI runs it in the event loop. This avoids thread pool exhaustion but still means a crashed process loses in-flight tasks.
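A guard worth adding in that setup: wrap the awaited call in a timeout so one hung provider call can't hold an in-flight task forever. A runnable sketch, where fake_llm_call stands in for a real async client call:

```python
import asyncio

async def generate_with_timeout(prompt: str, timeout: float = 30.0) -> str:
    """Await an LLM call but bail out after `timeout` seconds."""
    async def fake_llm_call(p: str) -> str:
        await asyncio.sleep(0.01)  # simulate network + generation latency
        return f"echo: {p}"
    try:
        return await asyncio.wait_for(fake_llm_call(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        return "generation timed out"
```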
Q: What's the best Celery broker for AI workloads in 2026?
A: Redis is the default choice — low latency, easy to operate, handles typical AI task throughput. Use RabbitMQ if you need message durability guarantees (tasks survive broker restart) or complex routing. One caveat: avoid allkeys-lru eviction on the broker database, since under memory pressure it can silently evict queued tasks. Celery's docs recommend noeviction for Redis brokers; rely on result_expires to keep the result backend trimmed instead.
Q: Does Celery work with async FastAPI code?
A: Celery workers run synchronous code; Celery 5 has no first-class support for async def task functions, so the usual workaround is calling asyncio.run() inside the task body. Alternatively, use arq — a Redis-based async task queue built for Python's asyncio. It's lighter than Celery and pairs naturally with async FastAPI apps.
Q: How do I monitor Celery task failures in production?
A: Flower gives you a real-time UI (included in the docker-compose above). For alerting, integrate with your observability stack: emit task failure events to your logger with on_failure signal hooks, or use Sentry's Celery integration which auto-captures task exceptions with full stack traces.
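For the logging route, one pattern is to keep the formatting logic separate from the signal wiring so it stays testable; the handler itself is sketched in comments (task_failure is a real Celery signal, but the field names below are our choice):

```python
import json
import logging

logger = logging.getLogger("celery.failures")

def format_failure(task_name: str, task_id: str, exc: BaseException) -> str:
    """Build a structured, machine-parseable log line for a failed task."""
    return json.dumps({
        "event": "task_failure",
        "task": task_name,
        "task_id": task_id,
        "error": f"{type(exc).__name__}: {exc}",
    })

# Wiring it up (sketch):
# from celery.signals import task_failure
#
# @task_failure.connect
# def on_task_failure(sender=None, task_id=None, exception=None, **kwargs):
#     logger.error(format_failure(sender.name, task_id, exception))
```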
Tested with FastAPI 0.115, Celery 5.4, Redis 7.2, Python 3.12, Ubuntu 24.04