Problem: YOLOv11 Works in Scripts but Not in Production
You can run yolo predict on a video file. But serving real-time detections from a webcam through an HTTP API? That requires a frame pipeline, async inference, and proper streaming — none of which are obvious from the docs.
You'll learn:
- How to read webcam frames and run YOLOv11 inference in a background thread
- How to stream annotated frames as MJPEG via FastAPI
- How to expose a `/detect` JSON endpoint for structured detection results
- How to containerize it for production deployment
Time: 30 min | Level: Intermediate
Why This Happens
YOLOv11 (from Ultralytics) runs inference synchronously. A naive implementation blocks the web server while the model processes each frame. The result: dropped frames, high latency, and timeouts under load.
The fix is a producer-consumer pattern: a background thread grabs frames and runs inference continuously, while FastAPI serves the latest result to any connected client.
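In miniature, the pattern looks like this: a toy `LatestValue` class (hypothetical, not part of the project code), with a counter standing in for frame grabbing and inference.

```python
import threading


class LatestValue:
    """Toy version of the pattern: a background thread writes continuously,
    while readers always get the most recent value without waiting on the producer."""

    def __init__(self):
        self._latest = None
        self._lock = threading.Lock()
        self._running = False
        self._thread = None

    def start(self, produce):
        self._running = True

        def loop():
            while self._running:
                value = produce()  # stands in for "grab frame + run inference"
                with self._lock:
                    self._latest = value

        self._thread = threading.Thread(target=loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._running = False
        self._thread.join()

    def get(self):
        with self._lock:
            return self._latest
```

The real pipeline in Step 2 has the same shape: a daemon thread writes under a lock, and readers take the latest value instead of waiting for inference to finish.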
Common symptoms without this pattern:
- API hangs for 200–500ms per request
- Webcam stutters or freezes
- `asyncio` event loop blocked under concurrent requests
Solution
Step 1: Install Dependencies
```bash
pip install ultralytics fastapi uvicorn opencv-python-headless python-multipart
```
Use opencv-python-headless on servers — it skips GUI libraries that aren't available in containers.
Verify the install:
```bash
python -c "from ultralytics import YOLO; print(YOLO('yolo11n.pt').info())"
```
Expected: Model summary with layer counts. YOLOv11n downloads automatically (~6 MB).
If it fails:
- CUDA errors: Install `torch` separately first with your CUDA version from pytorch.org
- `libGL.so` missing: You're on headless Linux — make sure you installed `opencv-python-headless`, not `opencv-python`
Step 2: Build the Frame Pipeline
This is the core: a thread that reads webcam frames and runs inference in a loop, storing the latest result for FastAPI to serve.
```python
# pipeline.py
import threading
import time
from dataclasses import dataclass, field
from typing import Optional, Union

import cv2
import numpy as np
from ultralytics import YOLO


@dataclass
class DetectionResult:
    boxes: list  # [{label, confidence, x1, y1, x2, y2}]
    annotated_frame: Optional[np.ndarray] = None
    timestamp: float = field(default_factory=time.time)


class DetectionPipeline:
    def __init__(self, model_path: str = "yolo11n.pt",
                 camera_index: Union[int, str] = 0):
        # camera_index can be a device index or a video file path
        self.model = YOLO(model_path)
        self.cap = cv2.VideoCapture(camera_index)
        self._latest: Optional[DetectionResult] = None
        self._lock = threading.Lock()
        self._running = False
        self._thread: Optional[threading.Thread] = None

    def start(self):
        self._running = True
        # Run inference in a daemon thread — dies when the main process exits
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._running = False
        # Let the loop finish its current iteration before releasing the camera
        if self._thread is not None:
            self._thread.join(timeout=2)
        self.cap.release()

    def _loop(self):
        while self._running:
            ok, frame = self.cap.read()
            if not ok:
                time.sleep(0.01)
                continue
            # Run inference — returns a Results object
            results = self.model(frame, verbose=False)[0]
            boxes = []
            for box in results.boxes:
                label = self.model.names[int(box.cls)]
                conf = float(box.conf)
                x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
                boxes.append({
                    "label": label,
                    "confidence": round(conf, 3),
                    "x1": x1, "y1": y1, "x2": x2, "y2": y2,
                })
            # plot() draws bounding boxes on a copy of the frame
            annotated = results.plot()
            with self._lock:
                self._latest = DetectionResult(boxes=boxes, annotated_frame=annotated)

    def get_latest(self) -> Optional[DetectionResult]:
        with self._lock:
            return self._latest
```
The `threading.Lock` prevents a race condition between the inference thread writing `_latest` and FastAPI reading it.
Step 3: Build the FastAPI Application
```python
# main.py
import asyncio
import os
from contextlib import asynccontextmanager

import cv2
from fastapi import FastAPI
from fastapi.responses import JSONResponse, StreamingResponse

from pipeline import DetectionPipeline

# CAMERA_SOURCE can be a device index ("0") or a video file path
_source = os.environ.get("CAMERA_SOURCE", "0")
pipeline = DetectionPipeline(
    model_path="yolo11n.pt",
    camera_index=int(_source) if _source.isdigit() else _source,
)


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start the pipeline when the server starts, stop it on shutdown
    pipeline.start()
    yield
    pipeline.stop()


app = FastAPI(title="YOLOv11 Detection API", lifespan=lifespan)


@app.get("/detect")
async def detect():
    """Returns the latest detection results as JSON."""
    result = pipeline.get_latest()
    if result is None:
        return JSONResponse({"boxes": [], "message": "warming up"}, status_code=202)
    return {"boxes": result.boxes, "timestamp": result.timestamp}


@app.get("/stream")
async def stream():
    """Streams annotated frames as MJPEG."""
    async def generate():
        while True:
            result = pipeline.get_latest()
            if result is None or result.annotated_frame is None:
                await asyncio.sleep(0.033)  # ~30 fps cap
                continue
            # Encode frame as JPEG
            _, jpeg = cv2.imencode(".jpg", result.annotated_frame,
                                   [cv2.IMWRITE_JPEG_QUALITY, 80])
            frame_bytes = jpeg.tobytes()
            # MJPEG multipart boundary
            yield (
                b"--frame\r\n"
                b"Content-Type: image/jpeg\r\n\r\n" +
                frame_bytes +
                b"\r\n"
            )
            await asyncio.sleep(0.033)

    return StreamingResponse(
        generate(),
        media_type="multipart/x-mixed-replace; boundary=frame",
    )


@app.get("/health")
async def health():
    result = pipeline.get_latest()
    return {"status": "ok", "has_frame": result is not None}
```
`multipart/x-mixed-replace` is the de facto MJPEG streaming format — browsers render it as live video using a plain `<img src="/stream">` tag.
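On the wire, the stream is just JPEG payloads separated by the `--frame` boundary. As a sketch, a hypothetical `split_mjpeg` helper shows how a non-browser client could recover the frames from raw bytes:

```python
def split_mjpeg(buffer: bytes, boundary: bytes = b"--frame") -> list[bytes]:
    """Split a multipart/x-mixed-replace byte buffer into JPEG payloads.

    Each part looks like: --frame\r\nContent-Type: image/jpeg\r\n\r\n<jpeg>\r\n
    """
    frames = []
    for part in buffer.split(boundary):
        # Headers end at the first blank line; skip parts without one
        header_end = part.find(b"\r\n\r\n")
        if header_end == -1:
            continue
        payload = part[header_end + 4:].rstrip(b"\r\n")
        if payload:
            frames.append(payload)
    return frames
```

In practice a client would feed this from a rolling buffer of HTTP response chunks, since the stream never ends on its own.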
Step 4: Run It
```bash
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1
```
Why --workers 1: The pipeline holds a camera handle and model in memory. Multiple workers would fight over the webcam device. Use a single worker and let the async event loop handle concurrent HTTP clients.
Open your browser:
- http://localhost:8000/stream — live annotated video
- http://localhost:8000/detect — JSON detections
- http://localhost:8000/docs — auto-generated Swagger UI
If the stream is black or returns 202:
- Wait 1–2 seconds for the model to warm up on the first frame
- Check http://localhost:8000/health — `has_frame: true` means inference is running
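If you script this check, the warmup wait can be captured in a small polling helper (hypothetical; the HTTP call is injected as a callable so the logic is easy to test):

```python
import time


def wait_until_ready(fetch_health, timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll a health endpoint until it reports a frame, or give up.

    fetch_health: zero-arg callable returning a dict like
    {"status": "ok", "has_frame": bool}
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if fetch_health().get("has_frame"):
                return True
        except Exception:
            pass  # server may not be accepting connections yet
        time.sleep(interval)
    return False
```

With `requests` installed, `fetch_health` could be `lambda: requests.get("http://localhost:8000/health").json()`.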
Step 5: Containerize for Production
```dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install system deps for OpenCV headless
RUN apt-get update && apt-get install -y libglib2.0-0 && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pre-download the model at build time — avoids runtime download in prod
RUN python -c "from ultralytics import YOLO; YOLO('yolo11n.pt')"

COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
```
```text
# requirements.txt
ultralytics==8.3.*
fastapi==0.115.*
uvicorn[standard]==0.30.*
opencv-python-headless==4.10.*
```
Build and run:
```bash
docker build -t yolo-api .

# Pass through the webcam device on Linux
docker run --device /dev/video0 -p 8000:8000 yolo-api

# On macOS with no /dev/video0, use a video file instead
docker run -p 8000:8000 -e CAMERA_SOURCE=/app/test.mp4 -v $(pwd)/test.mp4:/app/test.mp4 yolo-api
```
If you need GPU inference in Docker:
```bash
docker run --gpus all --device /dev/video0 -p 8000:8000 yolo-api
```
This requires the NVIDIA Container Toolkit installed on the host.
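Rather than hard-coding the device, the app can probe for CUDA at startup and fall back to CPU. A minimal sketch (the `pick_device` helper is hypothetical):

```python
def pick_device() -> str:
    """Return "cuda:0" when a CUDA-capable GPU is visible, else "cpu"."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda:0"
    except ImportError:
        pass  # torch not installed: CPU-only environment
    return "cpu"
```

Ultralytics' predict call accepts a `device` argument, so the pipeline's inference line could pass `device=pick_device()` to route inference accordingly.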
Verification
```bash
# Check JSON detections
curl http://localhost:8000/detect

# Check health
curl http://localhost:8000/health
```
You should see:
```json
{
  "boxes": [
    {"label": "person", "confidence": 0.891, "x1": 120, "y1": 45, "x2": 380, "y2": 620}
  ],
  "timestamp": 1741190400.123
}
```
Open http://localhost:8000/stream in a browser — you should see the live annotated feed with bounding boxes.
- Server ready — first inference usually takes 300–500 ms while the model warms up
- Live stream at /stream — bounding boxes update in real time
Switching Models
YOLOv11 ships in five sizes. Swap model_path in DetectionPipeline.__init__ based on your hardware:
| Model | Size | Speed (CPU) | mAP |
|---|---|---|---|
| yolo11n.pt | 6 MB | ~40ms/frame | 39.5 |
| yolo11s.pt | 19 MB | ~80ms/frame | 47.0 |
| yolo11m.pt | 68 MB | ~180ms/frame | 51.5 |
| yolo11l.pt | 87 MB | ~250ms/frame | 53.4 |
| yolo11x.pt | 109 MB | ~400ms/frame | 54.7 |
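These timings vary widely by CPU, so it's worth measuring on your own hardware. A generic timing helper (hypothetical; call it with your loaded model and a captured frame):

```python
import time


def avg_ms(fn, *args, warmup: int = 3, runs: int = 20) -> float:
    """Average wall-clock milliseconds per call, after a few warmup runs."""
    for _ in range(warmup):
        fn(*args)  # warmup absorbs one-time costs like lazy model init
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) * 1000 / runs
```

For example, `avg_ms(model, frame)` after `model = YOLO("yolo11n.pt")` and a `cap.read()` gives your real per-frame latency in milliseconds.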
For edge deployments, use yolo11n.pt and export to ONNX or TensorRT for an additional 2–4x speedup:
```python
from ultralytics import YOLO

from pipeline import DetectionPipeline

# Export once, then load the ONNX file
model = YOLO("yolo11n.pt")
model.export(format="onnx")

# Use the exported model
pipeline = DetectionPipeline(model_path="yolo11n.onnx")
```
What You Learned
- The producer-consumer thread pattern decouples inference from HTTP serving — this is the key to non-blocking real-time detection
- MJPEG streaming works in any browser with a plain `<img>` tag, no WebSocket or WebRTC needed
- Use `--workers 1` with Uvicorn when your app holds singleton hardware resources like a camera
- Pre-download models at Docker build time to avoid cold-start delays in production
Limitation: This setup handles one camera per process. For multi-camera deployments, run separate processes per camera and add a gateway (nginx, Traefik) in front.
When NOT to use this pattern: If you don't need live video, use model.predict(image_path) in a stateless endpoint instead — simpler and easier to scale horizontally.
Tested on YOLOv11 (Ultralytics 8.3.x), FastAPI 0.115, Python 3.11, Ubuntu 24.04 and macOS 15