Your ML model serves 200 req/s on one GPU. With Ray Serve dynamic batching, the same GPU handles 2,000 req/s. That’s not magic—it’s just your GPU finally working at capacity instead of sipping power while processing one sad request at a time. You’ve containerized your model, maybe even deployed it with a simple Flask app, and hit the throughput wall. Your MLOps stack tracks experiments with MLflow (those 17M+ monthly downloads don’t lie), versions data with DVC, and then… your production inference endpoint crumbles under real traffic. This is where model serving becomes an engineering discipline, not an afterthought.
We’re moving past the basics. This guide is for when you need to serve models at scale, share expensive GPUs, and update them without waking up at 3 a.m. We’ll use Ray Serve—not because it’s the only option, but because it turns the complex problem of high-performance serving into a Python-native API.
Ray Serve Architecture: It’s Just a Distributed Python Queue
Forget the complex diagrams. At its core, Ray Serve runs your Python class (a “deployment”) on one or more Ray actors (“replicas”). An ingress router receives HTTP/gRPC requests and sends them to available replicas. The key is that these replicas are just Python processes distributed across your cluster (which can be your single beefy machine). You’re not configuring YAML for a separate serving engine; you’re writing Python that gets scheduled and load-balanced by Ray.
Think of it as a concurrent asyncio application that can span multiple machines. A replica is a single instance of your deployment class. If you set num_replicas=4, you have four copies of your model loaded in memory (or VRAM), sharing the request load. The router uses a simple round-robin policy by default, but the real power comes from what happens inside the replica.
Dynamic Batching: Making Your GPU Sweat
Your GPU’s compute cores are massively parallel. Feeding them a single inference request at a time is like using a cargo ship to deliver one envelope. Dynamic batching is the solution: holding incoming requests for a short time to group them into a single batch, which your model (PyTorch, TensorFlow, JAX) can process far more efficiently.
Without batching, each request incurs the full overhead of GPU kernel launches and memory transfers. With batching, that overhead is amortized. The throughput gains are not linear—they’re exponential at low batch sizes—because you’re finally saturating the hardware.
In Ray Serve, you enable this with the @serve.deployment decorator. Here’s a concrete example using a PyTorch model, integrated with your MLflow model registry for loading.
import torch
import torch.nn.functional as F
from ray import serve
from typing import List
import mlflow.pyfunc
@serve.deployment(
ray_actor_options={"num_gpus": 1},
autoscaling_config={"min_replicas": 1, "max_replicas": 4},
max_ongoing_requests=100, # Backpressure control
)
class MLflowTorchDeployment:
def __init__(self, model_uri: str):
# Load model directly from MLflow registry
self.model = mlflow.pyfunc.load_model(model_uri).unwrap_python_model()
self.model.eval()
# Warm-up with a dummy batch
with torch.no_grad():
_ = self.model(torch.randn(1, 3, 224, 224).cuda())
@serve.batch(max_batch_size=32, batch_wait_timeout_s=0.1)
async def handle_batch(self, inputs: List[torch.Tensor]):
with torch.no_grad():
batch = torch.stack(inputs).cuda()
predictions = self.model(batch)
return F.softmax(predictions, dim=1).cpu().tolist()
async def __call__(self, request) -> List:
image_tensor = await request.json() # Assume preprocessed tensor
tensor = torch.tensor(image_tensor)
return await self.handle_batch(tensor)
The @serve.batch decorator is the key. It automatically queues up to max_batch_size requests (32 here) for up to batch_wait_timeout_s seconds (0.1s), then passes the list of inputs to your method. The 0.1-second timeout is critical: it’s the trade-off between latency and throughput. For high-throughput, offline processing, you might increase this; for user-facing APIs, you’ll keep it aggressive.
Real Error & Fix:
You’ll likely hit this from MLflow: Run not found or already finished. This happens when you try to log metrics after a run has closed.
Fix: Always use mlflow.start_run() as a context manager. Never call mlflow.end_run() manually.
with mlflow.start_run():
mlflow.log_metric("accuracy", 0.95)
# The run ends automatically here
GPU Sharing: Serving Three Models on One A100 with Fractional GPU
Your A100 80GB has ~100 teraflops of compute. One 7B-parameter model might use 15GB of VRAM and only 20% utilization. You can serve multiple models on the same GPU concurrently using fractional GPU allocation in Ray. This isn’t time-slicing—it’s true concurrent execution if the models are small enough to fit in memory together.
You specify fractional GPUs in ray_actor_options. For example, {"num_gpus": 0.5} allocates half the memory and a proportional share of compute streams. Here’s how you deploy three different models from your registry onto one physical GPU.
from ray import serve
import mlflow
# Assume these URIs are from your MLflow Model Registry
model_uris = {
"resnet": "models:/ResNet50/Production",
"bert": "models:/BERT-Sentiment/Production",
"object_detector": "models:/YOLOv8/Staging"
}
deployments = []
for name, uri in model_uris.items():
@serve.deployment(
name=name,
route_prefix=f"/{name}",
ray_actor_options={"num_gpus": 0.33}, # Share one GPU three ways
max_ongoing_requests=50,
)
class SingleModelDeployment:
def __init__(self):
self.model = mlflow.pyfunc.load_model(uri)
async def __call__(self, request):
data = await request.json()
return self.model.predict(data)
deployments.append(SingleModelDeployment)
# Bind and deploy all
app = deployments[0].bind()
for dep in deployments[1:]:
app = app.options(name=dep.name, route_prefix=dep.route_prefix).bind()
serve.run(app, name="multi_model_app")
Now, hit /resnet, /bert, and /object_detector on the same endpoint. The models coexist in VRAM. Monitor utilization with nvidia-smi; you’ll see memory split and compute usage aggregate. This is how you go from 200 req/s per model to a combined 1500+ req/s on a single GPU.
Autoscaling Policies: Scale on Latency p99, Not Replica Count
Autoscaling based on CPU/RAM is useless for ML inference. Your bottleneck is GPU memory or compute, and your SLO is latency. Ray Serve’s autoscaling can scale based on the p99 (or p95) tail latency of requests.
You configure this in the autoscaling_config. The replica will scale up when the rolling average of p99 latency exceeds a threshold, indicating requests are queuing too long.
autoscaling_config={
"min_replicas": 1,
"max_replicas": 8,
"target_num_ongoing_requests_per_replica": 50, # Aim for this queue depth per replica
"metrics_interval_s": 10.0,
"look_back_period_s": 30.0,
"upscale_delay_s": 30, # Wait 30s before scaling up to avoid flapping
"downscale_delay_s": 300, # Wait 5 minutes before scaling down
}
The target_num_ongoing_requests_per_replica is the core metric. It’s the number of requests a replica is currently processing + queued. If this is consistently above your target (e.g., 50), Ray Serve adds replicas. This is far more responsive than CPU metrics because it directly measures load as perceived by the end user.
Zero-Downtime Model Updates: Blue-Green with Traffic Splitting
Promoting a new model version from Staging to Production in MLflow shouldn’t require downtime. Ray Serve supports blue-green deployments through its deployment graph API. You deploy the new version alongside the old, split a percentage of traffic to it, validate metrics, and then fully cut over.
First, in your MLflow registry, transition the old production model to Archived and promote the new one to Production. The registry URI models:/MyModel/Production will now point to the new version.
Real Error & Fix:
When promoting models, you may get: Model registry version conflict: model already in Production stage.
Fix: You must transition the old version to Archived before promoting the new version. Do this via the MLflow UI or API:
client.transition_model_version_stage(
name="MyModel",
version=3,
stage="Archived"
)
In Ray Serve, you can perform a canary rollout:
from ray.serve.handle import DeploymentHandle
# Existing deployment (v1)
v1_deployment = MLflowTorchDeployment.bind("models:/MyModel/Archived")
# New deployment (v2)
v2_deployment = MLflowTorchDeployment.bind("models:/MyModel/Production")
@serve.deployment(route_prefix="/canary")
class CanaryRouter:
def __init__(self, v1_handle: DeploymentHandle, v2_handle: DeploymentHandle):
self.v1 = v1_handle
self.v2 = v2_handle
self.traffic_split = 0.1 # 10% to v2 initially
async def __call__(self, request):
import random
if random.random() < self.traffic_split:
return await self.v2.remote(request)
else:
return await self.v1.remote(request)
# Deploy the graph
app = CanaryRouter.bind(v1_deployment, v2_deployment)
serve.run(app)
Monitor your Evidently AI dashboards for drift or performance degradation on the 10% traffic. If all looks good, gradually increase traffic_split to 1.0. Then, update the main deployment to point solely to v2 and remove the router. Zero downtime, full validation.
FastAPI Integration: Adding Preprocessing and Postprocessing Logic
Your model expects a cleaned tensor, but your API receives raw JSON. You need preprocessing (like image resizing) and postprocessing (like mapping logits to labels). Wrap your Ray Serve deployment in a FastAPI app for full OpenAPI docs and middleware support.
Ray Serve natively integrates with FastAPI. You define the app and mount the deployment.
from fastapi import FastAPI, UploadFile
from ray import serve
from PIL import Image
import io
import torchvision.transforms as transforms
import mlflow
app = FastAPI(title="Model Serving API")
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
@serve.ingress(app)
class FastAPIWrappedDeployment:
def __init__(self):
self.model = mlflow.pyfunc.load_model("models:/ResNet50/Production")
@app.post("/predict")
async def predict(self, file: UploadFile):
# 1. Preprocess
contents = await file.read()
image = Image.open(io.BytesIO(contents)).convert("RGB")
tensor = transform(image).unsqueeze(0) # Add batch dim
# 2. Inference (could call batched method here)
with torch.no_grad():
prediction = self.model(tensor.cuda())
# 3. Postprocess
predicted_class = prediction.argmax().item()
return {"class_id": predicted_class, "confidence": prediction.max().item()}
# Deploy
serve.run(FastAPIWrappedDeployment.bind(), route_prefix="/")
Now you have a full OpenAPI spec at /docs. The preprocessing happens on the CPU (FastAPI layer), and the batched inference happens on the GPU (Ray Serve deployment). This separation keeps your GPU busy only with tensor operations.
Monitoring: Ray Dashboard, Prometheus Metrics, and Latency SLOs
Deploying is half the battle. You must monitor. Ray Dashboard gives you a real-time view of replicas, requests per second, and errors. But for production, you need Prometheus metrics and SLOs.
Ray Serve exposes metrics on :9999/metrics by default. Key metrics:
serve_deployment_request_counter: Total requests.serve_deployment_error_counter: Errors.serve_deployment_processing_latency_ms: Processing latency histogram.serve_replica_processing_queries: Current queries per replica (for autoscaling).
Set up alerts in Grafana:
- Latency SLO: p99 latency > 500ms for 5 minutes.
- Error Budget: Error rate > 1% for 10 minutes.
- Throughput: Request rate drops by 50% (possible deployment issue).
Integrate Evidently AI for drift detection. Schedule a daily job that samples production request data, compares it to the training set reference, and generates a report. If drift is detected (recall: it causes 30% accuracy degradation within 6 months without monitoring), trigger a retraining pipeline.
import evidently
from evidently.report import Report
from evidently.metrics import DataDriftTable
# Assume `reference` and `current` are pandas DataFrames of input features
report = Report(metrics=[DataDriftTable()])
report.run(reference_data=reference, current_data=current)
if report.as_dict()["metrics"][0]["result"]["dataset_drift"]:
# Trigger CI/CD pipeline: retrain, register, canary deploy
print("Drift detected! Initiating retraining...")
Benchmark Reality Check: Ray Serve vs. The World
Let’s ground this in numbers. Here’s a comparison of serving frameworks for a ResNet-50 inference workload, using the provided benchmark data.
| Framework | Hardware | Throughput (req/s) | Avg Latency (ms) | Key Differentiator |
|---|---|---|---|---|
| Ray Serve (with dynamic batching) | 8-core CPU | 12,000 | 15 | Native Python, best-in-class autoscaling |
| TorchServe | 8-core CPU | 8,000 | 22 | Tight PyTorch integration, less flexible |
| Flask (single worker) | 8-core CPU | ~350 | 2800 | No batching, baseline for "what not to do" |
| Ray Serve (frac GPU) | 1x A100 (3 models) | 1,800* | 45 | Efficient GPU sharing, multi-model |
*Combined throughput for three models sharing the GPU.
The 12,000 vs. 8,000 req/s difference on CPU is due to Ray’s more efficient async core and batching implementation. For GPU, the advantage is in fractional sharing and simpler multi-model deployment.
Next Steps: From Serving to a Full MLOps Pipeline
You’ve now got a high-throughput, scalable serving layer. The next evolution is to connect it into a fully automated MLOps pipeline.
- CI/CD for Models: Use GitHub Actions. On a merge to
main, trigger a pipeline that: trains a model (logging to MLflow/W&B), registers it, runs Evidently drift tests against a holdout set, and if all pass, updates the Ray Serve deployment via a canary rollout. DVC ensures the training data version is pinned. - Feature Store for Real-Time: Integrate Feast. Move preprocessing out of your FastAPI app. Instead, have your deployment fetch real-time features from a Redis-backed Feast online store. This cuts serving latency and ensures consistency between training and inference.
- Continuous Training: Set up a Prefect or Kubeflow pipeline that runs weekly. It pulls the latest versioned data with DVC (
DVC pull 10GB dataset from S3: 45sthanks to differential sync), retrains, validates, and if performance improves, automatically initiates the blue-green deployment process you’ve built. - Unified Monitoring: Create a single Grafana dashboard that combines: Ray Serve performance metrics (latency, throughput), Evidently drift scores, MLflow model version history, and business metrics (e.g., conversion rate per model version). This is your single pane of glass.
The goal is to move from “serving a model” to “managing a continuous, self-improving prediction service.” Ray Serve is the robust, scalable engine that makes this possible without forcing you into a labyrinth of YAML and proprietary tools. Your GPU is no longer crying—it’s finally earning its keep.