Modal's serverless GPU compute lets you run training jobs, batch inference, and fine-tuning pipelines on A100s or H100s without provisioning a single VM. You write a Python function, decorate it, and push it; Modal handles the container build, GPU allocation, and teardown.
You'll learn:
- Set up Modal and define a GPU-backed function in Python 3.12
- Run a real inference workload using a Hugging Face model on an A100
- Schedule batch jobs and expose an autoscaling inference endpoint
- Understand Modal's pricing model so you only pay for what you use
Time: 20 min | Difficulty: Intermediate
Why Modal Solves the GPU Cold-Start Problem
Every ML team hits the same wall: you need a GPU for 40 minutes, but cloud VMs bill by the hour and take 3–5 minutes to start. Kubernetes clusters sit idle 70% of the time. Spot instances interrupt training runs at the worst moment.
Modal's architecture sidesteps this entirely. Functions scale from zero to hundreds of GPU containers in under 2 seconds. You pay per GPU-second — an A10G costs $0.000306/sec, an A100-80GB costs $0.000463/sec. A 30-minute training run on an A100 costs roughly $0.83.
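To make that arithmetic concrete, here is a back-of-the-envelope calculator using the per-second rate above. The $2.49/hr figure is an illustrative hourly-billed provider that rounds usage up to whole hours, not a Modal rate:

```python
import math

A100_PER_SEC = 0.000463   # Modal A100-80GB, billed per second
HOURLY_RATE = 2.49        # illustrative hourly-billed GPU, rounded up

def modal_cost(seconds: int) -> float:
    # Per-second billing: pay for exactly what you use
    return seconds * A100_PER_SEC

def hourly_cost(seconds: int) -> float:
    # Hourly billing: a 30-minute job still pays for a full hour
    return math.ceil(seconds / 3600) * HOURLY_RATE

job = 30 * 60  # the 30-minute training run from above
print(f"per-second billing: ${modal_cost(job):.2f}")   # $0.83
print(f"hourly billing:     ${hourly_cost(job):.2f}")  # $2.49
```

The gap widens further once you count the 3-5 minutes of billed boot time hourly VMs typically add.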
Symptoms of the old approach:
- Paying $2–4/hour for a GPU that's idle between experiments
- Jupyter notebooks that die mid-training when spot instances evict
- 10-minute cluster startup blocking iteration speed
- DevOps overhead managing CUDA versions and container registries
Modal's execution model: your decorated Python function → container snapshot → GPU allocation → auto-teardown
Prerequisites
- Python 3.12 (Modal's SDK is tested on 3.11–3.12)
- A Modal account — free tier includes $30 credits, no credit card required
- uv for dependency management (recommended) or pip
Solution
Step 1: Install Modal and Authenticate
# Install with uv (recommended — resolves deps 10–100x faster than pip)
uv add modal
# Or with pip (prefer a virtual environment over --break-system-packages)
pip install modal
# Authenticate — opens browser for token exchange
modal setup
Expected output:
Web authentication finished successfully!
Token stored at ~/.modal.toml
Modal tokens are scoped per workspace. If you're on a team, each member runs modal setup against the shared workspace slug.
If it fails:
- `modal: command not found` → run `uv run modal setup` or add `~/.local/bin` to `$PATH`
- `403 Forbidden` → token expired; run `modal token new`
Step 2: Define Your First GPU Function
Create inference.py. The @app.function decorator is where you declare GPU type, memory, timeout, and container image.
import modal

# Define the app — one app per project, multiple functions allowed
app = modal.App("ml-inference")

# Build a custom image with your exact deps — cached after first build
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "torch==2.2.2",
        "transformers==4.40.0",
        "accelerate==0.29.3",
    )
)

@app.function(
    image=image,
    gpu="A10G",    # A10G: 24GB VRAM, $0.000306/sec — good for 7B models
    timeout=300,   # Kill after 5 min — prevents runaway billing
    memory=32768,  # 32GB CPU RAM — headroom for tokenizer + model weights
)
def run_inference(prompt: str) -> str:
    from transformers import pipeline

    # Model loads inside the container — cached in Modal's image snapshot
    pipe = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.3",
        device_map="auto",  # Distributes across available GPU memory automatically
    )
    result = pipe(prompt, max_new_tokens=256, do_sample=False)
    return result[0]["generated_text"]

@app.local_entrypoint()
def main():
    response = run_inference.remote("Explain gradient checkpointing in 3 sentences.")
    print(response)
Run it:
modal run inference.py
Modal builds the image on first run (≈90 seconds). Subsequent runs reuse the cached snapshot and cold-start in under 2 seconds.
Expected output:
✓ Created objects.
✓ Running app...
Gradient checkpointing trades compute for memory by...
Step 3: Upgrade to A100 for Larger Models
Swap gpu="A10G" for gpu="A100-80GB" when loading 13B+ models or running batched inference at scale. The GPU class also accepts a count for multi-GPU jobs.
@app.function(
    image=image,
    gpu=modal.gpu.A100(memory=80, count=1),  # Explicit: 80GB variant, 1 card
    timeout=600,
    retries=2,  # Auto-retry on preemption — critical for long training runs
)
def run_large_inference(prompts: list[str]) -> list[str]:
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        device_map="auto",
        torch_dtype=torch.bfloat16,  # bfloat16 halves VRAM vs float32 with minimal accuracy impact
    )
    results = pipe(prompts, max_new_tokens=128, batch_size=8)
    return [r[0]["generated_text"] for r in results]
For multi-GPU training (e.g., DDP with 4×A100s):
gpu=modal.gpu.A100(memory=80, count=4)
Modal provisions all 4 cards on the same physical host — no network overhead between GPUs.
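What 4 cards on one host buys you in a DDP run is straightforward data sharding: each rank trains on an interleaved slice of the dataset. A pure-Python sketch of that slicing, the same scheme PyTorch's DistributedSampler uses (minus shuffling and padding):

```python
# Interleaved sharding: rank r of world_size w takes samples r, r+w, r+2w, ...
def shard(dataset: list, rank: int, world_size: int) -> list:
    return dataset[rank::world_size]

data = list(range(10))
shards = [shard(data, rank, 4) for rank in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Every sample lands on exactly one rank, so the four GPUs never duplicate work.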
Step 4: Run Batch Jobs with .map()
Modal's .map() runs your function across an input list in parallel, spawning one container per item up to your concurrency limit. This is the fastest way to run batch inference or dataset preprocessing.
@app.local_entrypoint()
def batch_main():
    prompts = [
        "Summarize the transformer architecture.",
        "What is LoRA fine-tuning?",
        "Explain RLHF in simple terms.",
        "What does KV cache do?",
    ]
    # .map() fans out — each prompt gets its own GPU container
    # Returns results in input order, not completion order
    results = list(run_inference.map(prompts))
    for prompt, result in zip(prompts, results):
        print(f"Q: {prompt}\nA: {result}\n")
modal run inference.py::batch_main
Parallelism is capped by the concurrency limit you set on the function and by your workspace's overall concurrency limit (default: 10 concurrent containers on the free tier, 100+ on Team).
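The input-order guarantee mirrors Python's own `Executor.map`. A local sketch of the same semantics: later inputs can finish first, yet results come back in submission order.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_double(x: int) -> int:
    time.sleep(0.05 * (4 - x))  # earlier inputs deliberately finish later
    return 2 * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in input order, buffering fast finishers as needed
    results = list(pool.map(slow_double, [0, 1, 2, 3]))

print(results)  # [0, 2, 4, 6]: input order, not completion order
```

This is why the `zip(prompts, results)` pairing in `batch_main` is safe.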
Step 5: Deploy a Persistent Inference Endpoint
For production serving, stack the @modal.web_endpoint decorator beneath @app.function, as shown below. Modal deploys the function as a scalable HTTPS endpoint: zero containers at rest, auto-scaling with demand.
from modal import web_endpoint

@app.function(
    image=image,
    gpu="A10G",
    timeout=120,
    keep_warm=1,  # Keep 1 container warm — eliminates cold-start for the first request
)
@web_endpoint(method="POST")
def serve(item: dict) -> dict:
    prompt = item.get("prompt", "")
    response = run_inference.local(prompt)  # .local() runs the function in this container
    return {"response": response}
Deploy:
modal deploy inference.py
Expected output:
✓ Created web endpoint: https://your-workspace--ml-inference-serve.modal.run
Test the live endpoint:
curl -X POST https://your-workspace--ml-inference-serve.modal.run \
-H "Content-Type: application/json" \
-d '{"prompt": "What is mixture of experts?"}'
keep_warm=1 costs roughly $0.022/hour on an A10G — about $16/month for an always-warm slot. Set keep_warm=0 to pay only on request.
Step 6: Schedule Training Jobs as Cron Functions
Modal supports cron-style scheduling via @app.function(schedule=...). Use this for nightly fine-tuning, daily embedding refresh, or model evaluation runs.
@app.function(
    image=image,
    gpu="A100-80GB",
    timeout=7200,  # 2-hour hard limit — prevents runaway training jobs
    schedule=modal.Cron("0 2 * * *"),  # 2 AM UTC daily
)
def nightly_finetune():
    import subprocess

    # Pull latest dataset from S3, run LoRA fine-tune, push checkpoint
    subprocess.run([
        "python", "train_lora.py",
        "--model", "mistralai/Mistral-7B-v0.3",
        "--data", "s3://your-bucket/dataset/latest.jsonl",
        "--output", "s3://your-bucket/checkpoints/",
        "--epochs", "1",
    ], check=True)  # check=True raises on non-zero exit — triggers Modal's retry logic
Deploy once — Modal runs it on schedule without any cron daemon or always-on server.
Verification
Check running and past executions:
# List active deployments
modal app list
# Tail logs from the deployed endpoint
modal app logs ml-inference
# Launch a detached run — returns immediately, keeps running server-side
modal run --detach inference.py
In the Modal dashboard at modal.com/apps, you'll see per-run GPU seconds, memory peaks, and cost breakdowns per function.
You should see your ml-inference app listed as deployed with the endpoint URL, last invocation timestamp, and cumulative cost.
Modal GPU Pricing vs Alternatives (USD, March 2026)
| Provider | A100 80GB | A10G 24GB | Cold Start | Min Billing |
|---|---|---|---|---|
| Modal | $1.67/hr ($0.000463/sec) | $1.10/hr | ~2 sec | Per second |
| RunPod (Spot) | $1.49/hr | $0.69/hr | 3–5 min | Per minute |
| Lambda Labs | $2.49/hr | N/A | 5–10 min | Per hour |
| AWS (p4d.24xlarge, per GPU) | $3.21/hr | N/A | 5–15 min | Per second |
| Google Colab Pro+ | N/A | Shared | Variable | Per month |
Modal's per-second billing and 2-second cold start make it cheapest for workloads under 45 minutes. For 8-hour training runs, RunPod spot or Lambda Labs becomes more competitive.
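That crossover can be sanity-checked with a rough break-even model. Assume the hourly provider bills from instance start and takes about 4 minutes to boot (both assumptions for illustration, not measured figures):

```python
MODAL_HR = 1.67    # Modal A100, per-second billing, no boot time billed
RUNPOD_HR = 1.49   # RunPod spot A100, from the table above
BOOT_MIN = 4       # assumed boot time, billed by the hourly provider

def modal_cost(minutes: float) -> float:
    return minutes / 60 * MODAL_HR

def runpod_cost(minutes: float) -> float:
    return (minutes + BOOT_MIN) / 60 * RUNPOD_HR  # boot time is billed too

# First job length (in minutes) at which RunPod spot undercuts Modal
breakeven = next(t for t in range(1, 600) if runpod_cost(t) < modal_cost(t))
print(breakeven)  # 34
```

Under these simplified assumptions the crossover lands near the half-hour mark; per-minute billing granularity and spot interruptions push the practical crossover further out, toward the 45-minute figure above.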
What You Learned
- Modal's `@app.function` decorator handles GPU provisioning, container builds, and teardown — no YAML, no Kubernetes
- `gpu="A10G"` covers 7B models; `gpu=modal.gpu.A100(memory=80)` handles 70B+ and multi-GPU training
- `.map()` parallelizes batch jobs across isolated containers — fastest path for bulk inference
- `@web_endpoint` with `keep_warm=1` gives you a production inference API at ~$16/month for warm capacity
- `retries=2` on long-running functions protects against spot preemption
Tested on Modal SDK 0.64, Python 3.12, transformers 4.40, PyTorch 2.2.2, Ubuntu 22.04 containers
FAQ
Q: Does Modal support multi-node training across multiple hosts?
A: Not natively — Modal's multi-GPU support is single-host (up to 8×A100s on one machine). For true multi-node DDP or FSDP across hosts, use RunPod clusters or AWS SageMaker.
Q: How do I access private Hugging Face models on Modal?
A: Store your HF token as a Modal secret: modal secret create huggingface HF_TOKEN=hf_xxx. Reference it with secrets=[modal.Secret.from_name("huggingface")] in @app.function.
Q: What is the maximum timeout for a Modal function?
A: 24 hours (timeout=86400). For longer jobs, checkpoint to S3 and restart — Modal doesn't guarantee container continuity beyond 24 hours.
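The checkpoint-and-restart pattern that answer describes can be sketched generically. A local JSON file stands in for the S3 object here; `train_state.json` and the epoch counter are illustrative names, not Modal APIs:

```python
import json
import os
import tempfile

# A local file stands in for the S3 checkpoint; each run resumes from it.
CKPT = os.path.join(tempfile.gettempdir(), "train_state.json")

def load_state() -> dict:
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0}

def save_state(state: dict) -> None:
    with open(CKPT, "w") as f:
        json.dump(state, f)

def run_epochs(total: int) -> int:
    state = load_state()
    for epoch in range(state["epoch"], total):
        # one epoch of real training would run here
        save_state({"epoch": epoch + 1})  # checkpoint after every epoch
    return load_state()["epoch"]

print(run_epochs(3))  # a fresh start runs all epochs; a restart resumes mid-way
```

Because each invocation reads the checkpoint before doing work, hitting the 24-hour timeout just means the next scheduled run picks up where the last one stopped.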
Q: Can Modal functions access AWS S3 or GCP storage?
A: Yes. Pass AWS credentials via modal.Secret.from_dotenv() or modal.Secret.from_name(). The container has full network egress, so boto3 and google-cloud-storage work without any special configuration.
Q: Does Modal work with PyTorch 2.x and CUDA 12?
A: Yes. Use modal.Image.debian_slim().pip_install("torch==2.2.2+cu121", index_url="https://download.pytorch.org/whl/cu121") to pin the exact CUDA 12.1 build.