Deploy ML Workloads on Modal Serverless GPU Compute 2026

Run Modal serverless GPU compute for ML workloads: training, inference, and batch jobs on A100s with Python 3.12. No cluster management. Starts at $0.000463/GPU-sec.

Modal serverless GPU compute for ML workloads lets you run training jobs, batch inference, and fine-tuning pipelines on A100s or H100s without provisioning a single VM. You write a Python function, decorate it, push it — Modal handles the container build, GPU allocation, and teardown.

You'll learn:

  • Set up Modal and define a GPU-backed function in Python 3.12
  • Run a real inference workload using a Hugging Face model on an A100
  • Schedule batch jobs and expose an autoscaling inference endpoint
  • Understand Modal's pricing model so you only pay for what you use

Time: 20 min | Difficulty: Intermediate


Why Modal Solves the GPU Cold-Start Problem

Every ML team hits the same wall: you need a GPU for 40 minutes, but cloud VMs bill by the hour and take 3–5 minutes to start. Kubernetes clusters sit idle 70% of the time. Spot instances interrupt training runs at the worst moment.

Modal's architecture sidesteps this entirely. Functions scale from zero to hundreds of GPU containers in under 2 seconds. You pay per GPU-second — an A10G costs $0.000306/sec, an A100-80GB costs $0.000463/sec. A 30-minute training run on an A100 costs roughly $0.83.
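Those per-second numbers are easy to sanity-check. The sketch below hard-codes the rates quoted above; the helper name is ours, not Modal's:

```python
# Sanity-check Modal's per-second GPU pricing (rates as quoted in this article)
A10G_PER_SEC = 0.000306       # $/GPU-second
A100_80GB_PER_SEC = 0.000463  # $/GPU-second

def run_cost(rate_per_sec: float, minutes: float) -> float:
    """Cost of a job billed per GPU-second, in dollars."""
    return rate_per_sec * minutes * 60

print(f"30-min A100 run: ${run_cost(A100_80GB_PER_SEC, 30):.2f}")  # ≈ $0.83
print(f"1-hour A10G run: ${run_cost(A10G_PER_SEC, 60):.2f}")       # ≈ $1.10
```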

Symptoms of the old approach:

  • Paying $2–4/hour for a GPU that's idle between experiments
  • Jupyter notebooks that die mid-training when spot instances evict
  • 10-minute cluster startup blocking iteration speed
  • DevOps overhead managing CUDA versions and container registries

[Figure: Modal serverless GPU ML workload architecture. Modal's execution model: your decorated Python function → container snapshot → GPU allocation → auto-teardown]


Prerequisites

  • Python 3.12 (Modal's SDK is tested on 3.11–3.12)
  • A Modal account — free tier includes $30 credits, no credit card required
  • uv for dependency management (recommended) or pip

Solution

Step 1: Install Modal and Authenticate

# Install with uv (recommended — resolves deps 10–100x faster than pip)
uv add modal

# Or with pip (add --break-system-packages only on distros with an externally managed Python)
pip install modal

# Authenticate — opens browser for token exchange
modal setup

Expected output:

Web authentication finished successfully!
Token stored at ~/.modal.toml

Modal tokens are scoped per workspace. If you're on a team, each member runs modal setup against the shared workspace slug.

If it fails:

  • modal: command not found → run uv run modal setup or add ~/.local/bin to $PATH
  • 403 Forbidden → token expired; run modal token new

Step 2: Define Your First GPU Function

Create inference.py. The @app.function decorator is where you declare GPU type, memory, timeout, and container image.

import modal

# Define the app — one app per project, multiple functions allowed
app = modal.App("ml-inference")

# Build a custom image with your exact deps — cached after first build
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "torch==2.2.2",
        "transformers==4.40.0",
        "accelerate==0.29.3",
    )
)

@app.function(
    image=image,
    gpu="A10G",           # A10G: 24GB VRAM, $0.000306/sec — good for 7B models
    timeout=300,          # Kill after 5 min — prevents runaway billing
    memory=32768,         # 32GB CPU RAM — headroom for tokenizer + model weights
)
def run_inference(prompt: str) -> str:
    from transformers import pipeline

    # Weights download inside the container on first use; persist them in a
    # modal.Volume (or bake them into the image) to skip re-downloading on cold starts
    pipe = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.3",
        device_map="auto",   # Distributes across available GPU memory automatically
    )
    result = pipe(prompt, max_new_tokens=256, do_sample=False)
    return result[0]["generated_text"]


@app.local_entrypoint()
def main():
    response = run_inference.remote("Explain gradient checkpointing in 3 sentences.")
    print(response)

Run it:

modal run inference.py

Modal builds the image on first run (≈90 seconds). Subsequent runs reuse the cached snapshot and cold-start in under 2 seconds.

Expected output:

✓ Created objects.
✓ Running app...
Gradient checkpointing trades compute for memory by...
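One caveat: the image snapshot caches your pip dependencies, not runtime downloads, so the Mistral weights re-download on every cold start. A persistent modal.Volume avoids this; a sketch, with the volume name ("hf-cache") as an assumption:

```python
import modal

app = modal.App("ml-inference-cached")
# A named volume persists across runs; created on first use
cache = modal.Volume.from_name("hf-cache", create_if_missing=True)
image = modal.Image.debian_slim(python_version="3.12").pip_install(
    "torch==2.2.2", "transformers==4.40.0", "accelerate==0.29.3"
)

@app.function(
    image=image,
    gpu="A10G",
    volumes={"/cache": cache},  # Mounted at /cache inside the container
)
def cached_inference(prompt: str) -> str:
    import os
    os.environ["HF_HOME"] = "/cache"  # Hugging Face caches downloads here
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.3",
        device_map="auto",
    )
    result = pipe(prompt, max_new_tokens=256)[0]["generated_text"]
    cache.commit()  # Persist any newly downloaded weights to the volume
    return result
```

The first call still downloads the weights; every call after that reads them from the volume instead of the Hugging Face hub.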

Step 3: Upgrade to A100 for Larger Models

Swap gpu="A10G" for gpu="A100-80GB" when loading 13B+ models or running batched inference at scale. The GPU class also accepts a count for multi-GPU jobs.

@app.function(
    image=image,
    gpu=modal.gpu.A100(size="80GB", count=2),  # 80GB variant, 2 cards (70B in bf16 needs ~140GB)
    timeout=600,
    retries=2,   # Auto-retry on preemption — critical for long training runs
)
def run_large_inference(prompts: list[str]) -> list[str]:
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        device_map="auto",           # Shards the model across both cards automatically
        torch_dtype=torch.bfloat16,  # bfloat16 halves VRAM vs float32 with negligible accuracy loss
    )
    results = pipe(prompts, max_new_tokens=128, batch_size=8)
    return [r[0]["generated_text"] for r in results]

For multi-GPU training (e.g., DDP with 4×A100s):

gpu=modal.gpu.A100(size="80GB", count=4)

Modal provisions all 4 cards on the same physical host — no network overhead between GPUs.
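Inside such a function, a single-host DDP launch can be sketched with torchrun. The training script name is a placeholder you would bake into the image yourself:

```python
import modal

app = modal.App("ddp-training")
image = modal.Image.debian_slim(python_version="3.12").pip_install("torch==2.2.2")

@app.function(
    image=image,
    gpu=modal.gpu.A100(size="80GB", count=4),  # 4 cards on one physical host
    timeout=7200,
)
def train_ddp():
    import subprocess

    # torchrun spawns one worker process per GPU; since all 4 cards share the
    # host, NCCL traffic never crosses the network. "train.py" is a hypothetical
    # DDP training script you would add to the image.
    subprocess.run(
        ["torchrun", "--standalone", "--nproc_per_node=4", "train.py"],
        check=True,
    )
```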


Step 4: Run Batch Jobs with .map()

Modal's .map() runs your function across an input list in parallel, spawning one container per item up to your concurrency limit. This is the fastest way to run batch inference or dataset preprocessing.

@app.local_entrypoint()
def batch_main():
    prompts = [
        "Summarize the transformer architecture.",
        "What is LoRA fine-tuning?",
        "Explain RLHF in simple terms.",
        "What does KV cache do?",
    ]

    # .map() fans out — each prompt gets its own GPU container
    # Returns results in input order, not completion order
    results = list(run_inference.map(prompts))

    for prompt, result in zip(prompts, results):
        print(f"Q: {prompt}\nA: {result}\n")

Run the batch:

modal run inference.py::batch_main

Parallelism is controlled by the concurrency_limit argument on @app.function or your workspace's concurrency limit (default: 10 concurrent containers on the free tier, 100+ on Team).
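As a sketch, a per-function cap looks like this (concurrency_limit is the SDK 0.64 argument name; newer SDKs have renamed it):

```python
import modal

app = modal.App("batch-capped")
image = modal.Image.debian_slim(python_version="3.12").pip_install("transformers==4.40.0")

@app.function(
    image=image,
    gpu="A10G",
    concurrency_limit=4,  # At most 4 containers fan out at once, however long the input list
)
def run_inference_capped(prompt: str) -> str:
    # Body identical to run_inference from Step 2; elided here
    ...
```

With this cap, .map() over 100 prompts still processes everything, but never holds more than 4 GPUs at a time.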


Step 5: Deploy a Persistent Inference Endpoint

For production serving, add @modal.web_endpoint on top of @app.function. Modal deploys the function as a scalable HTTPS endpoint — zero containers at rest, auto-scaling with demand.

from modal import web_endpoint

@app.function(
    image=image,
    gpu="A10G",
    timeout=120,
    keep_warm=1,   # Keep 1 container warm — eliminates cold-start for the first request
)
@web_endpoint(method="POST")
def serve(item: dict) -> dict:
    prompt = item.get("prompt", "")
    response = run_inference.local(prompt)  # .local() runs the body here, in this container's GPU
    return {"response": response}

Deploy:

modal deploy inference.py

Expected output:

✓ Created web endpoint: https://your-workspace--ml-inference-serve.modal.run

Test the live endpoint:

curl -X POST https://your-workspace--ml-inference-serve.modal.run \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is mixture of experts?"}'

keep_warm=1 costs roughly $0.022/hour on an A10G — about $16/month for an always-warm slot. Set keep_warm=0 to pay only on request.


Step 6: Schedule Training Jobs as Cron Functions

Modal supports cron-style scheduling via @app.function(schedule=...). Use this for nightly fine-tuning, daily embedding refresh, or model evaluation runs.

@app.function(
    image=image,
    gpu="A100-80GB",
    timeout=7200,     # 2-hour hard limit — prevents runaway training jobs
    schedule=modal.Cron("0 2 * * *"),  # 2 AM UTC daily
)
def nightly_finetune():
    import subprocess

    # Pull latest dataset from S3, run LoRA fine-tune, push checkpoint.
    # Assumes train_lora.py was added to the image (e.g. via image.copy_local_file)
    subprocess.run([
        "python", "train_lora.py",
        "--model", "mistralai/Mistral-7B-v0.3",
        "--data", "s3://your-bucket/dataset/latest.jsonl",
        "--output", "s3://your-bucket/checkpoints/",
        "--epochs", "1",
    ], check=True)  # check=True raises on non-zero exit so the failure surfaces in Modal's logs

Deploy once — Modal runs it on schedule without any cron daemon or always-on server.


Verification

Check running and past executions:

# List active deployments
modal app list

# Tail logs from the deployed endpoint
modal app logs ml-inference

# Launch a detached run; returns immediately, monitor progress in the dashboard
modal run inference.py --detach

In the Modal dashboard at modal.com/apps, you'll see per-run GPU seconds, memory peaks, and cost breakdowns per function.

You should see your ml-inference app listed as deployed with the endpoint URL, last invocation timestamp, and cumulative cost.


Provider           | A100 80GB                | A10G 24GB | Cold Start | Min Billing
Modal              | $1.67/hr ($0.000463/sec) | $1.10/hr  | ~2 sec     | Per second
RunPod (Spot)      | $1.49/hr                 | $0.69/hr  | 3–5 min    | Per minute
Lambda Labs        | $2.49/hr                 | N/A       | 5–10 min   | Per hour
AWS (p4d.24xlarge) | $3.21/hr                 | N/A       | 5–15 min   | Per second
Google Colab Pro+  | N/A                      | Shared    | Variable   | Per month

Modal's per-second billing and 2-second cold start make it cheapest for workloads under roughly 45 minutes. For multi-hour training runs, RunPod spot's lower hourly rate becomes more competitive, provided you can tolerate preemption.
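A rough break-even comparison, using the rates from the table. The 4-minute billed startup for the spot provider is an assumption, not a published figure:

```python
import math

MODAL_A100_PER_SEC = 0.000463  # per-second billing, ~2 s cold start (negligible)
RUNPOD_SPOT_PER_HR = 1.49      # per-minute billing, startup time billed too

def modal_cost(minutes: float) -> float:
    return MODAL_A100_PER_SEC * minutes * 60

def runpod_cost(minutes: float, startup_min: float = 4) -> float:
    # Assumes ~4 min of billed instance startup before the job begins
    return math.ceil(minutes + startup_min) * RUNPOD_SPOT_PER_HR / 60

for m in (10, 45, 480):
    print(f"{m:>3} min job  Modal ${modal_cost(m):.2f}  RunPod spot ${runpod_cost(m):.2f}")
```

Short jobs favor per-second billing; for long runs the startup overhead amortizes away and the lower hourly rate wins.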


What You Learned

  • Modal's @app.function decorator handles GPU provisioning, container builds, and teardown — no YAML, no Kubernetes
  • gpu="A10G" covers 7B models; gpu=modal.gpu.A100(size="80GB", count=2) handles 70B+ and multi-GPU training
  • .map() parallelizes batch jobs across isolated containers — fastest path for bulk inference
  • @web_endpoint with keep_warm=1 gives you a production inference API at ~$16/month for warm capacity
  • retries=2 on long-running functions protects against spot preemption

Tested on Modal SDK 0.64, Python 3.12, transformers 4.40, PyTorch 2.2.2, Ubuntu 22.04 containers


FAQ

Q: Does Modal support multi-node training across multiple hosts? A: Not natively — Modal's multi-GPU support is single-host (up to 8×A100s on one machine). For true multi-node DDP or FSDP across hosts, use RunPod clusters or AWS SageMaker.

Q: How do I access private Hugging Face models on Modal? A: Store your HF token as a Modal secret: modal secret create huggingface HF_TOKEN=hf_xxx. Reference it with secrets=[modal.Secret.from_name("huggingface")] in @app.function.
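Wired into a function, the secret from that answer looks like this sketch (the app name and model ID are placeholders):

```python
import os
import modal

app = modal.App("hf-private-demo")  # hypothetical app name
image = modal.Image.debian_slim(python_version="3.12").pip_install(
    "transformers==4.40.0"
)

@app.function(
    image=image,
    secrets=[modal.Secret.from_name("huggingface")],  # created via `modal secret create`
)
def load_private_tokenizer():
    # Modal injects each secret key as an environment variable in the container
    token = os.environ["HF_TOKEN"]
    from transformers import AutoTokenizer

    # "your-org/private-model" is a placeholder for your gated/private repo
    return AutoTokenizer.from_pretrained("your-org/private-model", token=token)
```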

Q: What is the maximum timeout for a Modal function? A: 24 hours (timeout=86400). For longer jobs, checkpoint to S3 and restart — Modal doesn't guarantee container continuity beyond 24 hours.

Q: Can Modal functions access AWS S3 or GCP storage? A: Yes. Pass AWS credentials via modal.Secret.from_dotenv() or modal.Secret.from_name(). The container has full network egress, so boto3 and google-cloud-storage work without any special configuration.
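A minimal S3 sketch under those assumptions; the secret name "aws-creds" is ours, and it should hold AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_DEFAULT_REGION:

```python
import modal

app = modal.App("s3-demo")
image = modal.Image.debian_slim(python_version="3.12").pip_install("boto3")

@app.function(
    image=image,
    secrets=[modal.Secret.from_name("aws-creds")],  # hypothetical secret with AWS keys
)
def list_checkpoints(bucket: str) -> list[str]:
    import boto3

    # boto3 picks up the injected AWS_* environment variables automatically
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix="checkpoints/")
    return [obj["Key"] for obj in resp.get("Contents", [])]
```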

Q: Does Modal work with PyTorch 2.x and CUDA 12? A: Yes. Use modal.Image.debian_slim().pip_install("torch==2.2.2+cu121", index_url="https://download.pytorch.org/whl/cu121") to pin the exact CUDA 12.1 build.