You're paying $3.20/hr for on-demand A10G to run nightly batch embeddings. The same job on spot costs $0.78/hr — and with a 2-minute checkpoint handler, interruptions cost you less than 3 minutes of work.
Your engineering manager sees the 70% savings and starts drafting the email to finance. Then the question hits: "What happens when AWS takes our GPU back mid-batch?" The answer isn't "we lose everything" or "we pay full price." It's a Python handler that listens for termination signals and saves state faster than you can switch Slack channels.
This is the reality of spot GPU instances in 2026: they're not just for experimental workloads anymore. With proper checkpointing, you can run production batch jobs at a fraction of the cost, treating interruptions as a brief pause rather than a catastrophic failure. The infrastructure exists — Modal's cold start for GPU containers averages 2-4s vs Replicate's 8-15s — making recovery nearly instantaneous.
Spot Instance Economics: Where the Real Savings Hide
Let's cut through the marketing: spot GPU instances cost 60-80% less than on-demand according to AWS and Lambda Labs 2025 data. But that headline number ignores the real variable: interruption rates. An instance that's 80% cheaper but gets reclaimed every hour is useless for your 90-minute embedding job.
Here's what providers won't put on their pricing pages:
- AWS p3/p4 families: ~5% hourly interruption rate in us-east-1 during business hours
- Lambda Labs: ~1% hourly interruption rate (their smaller market share means less demand pressure)
- Google Cloud Preemptible VMs: Predictable 24-hour maximum runtime with 30-second warning
- RunPod: "Spot" is really just lower priority on their physical hardware pool
The economics shift when you factor in checkpointing overhead. If a 5% hourly interruption rate costs you redone work plus about 3 minutes of recovery each time it fires, the combined overhead works out to roughly 8%: $0.78 × 1.08 ≈ $0.84/hr. Still about 74% cheaper than on-demand.
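As a sanity check, that interruption-adjusted model fits in a few lines of Python. The 8% overhead figure is an assumption bundling redone work and recovery time; tune it to your measured interruption rate.

```python
# Sketch of the interruption-adjusted cost model; overhead_fraction is an
# assumed combined overhead (redone work + recovery), not a measured value.

def effective_hourly_cost(spot_price: float, overhead_fraction: float) -> float:
    """Spot price inflated by the expected overhead of interruptions."""
    return spot_price * (1 + overhead_fraction)

def savings_vs_on_demand(spot_price: float, on_demand_price: float,
                         overhead_fraction: float):
    eff = effective_hourly_cost(spot_price, overhead_fraction)
    return eff, 1 - eff / on_demand_price

eff, savings = savings_vs_on_demand(0.78, 3.20, 0.08)
print(f"effective: ${eff:.2f}/hr, savings: {savings:.0%}")  # ~$0.84/hr, ~74%
```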
Open your terminal and run this to see current spot pricing:
```bash
aws ec2 describe-spot-price-history \
  --instance-types g5.48xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time $(date -u --date="-24 hours" +"%Y-%m-%dT%H:%M:%S") \
  --region us-east-1 \
  --query 'SpotPriceHistory[*].{Time:Timestamp, Price:SpotPrice}' \
  --output table
```

```bash
# Lambda Labs CLI - check GPU availability and spot pricing
lambda labs gpus list --filter="a100|a10g" --spot
```
You'll notice Lambda Labs often has more consistent availability because they're not battling AWS's internal reservation system. But when AWS does have capacity, their scale drives prices lower.
The 2-Minute Warning: How Clouds Signal Impending Doom
When AWS decides your spot instance is too popular, they don't just yank the cable. They send a polite (but firm) eviction notice through the instance metadata service. Missing this signal is like ignoring a fire alarm because you're wearing headphones.
Here's what happens across providers:
AWS: GET http://169.254.169.254/latest/meta-data/spot/instance-action returns termination details with a 2-minute warning (404 until a reclaim is scheduled; under IMDSv2 you must fetch a session token first).
Google Cloud: SHUTDOWN signal via ACPI with 30-second warning (much less generous).
Lambda Labs: Webhook to your configured endpoint or message to instance message queue.
The critical implementation detail: you must poll asynchronously. A blocking HTTP request to the metadata service will stall your batch job. Here's the error you'll see if you ignore termination signals:
```text
Spot instance termination notice - fix: implement 2-minute checkpoint handler via instance metadata endpoint
```
Implementing a Non-Blocking Checkpoint Handler
Open VS Code (Ctrl+` for the terminal) and create `checkpoint_handler.py`. This runs alongside your batch job, not in your main processing thread:
```python
import asyncio
import logging
import os
from datetime import datetime, timezone

import aiohttp


class SpotCheckpointHandler:
    def __init__(self, checkpoint_interval=300, metadata_url="http://169.254.169.254"):
        self.metadata_url = metadata_url
        self.checkpoint_interval = checkpoint_interval
        self.last_checkpoint = None
        self.termination_notice = False
        self.logger = logging.getLogger(__name__)

    async def poll_termination(self):
        """Async poll for spot termination notices."""
        while not self.termination_notice:
            try:
                timeout = aiohttp.ClientTimeout(total=2)
                async with aiohttp.ClientSession(timeout=timeout) as session:
                    # If IMDSv2 is enforced, first PUT /latest/api/token and
                    # send it as the X-aws-ec2-metadata-token header here.
                    url = f"{self.metadata_url}/latest/meta-data/spot/instance-action"
                    async with session.get(url) as resp:
                        if resp.status == 200:
                            # The metadata service replies as text/plain, so
                            # disable aiohttp's content-type check
                            data = await resp.json(content_type=None)
                            self.termination_notice = True
                            self.logger.warning("Termination notice received: %s", data)
                            await self.emergency_checkpoint()
                            return
            except aiohttp.ClientError:
                pass  # No termination notice yet
            except Exception as e:
                self.logger.error("Error polling metadata: %s", e)
            await asyncio.sleep(5)  # Poll every 5 seconds

    async def periodic_checkpoint(self, checkpoint_func):
        """Regular checkpointing independent of termination signals."""
        while not self.termination_notice:
            await asyncio.sleep(self.checkpoint_interval)
            if not self.termination_notice:
                await checkpoint_func()
                self.last_checkpoint = datetime.now(timezone.utc)

    async def emergency_checkpoint(self):
        """Execute when a termination notice is received."""
        self.logger.info("Initiating emergency checkpoint")
        # Signal the main process to checkpoint via a sentinel file
        with open("/tmp/checkpoint_now", "w") as f:
            f.write(datetime.now(timezone.utc).isoformat())
        # Give the main process 110 seconds to checkpoint (10-second buffer)
        await asyncio.sleep(110)
        self.logger.info("Checkpoint window complete - instance can terminate")

    async def run(self, checkpoint_func):
        """Main handler loop."""
        await asyncio.gather(
            self.poll_termination(),
            self.periodic_checkpoint(checkpoint_func),
        )


# Example integration with your batch job
async def save_checkpoint():
    """Your actual checkpoint logic."""
    # Save model state, batch progress, etc.
    print(f"Checkpoint saved at {datetime.now(timezone.utc)}")
    # Example: save to S3
    # await s3_client.put_object(Bucket='checkpoints', Key=f"batch_{batch_id}.json", Body=state_json)


async def main_batch_job():
    handler = SpotCheckpointHandler()
    # Run the handler alongside your processing
    handler_task = asyncio.create_task(handler.run(save_checkpoint))
    try:
        for _ in range(1000):
            # Process one batch
            await asyncio.sleep(0.1)  # Simulated work
            # Check for the emergency checkpoint signal
            if os.path.exists("/tmp/checkpoint_now"):
                print("Emergency checkpoint requested - saving state")
                await save_checkpoint()
                os.remove("/tmp/checkpoint_now")
    finally:
        handler.termination_notice = True
        handler_task.cancel()
        try:
            await handler_task
        except asyncio.CancelledError:
            pass


if __name__ == "__main__":
    asyncio.run(main_batch_job())
```
This pattern gives you 110 seconds to save state after the termination notice. The key is the non-blocking poll — your batch job continues while the handler watches for eviction.
Job State Persistence: Where to Stash Your Checkpoints
Checkpointing is useless if your storage can't keep up. Saving 16GB of model weights to S3 during a 2-minute warning requires serious throughput. Let's compare options:
Local NVMe: Sequential read speeds (7GB/s) load 70B models roughly 4-5x faster than SATA SSD (1.5GB/s). Perfect for quick saves, but ephemeral — you must replicate to persistent storage.
S3: Durable but slower. A multi-part upload of 16GB takes ~45 seconds with good network.
Redis: Fast for metadata (batch position, job IDs) but terrible for model weights.
EBS gp3: 1GB/s throughput, persistent, but attached to the dying instance.
The winning strategy: tiered storage. Save model weights to local NVMe first (18s for a 70B model), then async replicate to S3 while continuing processing. Save metadata to Redis for instant recovery.
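The tiered-save pattern can be sketched in a few lines: write to the fast tier synchronously, then replicate in a background thread so the batch loop never blocks on network I/O. The directory arguments are illustrative, and the local copy stands in for what would be a boto3 `upload_file` call in production.

```python
# Tiered checkpoint save: fast local write, async durable replicate.
import shutil
import threading
from pathlib import Path

def save_checkpoint_tiered(state: bytes, job_id: str,
                           nvme_dir: Path, durable_dir: Path) -> Path:
    """Write to the fast (NVMe) tier, then replicate to the durable tier async."""
    nvme_dir.mkdir(parents=True, exist_ok=True)
    local = nvme_dir / f"{job_id}.ckpt"
    local.write_bytes(state)  # tier 1: local NVMe, GB/s-class write
    threading.Thread(  # tier 2: replicate without blocking the batch loop
        target=_replicate, args=(local, durable_dir), daemon=True
    ).start()
    return local

def _replicate(local: Path, durable_dir: Path) -> None:
    # Stand-in for an S3 upload (e.g. boto3 s3.upload_file);
    # a local copy keeps the sketch self-contained and runnable
    durable_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(local, durable_dir / local.name)
```

The batch loop gets its return value (the local path) immediately; durability lags by one upload, which is the trade-off this strategy accepts.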
Here's the error you'll hit with poor storage planning:
```text
Error: failed to pull model, disk quota exceeded — fix: mount NVMe volume, set OLLAMA_MODELS=/mnt/nvme/models
```
Implement it like this in your Dockerfile or Modal setup:
```dockerfile
FROM nvidia/cuda:12.1.0-base-ubuntu22.04

# curl is not included in the base image
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | sh

# Mount NVMe volume for model storage
VOLUME /mnt/nvme/models

# Point Ollama at the NVMe mount and keep models resident
ENV OLLAMA_MODELS=/mnt/nvme/models
ENV OLLAMA_KEEP_ALIVE=-1

# Your checkpoint scripts
COPY checkpoint_handler.py /app/
COPY batch_processor.py /app/

CMD ["python", "/app/batch_processor.py"]
```
On Modal, you'd configure it like this:
```python
import modal

app = modal.App("spot-batch")

# Attach a volume for fast checkpointing
nvme_volume = modal.Volume.from_name("checkpoint-volume", create_if_missing=True)


@app.function(
    gpu="a100",
    volumes={"/mnt/nvme": nvme_volume},
    timeout=3600,
    # Enable spot instances
    cloud="aws",
    spot=True,
)
def run_batch():
    # Your batch job with checkpointing
    pass
```
Automatic Job Recovery: The Phoenix Pattern
A checkpoint is only valuable if you can resume from it. When your spot instance dies and a new one spins up, your system should automatically:
- Pull the latest checkpoint from S3
- Load model weights from NVMe (or S3 if not cached)
- Resume from the exact batch position in Redis
- Continue processing as if nothing happened
Here's the recovery script:
```python
import asyncio
import json
import os
from pathlib import Path

import boto3
import redis


class JobRecovery:
    def __init__(self, job_id, redis_host="localhost"):
        self.job_id = job_id
        self.redis = redis.Redis(host=redis_host, port=6379, decode_responses=True)
        self.s3 = boto3.client('s3')

    async def recover(self):
        # 1. Get job metadata from Redis
        metadata = self.redis.get(f"job:{self.job_id}:metadata")
        if not metadata:
            raise ValueError(f"No metadata found for job {self.job_id}")
        metadata = json.loads(metadata)

        # 2. Download the checkpoint from S3 if not in the NVMe cache
        checkpoint_path = Path(f"/mnt/nvme/checkpoints/{self.job_id}.ckpt")
        if not checkpoint_path.exists():
            await self._download_from_s3(
                bucket=metadata['checkpoint_bucket'],
                key=metadata['checkpoint_key'],
                local_path=checkpoint_path,
            )

        # 3. Load the batch position
        batch_position = int(self.redis.get(f"job:{self.job_id}:batch_position") or 0)

        # 4. Verify the model is loaded
        model_path = Path(metadata['model_path'])
        if not model_path.exists():
            # Load from S3 or a model registry
            await self._load_model(metadata['model_s3_uri'])

        return {
            'checkpoint_path': str(checkpoint_path),
            'batch_position': batch_position,
            'total_batches': metadata['total_batches'],
            'model_loaded': True,
        }

    async def _download_from_s3(self, bucket, key, local_path):
        # Run the blocking boto3 download in a thread pool
        loop = asyncio.get_event_loop()
        await loop.run_in_executor(
            None,
            lambda: self.s3.download_file(bucket, key, str(local_path)),
        )

    async def _load_model(self, model_s3_uri):
        # Implementation depends on your model serving setup
        # For Ollama: subprocess.run(["ollama", "pull", model_name])
        pass


# Usage in your main batch script
async def main():
    job_id = os.getenv('JOB_ID')
    recovery = JobRecovery(job_id)
    state = await recovery.recover()
    print(f"Resuming from batch {state['batch_position']} of {state['total_batches']}")
    # Continue processing...
```
The recovery time dominates your cost equation. If it takes 5 minutes to resume, those are 5 minutes of GPU time wasted. This is where NVMe pays for itself — 18s vs 74s model load time for 70B parameters.
Benchmark: Interruption-Adjusted Cost Analysis
Let's get concrete with numbers. Assume a nightly batch job: 10 million embeddings using sentence-transformers on A10G.
| Provider | Instance Type | On-Demand $/hr | Spot $/hr | Int. Rate | Checkpoint Overhead | Effective $/hr | Cost per 1M embeddings |
|---|---|---|---|---|---|---|---|
| AWS | g5.48xlarge | $3.20 | $0.78 | 5%/hr | 3 min | $0.84 | $2.81 |
| Lambda Labs | A100 80GB | $4.10 | $1.15 | 1%/hr | 3 min | $1.18 | $3.94 |
| Modal | A100 | $3.80 | $1.05 | ~3%/hr | 2.4s cold start | $1.08 | $3.61 |
| On-Demand Baseline | g5.48xlarge | $3.20 | - | 0% | 0 | $3.20 | $10.67 |
Assumptions: 300K embeddings/hour per A10G instance (parallelize across instances for the nightly 10M job), 2-minute checkpoint save time, 18s model reload on NVMe
The math reveals the truth: AWS spot wins on pure cost, but Lambda Labs' lower interruption rate might be worth the premium for time-sensitive jobs. Modal's 2.4s cold start for Llama 3 8B (measured Q1 2026) makes it ideal for stateless batch operations where you can afford to lose an instance and instantly respawn.
But here's what the table doesn't show: Kubernetes GPU scheduling overhead adds 200-400ms per pod launch. If your batch job spawns many short-lived pods, that overhead eats into savings. This is why serverless GPU platforms like Modal shine for certain patterns — they absorb the scheduling cost.
Multi-Provider Strategy: Playing the Spot Market
Putting all your spot instances in one cloud is like buying all your stocks from one company. When AWS spot prices spike or capacity vanishes, you need alternatives.
Implement a provider router that:
- Checks spot prices across AWS, Lambda Labs, and RunPod
- Launches instances where price/availability is optimal
- Routes jobs to available capacity
- Maintains checkpoint compatibility across providers
```python
import os
from dataclasses import dataclass

import boto3

# Wrapper clients for the Lambda Labs and RunPod HTTP APIs; substitute
# whatever client library you use for those providers
from lambda_labs import LambdaLabs
import runpod


@dataclass
class SpotInstance:
    provider: str
    instance_type: str
    hourly_price: float
    availability_score: float  # 0-1, based on historical interruptions
    max_bid: float  # Your maximum bid price


class SpotMarketRouter:
    def __init__(self):
        self.aws_ec2 = boto3.client('ec2')
        self.lambda_labs = LambdaLabs(api_key=os.getenv('LAMBDA_API_KEY'))
        self.runpod = runpod.RunPod(api_key=os.getenv('RUNPOD_API_KEY'))

    def get_best_instance(self, gpu_type="a10g", min_memory=24):
        """Return the best available spot instance across providers."""
        instances = []
        instances.extend(self._check_aws_spot(gpu_type, min_memory))
        instances.extend(self._check_lambda_labs(gpu_type, min_memory))
        instances.extend(self._check_runpod(gpu_type, min_memory))

        if not instances:
            raise RuntimeError("No spot instances available across providers")

        # Score each instance: lower price = better, higher availability = better
        return max(instances, key=lambda i: i.availability_score / i.hourly_price)

    def _check_aws_spot(self, gpu_type, min_memory):
        # Check AWS spot prices and availability;
        # returns a list of SpotInstance objects
        return []

    def _check_lambda_labs(self, gpu_type, min_memory):
        # Lambda Labs API call
        return []

    def _check_runpod(self, gpu_type, min_memory):
        # RunPod API call
        return []

    def launch_instance(self, spot_instance: SpotInstance):
        """Launch an instance on the selected provider."""
        if spot_instance.provider == "aws":
            return self._launch_aws_instance(spot_instance)
        elif spot_instance.provider == "lambda":
            return self._launch_lambda_instance(spot_instance)
        elif spot_instance.provider == "runpod":
            return self._launch_runpod_instance(spot_instance)

    def _launch_aws_instance(self, spot_instance):
        # Launch an AWS spot instance with room for checkpoints
        response = self.aws_ec2.request_spot_instances(
            InstanceCount=1,
            LaunchSpecification={
                'ImageId': 'ami-0c55b159cbfafe1f0',  # replace with your AMI
                'InstanceType': spot_instance.instance_type,
                'KeyName': 'spot-keypair',
                'BlockDeviceMappings': [{
                    'DeviceName': '/dev/sda1',
                    'Ebs': {
                        'VolumeSize': 500,  # room for checkpoints
                        'VolumeType': 'gp3',
                    },
                }],
            },
            SpotPrice=str(spot_instance.max_bid),
            Type='persistent',
        )
        return response['SpotInstanceRequests'][0]['SpotInstanceRequestId']

    def _launch_lambda_instance(self, spot_instance):
        # Lambda Labs launch call
        pass

    def _launch_runpod_instance(self, spot_instance):
        # RunPod launch call
        pass


# Usage
router = SpotMarketRouter()
best_instance = router.get_best_instance(gpu_type="a100", min_memory=40)
if best_instance.hourly_price > 2.00:  # Your price threshold
    print("Spot prices too high - falling back to Lambda Labs on-demand")
    best_instance = SpotInstance(
        provider="lambda",
        instance_type="a100",
        hourly_price=4.10,
        availability_score=1.0,
        max_bid=4.10,
    )
instance_id = router.launch_instance(best_instance)
```
This multi-provider approach ensures you always have GPU capacity, even during regional outages or price spikes. The checkpointing system makes instances fungible — any A100 with NVMe can resume your job.
The Real Cost: Monitoring and Orchestration Overhead
Before you migrate everything to spot, acknowledge the hidden costs:
Monitoring complexity: You need Prometheus/Grafana tracking interruption rates, checkpoint success/failure, and effective cost savings. A 15s scrape interval adds <0.1% CPU overhead but requires setup.
Orchestration: Kubernetes GPU scheduling overhead adds 200-400ms per pod launch. For thousands of short batch jobs, this matters.
Development time: Writing robust checkpointing isn't trivial. Test interruption scenarios with chaos engineering.
Storage costs: NVMe is expensive. S3 egress adds up. Redis clusters aren't free.
The break-even point depends on your batch job duration and frequency. For nightly jobs over 2 hours, spot almost always wins. For 10-minute jobs, the overhead might negate savings.
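That break-even intuition can be made concrete with a small model. The checkpoint overhead and the fixed per-job ops cost below are assumed values, not measurements; plug in your own.

```python
# Rough break-even model: spot wins when its runtime cost plus overhead
# beats on-demand. overhead_hours (checkpoint/restore time per job) and
# fixed_overhead_usd (per-job ops cost) are assumptions to tune.
def spot_saves_money(job_hours: float, on_demand_rate: float, spot_rate: float,
                     overhead_hours: float = 0.1,
                     fixed_overhead_usd: float = 0.50) -> bool:
    on_demand_cost = job_hours * on_demand_rate
    spot_cost = (job_hours + overhead_hours) * spot_rate + fixed_overhead_usd
    return spot_cost < on_demand_cost

print(spot_saves_money(2.0, 3.20, 0.84))      # True: 2-hour nightly job
print(spot_saves_money(10 / 60, 3.20, 0.84))  # False: 10-minute job
```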
Next Steps: Implementing Your Spot Migration
Start small. Don't migrate your entire training pipeline on day one:
Instrument one batch job with the checkpoint handler above. Run it on spot alongside your on-demand version for comparison.
Set up monitoring with these key metrics:
- Spot interruption rate by provider/region
- Checkpoint save/load latency
- Effective cost per million embeddings/tokens
- Job completion rate (with/without spot)
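A minimal in-process version of those metrics might look like this; in production you would export them as Prometheus gauges and counters instead. All names and sample numbers are illustrative.

```python
# Tracks the key spot metrics: interruption rate, checkpoint latency,
# and effective cost per million embeddings.
from dataclasses import dataclass, field

@dataclass
class SpotMetrics:
    gpu_hours: float = 0.0
    interruptions: int = 0
    checkpoint_latencies: list = field(default_factory=list)  # seconds
    embeddings_done: int = 0
    spend_usd: float = 0.0

    def interruption_rate(self) -> float:
        """Interruptions per GPU-hour."""
        return self.interruptions / self.gpu_hours if self.gpu_hours else 0.0

    def cost_per_million(self) -> float:
        """Effective $ per 1M embeddings."""
        if not self.embeddings_done:
            return 0.0
        return self.spend_usd / (self.embeddings_done / 1e6)

m = SpotMetrics(gpu_hours=100, interruptions=5,
                embeddings_done=30_000_000, spend_usd=84.0)
print(m.interruption_rate())  # 0.05 per GPU-hour
print(m.cost_per_million())   # 2.8 dollars per 1M embeddings
```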
Implement chaos testing: Randomly terminate instances during development to verify recovery works. Use Kubernetes pod disruption budgets or AWS Fault Injection Simulator.
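One lightweight way to chaos-test locally is to drop the same sentinel file the checkpoint handler polls for, at a random moment, so you can verify recovery in CI without waiting for AWS to reclaim a real instance. The sentinel path mirrors the handler earlier in this post; the delay bounds are arbitrary.

```python
# Chaos sketch: simulate a spot termination at a random point in the run.
import random
import threading
import time
from pathlib import Path

def inject_termination(sentinel=Path("/tmp/checkpoint_now"),
                       min_delay=1.0, max_delay=5.0):
    """After a random delay, drop the sentinel file the handler watches."""
    def _fire():
        time.sleep(random.uniform(min_delay, max_delay))
        sentinel.write_text(time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()))
    t = threading.Thread(target=_fire, daemon=True)
    t.start()
    return t  # join() it in tests to wait for the injected "termination"
```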
Create rollback procedures: When spot prices spike 10x (it happens), automatically route to on-demand with alerting.
Negotiate with providers: Once you're spending thousands monthly on spot, talk to Lambda Labs or AWS about reserved spot capacity or custom pricing.
The final truth about spot GPU instances: they're not unreliable, they're differently reliable. With checkpointing, you're not avoiding interruptions — you're making them cost less than the savings. That 70% discount becomes real when your batch job pauses for 3 minutes instead of starting over.
Your A10G at $0.78/hr is waiting. It might get taken back tomorrow, but your checkpoint handler will be ready.