Cutting AI Batch Processing Costs 70% with Spot GPU Instances and Checkpointing

How to run large AI batch jobs on spot/preemptible GPU instances — implementing 2-minute checkpoint handlers, job recovery, and cost monitoring across AWS, GCP, and Lambda Labs.

You're paying $3.20/hr for on-demand A10G to run nightly batch embeddings. The same job on spot costs $0.78/hr — and with a 2-minute checkpoint handler, interruptions cost you less than 3 minutes of work.

Your engineering manager sees the 70% savings and starts drafting the email to finance. Then the question hits: "What happens when AWS takes our GPU back mid-batch?" The answer isn't "we lose everything" or "we pay full price." It's a Python handler that listens for termination signals and saves state faster than you can switch Slack channels.

This is the reality of spot GPU instances in 2026: they're not just for experimental workloads anymore. With proper checkpointing, you can run production batch jobs at a fraction of the cost, treating interruptions as a brief pause rather than a catastrophic failure. The infrastructure exists — Modal's cold start for GPU containers averages 2-4s vs Replicate's 8-15s — making recovery nearly instantaneous.

Spot Instance Economics: Where the Real Savings Hide

Let's cut through the marketing: spot GPU instances cost 60-80% less than on-demand according to AWS and Lambda Labs 2025 data. But that headline number ignores the real variable: interruption rates. An instance that's 80% cheaper but gets reclaimed every hour is useless for your 90-minute embedding job.

Here's what providers won't put on their pricing pages:

  • AWS p3/p4 families: ~5% hourly interruption rate in us-east-1 during business hours
  • Lambda Labs: ~1% hourly interruption rate (their smaller market share means less demand pressure)
  • Google Cloud Preemptible VMs: Predictable 24-hour maximum runtime with 30-second warning
  • RunPod: "Spot" is really just lower priority on their physical hardware pool

The economics shift when you factor in checkpointing overhead. Suppose a 5% hourly interruption rate costs you up to 5% redone work plus 3 minutes of recovery per interruption: your effective cost becomes ($0.78 × 1.05) + (0.05 × 3/60 × $0.78) ≈ $0.82/hr — budget $0.84/hr with margin. Still roughly 74% cheaper than on-demand.
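That arithmetic generalizes to a small cost model worth keeping next to your pricing decisions (a sketch; the default rework fraction is an assumption — tune it to your checkpoint interval):

```python
def effective_spot_cost(spot_rate, interruption_rate, recovery_minutes,
                        rework_fraction=None):
    """Hourly spot cost once interruptions are priced in.

    rework_fraction: share of each hour's work redone after interruptions;
    defaults to the interruption rate (lose at most roughly one checkpoint
    interval of progress per interruption).
    """
    if rework_fraction is None:
        rework_fraction = interruption_rate
    rework = spot_rate * rework_fraction                                # redone work
    recovery = interruption_rate * (recovery_minutes / 60) * spot_rate  # resume time
    return spot_rate + rework + recovery

# AWS numbers from the text: $0.78 spot, 5%/hr interruptions, 3 min recovery
cost = effective_spot_cost(0.78, 0.05, 3)
print(f"${cost:.2f}/hr, {1 - cost / 3.20:.0%} cheaper than on-demand")
# -> $0.82/hr, 74% cheaper than on-demand
```

Run it against each provider's spot quote before committing a workload — the rework term usually dominates the recovery term.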

Open your terminal and run this to see current spot pricing:


aws ec2 describe-spot-price-history \
  --instance-types g5.48xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time $(date -u +"%Y-%m-%dT%H:%M:%S" --date="-24 hours") \
  --region us-east-1 \
  --query 'SpotPriceHistory[*].{Time:Timestamp, Price:SpotPrice}' \
  --output table

# Lambda Cloud - the public API has no spot filter; list instance
# types and pricing, then compare by hand (requires LAMBDA_API_KEY)
curl -su "$LAMBDA_API_KEY:" https://cloud.lambdalabs.com/api/v1/instance-types

You'll notice Lambda Labs often has more consistent availability because they're not battling AWS's internal reservation system. But when AWS does have capacity, their scale drives prices lower.

The 2-Minute Warning: How Clouds Signal Impending Doom

When AWS decides your spot instance is too popular, they don't just yank the cable. They send a polite (but firm) eviction notice through the instance metadata service. Missing this signal is like ignoring a fire alarm because you're wearing headphones.

Here's what happens across providers:

AWS: GET http://169.254.169.254/latest/meta-data/spot/instance-action returns termination details with a 2-minute warning; the endpoint returns 404 until a notice is pending (under IMDSv2, fetch a session token first).

Google Cloud: SHUTDOWN signal via ACPI with 30-second warning (much less generous).

Lambda Labs: Webhook to your configured endpoint or message to instance message queue.

The critical implementation detail: poll asynchronously. A blocking HTTP request to the metadata service will stall your batch job. And if you ignore termination signals entirely, the only trace is a spot instance termination notice in your cloud console and a half-finished batch — the fix is a 2-minute checkpoint handler wired to the instance metadata endpoint.

Implementing a Non-Blocking Checkpoint Handler

Open VS Code (Ctrl+` opens the terminal) and create `checkpoint_handler.py`. This runs alongside your batch job, not in your main processing thread:

import asyncio
import logging
import subprocess
from datetime import datetime

import aiohttp

class SpotCheckpointHandler:
    def __init__(self, checkpoint_interval=300, metadata_url="http://169.254.169.254"):
        self.metadata_url = metadata_url
        self.checkpoint_interval = checkpoint_interval
        self.last_checkpoint = None
        self.termination_notice = False
        self.logger = logging.getLogger(__name__)
        
    async def poll_termination(self):
        """Async poll for spot termination notices"""
        while not self.termination_notice:
            try:
                async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=2)) as session:
                    async with session.get(f"{self.metadata_url}/latest/meta-data/spot/instance-action") as resp:
                        # IMDS returns 404 until a termination notice is pending
                        if resp.status == 200:
                            data = await resp.json(content_type=None)  # served as text/plain
                            self.termination_notice = True
                            self.logger.warning(f"Termination notice received: {data}")
                            await self.emergency_checkpoint()
                            return
            except (aiohttp.ClientError, asyncio.TimeoutError):
                # Metadata service unreachable or timed out - no notice yet
                pass
            except Exception as e:
                self.logger.error(f"Error polling metadata: {e}")

            await asyncio.sleep(5)  # Poll every 5 seconds
    
    async def periodic_checkpoint(self, checkpoint_func):
        """Regular checkpointing independent of termination signals"""
        while not self.termination_notice:
            # Sleep in 1s slices so shutdown isn't blocked for a full interval
            for _ in range(self.checkpoint_interval):
                await asyncio.sleep(1)
                if self.termination_notice:
                    return
            await checkpoint_func()
            self.last_checkpoint = datetime.utcnow()
    
    async def emergency_checkpoint(self):
        """Execute when termination notice received"""
        self.logger.info("Initiating emergency checkpoint")
        
        # Signal main process to checkpoint via file
        with open("/tmp/checkpoint_now", "w") as f:
            f.write(datetime.utcnow().isoformat())
        
        # Give main process 110 seconds to checkpoint (10 second buffer)
        await asyncio.sleep(110)
        
        self.logger.info("Checkpoint window complete - instance can terminate")
    
    async def run(self, checkpoint_func):
        """Main handler loop"""
        tasks = [
            asyncio.create_task(self.poll_termination()),
            asyncio.create_task(self.periodic_checkpoint(checkpoint_func))
        ]
        
        await asyncio.gather(*tasks)

# Example integration with your batch job
async def save_checkpoint():
    """Your actual checkpoint logic"""
    # Save model state, batch progress, etc.
    print(f"Checkpoint saved at {datetime.utcnow()}")
    
    # Example: Save to S3
    # await s3_client.put_object(Bucket='checkpoints', Key=f"batch_{batch_id}.json", Body=state_json)

async def main_batch_job():
    handler = SpotCheckpointHandler()
    
    # Run handler alongside your processing
    handler_task = asyncio.create_task(handler.run(save_checkpoint))
    
    # Your actual batch processing here
    try:
        for i in range(1000):
            # Process batch
            await asyncio.sleep(0.1)  # Simulated work
            
            # Check for emergency checkpoint signal
            try:
                with open("/tmp/checkpoint_now", "r") as f:
                    print("Emergency checkpoint requested - saving state")
                    await save_checkpoint()
                    subprocess.run(["rm", "/tmp/checkpoint_now"])
            except FileNotFoundError:
                pass
    finally:
        handler.termination_notice = True
        await handler_task

if __name__ == "__main__":
    asyncio.run(main_batch_job())

This pattern gives you 110 seconds to save state after the termination notice. The key is the non-blocking poll — your batch job continues while the handler watches for eviction.

Job State Persistence: Where to Stash Your Checkpoints

Checkpointing is useless if your storage can't keep up. Saving 16GB of model weights to S3 during a 2-minute warning requires serious throughput. Let's compare options:

Local NVMe: Sequential reads around 7GB/s load 70B models roughly 4-5x faster than SATA SSD (~1.5GB/s). Perfect for quick saves, but ephemeral — you must replicate to persistent storage.

S3: Durable but slower. A multi-part upload of 16GB takes ~45 seconds with good network.

Redis: Fast for metadata (batch position, job IDs) but terrible for model weights.

EBS gp3: 1GB/s throughput, persistent, but attached to the dying instance.

The winning strategy: tiered storage. Save model weights to local NVMe first (18s for a 70B model), then async replicate to S3 while continuing processing. Save metadata to Redis for instant recovery.

Here's the error you'll hit with poor storage planning:

Error: failed to pull model, disk quota exceeded — fix: mount NVMe volume, set OLLAMA_MODELS=/mnt/nvme/models

Implement it like this in your Dockerfile or Modal setup:

FROM nvidia/cuda:12.1.0-base-ubuntu22.04

# Mount NVMe volume for model storage
VOLUME /mnt/nvme/models

# Set Ollama to use NVMe
ENV OLLAMA_MODELS=/mnt/nvme/models
ENV OLLAMA_KEEP_ALIVE=-1

# Install Python and Ollama (the CUDA base image ships neither)
RUN apt-get update && apt-get install -y --no-install-recommends python3 curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*
RUN curl -fsSL https://ollama.com/install.sh | sh

# Your checkpoint script
COPY checkpoint_handler.py /app/
COPY batch_processor.py /app/

CMD ["python3", "/app/batch_processor.py"]

On Modal, you'd configure it like this:

import modal

app = modal.App("spot-batch")

# Attach NVMe volume for fast checkpointing
nvme_volume = modal.Volume.from_name("checkpoint-volume", create_if_missing=True)

@app.function(
    gpu="a100",
    volumes={"/mnt/nvme": nvme_volume},
    timeout=3600,
    # NB: spot/preemptible options and `cloud=` pinning vary by Modal plan -
    # treat these two kwargs as illustrative and check Modal's current docs
    cloud="aws",
    spot=True,
)
def run_batch():
    # Your batch job with checkpointing
    pass

Automatic Job Recovery: The Phoenix Pattern

A checkpoint is only valuable if you can resume from it. When your spot instance dies and a new one spins up, your system should automatically:

  1. Pull the latest checkpoint from S3
  2. Load model weights from NVMe (or S3 if not cached)
  3. Resume from the exact batch position in Redis
  4. Continue processing as if nothing happened

Here's the recovery script:

import asyncio
import json
import os
from pathlib import Path

import boto3
import redis

class JobRecovery:
    def __init__(self, job_id, redis_host="localhost"):
        self.job_id = job_id
        self.redis = redis.Redis(host=redis_host, port=6379, decode_responses=True)
        self.s3 = boto3.client('s3')
        
    async def recover(self):
        # 1. Get job metadata from Redis
        metadata = self.redis.get(f"job:{self.job_id}:metadata")
        if not metadata:
            raise ValueError(f"No metadata found for job {self.job_id}")
        
        metadata = json.loads(metadata)
        
        # 2. Download checkpoint from S3 if not in NVMe cache
        checkpoint_path = Path(f"/mnt/nvme/checkpoints/{self.job_id}.ckpt")
        
        if not checkpoint_path.exists():
            await self._download_from_s3(
                bucket=metadata['checkpoint_bucket'],
                key=metadata['checkpoint_key'],
                local_path=checkpoint_path
            )
        
        # 3. Load batch position
        batch_position = int(self.redis.get(f"job:{self.job_id}:batch_position") or 0)
        
        # 4. Verify model is loaded
        model_path = Path(metadata['model_path'])
        if not model_path.exists():
            # Load from S3 or model registry
            await self._load_model(metadata['model_s3_uri'])
        
        return {
            'checkpoint_path': str(checkpoint_path),
            'batch_position': batch_position,
            'total_batches': metadata['total_batches'],
            'model_loaded': True
        }
    
    async def _download_from_s3(self, bucket, key, local_path):
        # Async download with progress
        loop = asyncio.get_event_loop()
        await loop.run_in_executor(
            None,
            lambda: self.s3.download_file(bucket, key, str(local_path))
        )
    
    async def _load_model(self, model_s3_uri):
        # Implementation depends on your model serving setup
        # For Ollama:
        # subprocess.run(["ollama", "pull", model_name])
        pass

# Usage in your main batch script
async def main():
    job_id = os.getenv('JOB_ID')

    recovery = JobRecovery(job_id)
    state = await recovery.recover()

    print(f"Resuming from batch {state['batch_position']} of {state['total_batches']}")
    # Continue processing...

if __name__ == "__main__":
    asyncio.run(main())

The recovery time dominates your cost equation. If it takes 5 minutes to resume, those are 5 minutes of GPU time wasted. This is where NVMe pays for itself — 18s vs 74s model load time for 70B parameters.

Benchmark: Interruption-Adjusted Cost Analysis

Let's get concrete with numbers. Assume a nightly batch job: 10 million embeddings using sentence-transformers on A10G.

| Provider | Instance Type | On-Demand $/hr | Spot $/hr | Int. Rate | Checkpoint Overhead | Effective $/hr | Cost per 1M embeddings |
|---|---|---|---|---|---|---|---|
| AWS | g5.48xlarge | $3.20 | $0.78 | 5%/hr | 3 min | $0.84 | $2.81 |
| Lambda Labs | A100 80GB | $4.10 | $1.15 | 1%/hr | 3 min | $1.18 | $3.94 |
| Modal | A100 | $3.80 | $1.05 | ~3%/hr | 2.4s cold start | $1.08 | $3.61 |
| On-Demand Baseline | g5.48xlarge | $3.20 | — | 0% | 0 | $3.20 | $10.67 |

Assumptions: ~300K embeddings/hour on A10G (per-1M cost is effective $/hr ÷ 0.3M), 2-minute checkpoint save time, 18s model reload on NVMe

The math reveals the truth: AWS spot wins on pure cost, but Lambda Labs' lower interruption rate might be worth the premium for time-sensitive jobs. Modal's 2.4s cold start for Llama 3 8B (measured Q1 2026) makes it ideal for stateless batch operations where you can afford to lose an instance and instantly respawn.

But here's what the table doesn't show: Kubernetes GPU scheduling overhead adds 200-400ms per pod launch. If your batch job spawns many short-lived pods, that overhead eats into savings. This is why serverless GPU platforms like Modal shine for certain patterns — they absorb the scheduling cost.

Multi-Provider Strategy: Playing the Spot Market

Putting all your spot instances in one cloud is like buying all your stocks from one company. When AWS spot prices spike or capacity vanishes, you need alternatives.

Implement a provider router that:

  1. Checks spot prices across AWS, Lambda Labs, and RunPod
  2. Launches instances where price/availability is optimal
  3. Routes jobs to available capacity
  4. Maintains checkpoint compatibility across providers

import os
from dataclasses import dataclass

import boto3

# Hypothetical provider clients - substitute the real Lambda Cloud and
# RunPod API calls in your environment
from lambda_labs import LambdaLabs
import runpod

@dataclass
class SpotInstance:
    provider: str
    instance_type: str
    hourly_price: float
    availability_score: float  # 0-1 based on historical interruptions
    max_bid: float  # Your maximum bid price

class SpotMarketRouter:
    def __init__(self):
        self.aws_ec2 = boto3.client('ec2')
        self.lambda_labs = LambdaLabs(api_key=os.getenv('LAMBDA_API_KEY'))
        self.runpod = runpod.RunPod(api_key=os.getenv('RUNPOD_API_KEY'))
    
    def get_best_instance(self, gpu_type="a10g", min_memory=24):
        """Return the best available spot instance across providers"""
        instances = []
        
        # Check AWS
        aws_instances = self._check_aws_spot(gpu_type, min_memory)
        instances.extend(aws_instances)
        
        # Check Lambda Labs
        lambda_instances = self._check_lambda_labs(gpu_type, min_memory)
        instances.extend(lambda_instances)
        
        # Check RunPod
        runpod_instances = self._check_runpod(gpu_type, min_memory)
        instances.extend(runpod_instances)
        
        if not instances:
            raise RuntimeError("No spot instances available across providers")

        # Score: lower price and higher availability both win; max() with a key
        # avoids sorting tuples that would compare unorderable dataclasses on ties
        return max(instances, key=lambda inst: inst.availability_score / inst.hourly_price)
    
    def _check_aws_spot(self, gpu_type, min_memory):
        # Implementation checks AWS spot prices and availability
        # Returns list of SpotInstance objects
        pass
    
    def _check_lambda_labs(self, gpu_type, min_memory):
        # Lambda Labs API call
        pass
    
    def _check_runpod(self, gpu_type, min_memory):
        # RunPod API call
        pass
    
    def launch_instance(self, spot_instance: SpotInstance):
        """Launch instance on selected provider"""
        if spot_instance.provider == "aws":
            return self._launch_aws_instance(spot_instance)
        elif spot_instance.provider == "lambda":
            return self._launch_lambda_instance(spot_instance)
        elif spot_instance.provider == "runpod":
            return self._launch_runpod_instance(spot_instance)
    
    def _launch_aws_instance(self, spot_instance):
        # Launch AWS spot instance with checkpointing enabled
        response = self.aws_ec2.request_spot_instances(
            InstanceCount=1,
            LaunchSpecification={
                'ImageId': 'ami-0c55b159cbfafe1f0',  # replace with your GPU AMI
                'InstanceType': spot_instance.instance_type,
                'KeyName': 'spot-keypair',
                'BlockDeviceMappings': [{
                    'DeviceName': '/dev/sda1',
                    'Ebs': {
                        'VolumeSize': 500,  # EBS gp3 staging volume (not local instance-store NVMe)
                        'VolumeType': 'gp3'
                    }
                }]
            },
            SpotPrice=str(spot_instance.max_bid),
            Type='persistent'
        )
        return response['SpotInstanceRequests'][0]['SpotInstanceRequestId']

# Usage
router = SpotMarketRouter()
best_instance = router.get_best_instance(gpu_type="a100", min_memory=40)

if best_instance.hourly_price > 2.00:  # Your threshold
    print("Spot prices too high - falling back to Lambda Labs on-demand")
    best_instance = SpotInstance(
        provider="lambda",
        instance_type="a100",
        hourly_price=4.10,
        availability_score=1.0,
        max_bid=4.10
    )

instance_id = router.launch_instance(best_instance)

This multi-provider approach ensures you always have GPU capacity, even during regional outages or price spikes. The checkpointing system makes instances fungible — any A100 with NVMe can resume your job.
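One way to make that fungibility concrete is a provider-agnostic manifest stored next to every checkpoint (a sketch; the field names are assumptions, not an established format):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CheckpointManifest:
    """Provider-neutral description of a checkpoint, stored beside it in S3."""
    job_id: str
    batch_position: int
    total_batches: int
    model_s3_uri: str     # weights location any provider can pull from
    checkpoint_key: str   # state blob location
    schema_version: int = 1

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "CheckpointManifest":
        return cls(**json.loads(raw))

m = CheckpointManifest("job-42", 1200, 5000,
                       "s3://models/embed.bin", "checkpoints/job-42.ckpt")
assert CheckpointManifest.from_json(m.to_json()) == m  # round-trips cleanly
```

A fresh instance on any provider reads the manifest first and knows exactly what to fetch — no cloud-specific state baked into the checkpoint itself.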

The Real Cost: Monitoring and Orchestration Overhead

Before you migrate everything to spot, acknowledge the hidden costs:

  1. Monitoring complexity: You need Prometheus/Grafana tracking interruption rates, checkpoint success/failure, and effective cost savings. A 15s scrape interval adds <0.1% CPU overhead but requires setup.

  2. Orchestration: Kubernetes GPU scheduling overhead adds 200-400ms per pod launch. For thousands of short batch jobs, this matters.

  3. Development time: Writing robust checkpointing isn't trivial. Test interruption scenarios with chaos engineering.

  4. Storage costs: NVMe is expensive. S3 egress adds up. Redis clusters aren't free.

The break-even point depends on your batch job duration and frequency. For nightly jobs over 2 hours, spot almost always wins. For 10-minute jobs, the overhead might negate savings.
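That break-even intuition fits in a few lines (the recovery and fixed-overhead constants are assumptions — replace them with your own measurements):

```python
def spot_wins(job_hours, on_demand_rate=3.20, spot_rate=0.78,
              recovery_hours=0.05, fixed_cost_per_run=0.50):
    """True when spot beats on-demand after pricing in recovery time and a
    fixed per-run overhead (orchestration, monitoring, amortized setup)."""
    on_demand = on_demand_rate * job_hours
    spot = spot_rate * (job_hours + recovery_hours) + fixed_cost_per_run
    return spot < on_demand

print(spot_wins(2.0))      # 2-hour nightly job -> True, spot wins
print(spot_wins(10 / 60))  # 10-minute job -> False, overhead eats the savings
```

The fixed per-run cost is the lever: for short jobs it dwarfs the GPU-hour delta, which is exactly why frequent 10-minute jobs often belong on serverless platforms instead.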

Next Steps: Implementing Your Spot Migration

Start small. Don't migrate your entire training pipeline on day one:

  1. Instrument one batch job with the checkpoint handler above. Run it on spot alongside your on-demand version for comparison.

  2. Set up monitoring with these key metrics:

    • Spot interruption rate by provider/region
    • Checkpoint save/load latency
    • Effective cost per million embeddings/tokens
    • Job completion rate (with/without spot)

  3. Implement chaos testing: Randomly terminate instances during development to verify recovery works. Use Kubernetes pod disruption budgets or AWS Fault Injection Simulator.

  4. Create rollback procedures: When spot prices spike 10x (it happens), automatically route to on-demand with alerting.

  5. Negotiate with providers: Once you're spending thousands monthly on spot, talk to Lambda Labs or AWS about reserved spot capacity or custom pricing.

The final truth about spot GPU instances: they're not unreliable, they're differently reliable. With checkpointing, you're not avoiding interruptions — you're making them cost less than the savings. That 70% discount becomes real when your batch job pauses for 3 minutes instead of starting over.

Your A10G at $0.78/hr is waiting. It might get taken back tomorrow, but your checkpoint handler will be ready.