Problem: GPU Instances Are Expensive and Hard to Right-Size
You spun up a GPU instance for inference. It idles at 3% utilization 22 hours a day — costing $600/month — then falls over under load the other 2 hours.
Manual scaling doesn't work. Setting a fixed cluster size means you're either wasting money or dropping requests. You need infrastructure that scales with real demand and refuses to spend past a defined budget.
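To put numbers on the waste, here is a quick illustrative cost model. The hourly rate is an assumption for round numbers, not a quoted AWS price — check current g5.xlarge pricing for your region:

```python
# Illustrative cost model: always-on GPU node vs scale-to-zero.
# HOURLY_RATE is an assumption, not a quoted AWS price.
HOURLY_RATE = 1.00       # $/hour, roughly g5.xlarge on-demand territory
HOURS_PER_DAY = 24
BUSY_HOURS_PER_DAY = 2   # the load window from the scenario above
DAYS = 30

always_on = HOURLY_RATE * HOURS_PER_DAY * DAYS           # pay for every hour
scale_to_zero = HOURLY_RATE * BUSY_HOURS_PER_DAY * DAYS  # pay only under load
waste = always_on - scale_to_zero

print(f"always-on:     ${always_on:.0f}/month")
print(f"scale-to-zero: ${scale_to_zero:.0f}/month")
print(f"idle waste:    ${waste:.0f}/month ({waste / always_on:.0%})")
```

With these assumptions the idle node burns $660/month — 92% of spend — which is the gap this article's architecture closes.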
You'll learn:
- How to provision GPU instances on AWS (g5) with Terraform modules, and how the pattern maps to GCP (a2)
- Auto Scaling Group configuration tied to custom inference queue metrics
- Budget alerts and hard-cutoff cost guards using AWS Budgets + Lambda and GCP Budget API
- How to structure Terraform workspaces so dev, staging, and prod have different scale limits
Time: 40 min | Difficulty: Advanced
Why Standard Autoscaling Doesn't Work for AI Workloads
EC2 Auto Scaling and GCP Managed Instance Groups scale on CPU and memory by default. GPU inference workloads break both assumptions:
- GPU load is invisible to the default metrics. A model server at 5% GPU utilization and one at 95% both look idle to the CPU metric — until the 95% one starts queuing and crashing.
- Cold starts are expensive. A g5.xlarge takes 4–6 minutes to boot, pull the model, and serve its first request. Standard reactive scaling is too slow.
- Spot/preemptible interruptions corrupt in-flight requests. You need a drain-and-replace strategy, not a kill-and-replace one.
The solution: scale on inference queue depth (SQS or Pub/Sub message count), not on instance metrics. Add a budget guard that scales the group to zero if spend crosses a threshold.
Architecture Overview
Request traffic
      │
      ▼
Load Balancer (ALB / Cloud Load Balancing)
      │
      ▼
Inference Queue (SQS / Pub/Sub) ──── queue depth metric ────┐
      │                                                     │
      ▼                                                     ▼
Model Server ASG / MIG                               Scaling Policy
(g5.xlarge / a2-highgpu-1g)                                 │
      │                                                     ▼
      ▼                              Budget Guard (Lambda / Cloud Function)
S3 / GCS model cache                 (scale-to-zero if over budget)
Terraform manages every layer. State lives in S3 + DynamoDB (AWS) or GCS (GCP) with locking.
Solution
Step 1: Set Up Terraform Workspaces and Remote State
Separate workspaces prevent a terraform apply in dev from touching prod GPU instances.
# Initialize with S3 backend (AWS example)
terraform init \
-backend-config="bucket=my-tf-state-2026" \
-backend-config="key=ai-infra/terraform.tfstate" \
-backend-config="region=us-east-1" \
-backend-config="dynamodb_table=tf-state-lock"
# Create isolated workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
Your backend.tf:
terraform {
required_version = ">= 1.7.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.40"
}
}
backend "s3" {
# Values injected via -backend-config at init time
# Never hardcode bucket names in version-controlled code
}
}
Your variables.tf — workspace-aware GPU limits:
variable "gpu_instance_type" {
description = "EC2 instance type for inference nodes"
type = string
default = "g5.xlarge" # 24GB A10G; use g5.2xlarge for 70B models
}
variable "max_gpu_instances" {
description = "Hard ceiling on GPU node count — enforced per workspace"
type = number
# dev=1, staging=2, prod=10 — set via workspace-specific tfvars
}
variable "monthly_budget_usd" {
description = "Hard budget limit in USD; triggers scale-to-zero at threshold"
type = number
}
variable "scale_in_cooldown_seconds" {
description = "Wait this long before removing a node — lets in-flight requests drain"
type = number
default = 300 # 5 min; long enough for a p99 inference call to finish
}
Create per-workspace var files:
# dev.tfvars
max_gpu_instances = 1
monthly_budget_usd = 150
# staging.tfvars
max_gpu_instances = 2
monthly_budget_usd = 400
# prod.tfvars
max_gpu_instances = 10
monthly_budget_usd = 3000
Apply with:
terraform workspace select prod
terraform apply -var-file=prod.tfvars
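If you would rather not depend on remembering the right -var-file, the same per-environment limits can live in a locals map keyed on the workspace — an unknown workspace then fails at plan time instead of silently using defaults. A sketch of that alternative:

```hcl
# locals.tf — alternative to per-workspace tfvars files
locals {
  limits = {
    dev     = { max_gpu_instances = 1,  monthly_budget_usd = 150 }
    staging = { max_gpu_instances = 2,  monthly_budget_usd = 400 }
    prod    = { max_gpu_instances = 10, monthly_budget_usd = 3000 }
  }
  # Fails with a clear error if the active workspace isn't listed
  workspace_limits = local.limits[terraform.workspace]
}

# Reference as local.workspace_limits.max_gpu_instances instead of
# var.max_gpu_instances in the resources below.
```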
Step 2: Provision the GPU Launch Template and ASG
# launch_template.tf
data "aws_ami" "gpu_base" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2) *"]
}
}
resource "aws_launch_template" "inference" {
name_prefix = "inference-${terraform.workspace}-"
image_id = data.aws_ami.gpu_base.id
instance_type = var.gpu_instance_type
  # Spot pricing cuts g5.xlarge cost by roughly 60–70% vs on-demand.
  # Do NOT request spot here via instance_market_options: a launch template
  # that sets spot options is rejected when used with the ASG's
  # mixed_instances_policy, which owns the spot/on-demand split instead.
iam_instance_profile {
name = aws_iam_instance_profile.inference.name
}
network_interfaces {
associate_public_ip_address = false
security_groups = [aws_security_group.inference.id]
}
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = 100 # GB; enough for a 70B Q4 model
volume_type = "gp3"
iops = 4000
delete_on_termination = true
}
}
user_data = base64encode(templatefile("${path.module}/scripts/bootstrap.sh", {
model_s3_path = var.model_s3_path
sqs_queue_url = aws_sqs_queue.inference.url
workspace = terraform.workspace
}))
tag_specifications {
resource_type = "instance"
tags = {
Name = "inference-${terraform.workspace}"
Workspace = terraform.workspace
ManagedBy = "terraform"
}
}
lifecycle {
create_before_destroy = true
}
}
The ASG with a mixed instances policy (spot across multiple instance types; raise on_demand_base_capacity if you need an on-demand fallback):
# asg.tf
resource "aws_autoscaling_group" "inference" {
name = "inference-asg-${terraform.workspace}"
vpc_zone_identifier = var.private_subnet_ids
min_size = 0 # allow scale-to-zero for cost guard
max_size = var.max_gpu_instances
desired_capacity = 0 # start at zero; queue metric brings nodes up
health_check_type = "ELB"
health_check_grace_period = 360 # 6 min — enough for model load
# Drain instances gracefully before termination
termination_policies = ["OldestInstance"]
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 50
}
}
mixed_instances_policy {
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.inference.id
version = "$Latest"
}
    # Diversify across instance types: g5.xlarge spot first, g5.2xlarge spot as fallback
override {
instance_type = "g5.xlarge"
weighted_capacity = "1"
}
override {
instance_type = "g5.2xlarge"
weighted_capacity = "2"
}
}
instances_distribution {
on_demand_base_capacity = 0
on_demand_percentage_above_base_capacity = 0 # 100% spot
spot_allocation_strategy = "capacity-optimized"
}
}
lifecycle {
ignore_changes = [desired_capacity] # let scaling policies own this
}
tag {
key = "Workspace"
value = terraform.workspace
propagate_at_launch = true
}
}
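The bootstrap script in Step 6 calls complete-lifecycle-action, which only works if the ASG has a matching launch lifecycle hook. A sketch of that hook (timeout values are assumptions — size them to your worst-case model-load time):

```hcl
# Hold new instances in Pending:Wait until bootstrap signals readiness
resource "aws_autoscaling_lifecycle_hook" "launch" {
  name                   = "inference-launch-hook" # must match the bootstrap script
  autoscaling_group_name = aws_autoscaling_group.inference.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_LAUNCHING"
  heartbeat_timeout      = 600       # 10 min; longer than model download + load
  default_result         = "ABANDON" # terminate instances that never signal ready
}
```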
Step 3: SQS Queue and Custom CloudWatch Metric for Scaling
Scale on queue depth, not CPU. One message = one pending inference request.
# queue.tf
resource "aws_sqs_queue" "inference" {
name = "inference-queue-${terraform.workspace}"
visibility_timeout_seconds = 120 # longer than your p99 inference latency
message_retention_seconds = 3600
# Prevent messages from piling up invisibly
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.inference_dlq.arn
maxReceiveCount = 3
})
}
resource "aws_sqs_queue" "inference_dlq" {
name = "inference-dlq-${terraform.workspace}"
message_retention_seconds = 86400 # 24h; gives you time to inspect failures
}
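The model server itself (/opt/inference/server.py, started in Step 6) is out of scope here, but its queue-consumer loop has one rule worth spelling out: delete a message only after inference succeeds, so a crash replays the request and, after maxReceiveCount attempts, routes it to the DLQ. A minimal sketch — function and parameter names are mine:

```python
import json

def consume_once(sqs, queue_url: str, infer) -> int:
    """Pull one batch from SQS, run inference, delete on success.

    `sqs` is a boto3 SQS client (or any stub with the same interface);
    `infer` is a callable taking the decoded request dict.
    Returns the number of messages processed.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=4,
        WaitTimeSeconds=20,  # long poll: fewer empty receives
    )
    processed = 0
    for msg in resp.get("Messages", []):
        # infer() raising means no delete: the message becomes visible
        # again after the 120 s visibility timeout set on the queue
        infer(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        processed += 1
    return processed
```

Wrap this in a `while True` loop in the real server; the long poll keeps the loop cheap when the queue is empty.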
The scaling policy uses ApproximateNumberOfMessagesVisible — the number of messages waiting for a worker:
# scaling_policy.tf
resource "aws_autoscaling_policy" "scale_out" {
name = "inference-scale-out-${terraform.workspace}"
autoscaling_group_name = aws_autoscaling_group.inference.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
customized_metric_specification {
metric_name = "ApproximateNumberOfMessagesVisible"
namespace = "AWS/SQS"
statistic = "Average"
      metric_dimension {
        name  = "QueueName"
        value = aws_sqs_queue.inference.name
      }
}
    # Target: keep ~10 visible messages in the queue overall.
    # Tune this based on your model's requests/sec throughput; for true
    # per-instance scaling, publish a custom backlog-per-instance metric.
    target_value = 10
# Do not scale in automatically — use a separate conservative scale-in policy
disable_scale_in = true
}
}
resource "aws_autoscaling_policy" "scale_in" {
name = "inference-scale-in-${terraform.workspace}"
autoscaling_group_name = aws_autoscaling_group.inference.name
policy_type = "StepScaling"
adjustment_type = "ChangeInCapacity"
step_adjustment {
scaling_adjustment = -1
metric_interval_upper_bound = 0 # trigger when metric < threshold
}
# Long cooldown prevents thrashing — GPU startup cost makes rapid cycling expensive
cooldown = var.scale_in_cooldown_seconds
}
resource "aws_cloudwatch_metric_alarm" "queue_empty" {
alarm_name = "inference-queue-empty-${terraform.workspace}"
comparison_operator = "LessThanThreshold"
evaluation_periods = 3 # queue empty for 3 consecutive 1-min periods
metric_name = "ApproximateNumberOfMessagesVisible"
namespace = "AWS/SQS"
period = 60
statistic = "Average"
threshold = 1
dimensions = {
QueueName = aws_sqs_queue.inference.name
}
alarm_actions = [aws_autoscaling_policy.scale_in.arn]
}
Step 4: Budget Guard — Scale to Zero When Spend Hits Threshold
This is the cost guard. When monthly spend hits 80% of budget, an alert fires. At 100%, a Lambda function forces desired_capacity = 0.
# budget_guard.tf
resource "aws_budgets_budget" "gpu_inference" {
name = "gpu-inference-${terraform.workspace}"
budget_type = "COST"
limit_amount = var.monthly_budget_usd
limit_unit = "USD"
time_unit = "MONTHLY"
  cost_filter {
    name = "TagKeyValue"
    # "Workspace" must be activated as a cost allocation tag in the
    # billing console before this filter matches any spend
    values = [format("user:Workspace$%s", terraform.workspace)]
  }
# Warning at 80%
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = var.alert_emails
}
# Hard cutoff at 100% — triggers Lambda via SNS
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_arns = [aws_sns_topic.budget_breach.arn]
}
}
resource "aws_sns_topic" "budget_breach" {
name = "budget-breach-${terraform.workspace}"
}
resource "aws_sns_topic_subscription" "budget_breach_lambda" {
topic_arn = aws_sns_topic.budget_breach.arn
protocol = "lambda"
endpoint = aws_lambda_function.scale_to_zero.arn
}
The Lambda function that kills the GPU fleet:
# lambda_scale_to_zero.tf
data "archive_file" "scale_to_zero" {
type = "zip"
output_path = "${path.module}/lambda/scale_to_zero.zip"
source_dir = "${path.module}/lambda/scale_to_zero"
}
resource "aws_lambda_function" "scale_to_zero" {
function_name = "inference-scale-to-zero-${terraform.workspace}"
filename = data.archive_file.scale_to_zero.output_path
source_code_hash = data.archive_file.scale_to_zero.output_base64sha256
handler = "index.handler"
runtime = "python3.12"
timeout = 30
environment {
variables = {
ASG_NAME = aws_autoscaling_group.inference.name
WORKSPACE = terraform.workspace
}
}
role = aws_iam_role.scale_to_zero_lambda.arn
}
resource "aws_lambda_permission" "sns_invoke" {
statement_id = "AllowSNSInvoke"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.scale_to_zero.function_name
principal = "sns.amazonaws.com"
source_arn = aws_sns_topic.budget_breach.arn
}
The Lambda handler (lambda/scale_to_zero/index.py):
import boto3
import os
import json
asg_client = boto3.client("autoscaling")
def handler(event, context):
asg_name = os.environ["ASG_NAME"]
workspace = os.environ["WORKSPACE"]
# Log the budget breach for audit trail
print(json.dumps({
"action": "scale_to_zero",
"asg": asg_name,
"workspace": workspace,
"trigger": "budget_breach",
"sns_message": event["Records"][0]["Sns"]["Message"]
}))
# Set min and desired to 0 — prevents ASG from launching new instances
# Does NOT kill in-flight requests; existing instances finish their queue messages
asg_client.update_auto_scaling_group(
AutoScalingGroupName=asg_name,
MinSize=0,
DesiredCapacity=0,
)
# Suspend launch process — budget alerts can lag by up to 8 hours
# Suspension ensures no new instances start even if metric spikes
asg_client.suspend_processes(
AutoScalingGroupName=asg_name,
ScalingProcesses=["Launch"],
)
return {"statusCode": 200, "body": f"Scaled {asg_name} to zero"}
Important: AWS Budget alerts can lag up to 8 hours. The suspend_processes call is critical — it prevents the scaling policy from launching new instances even while the budget notification is in transit.
Step 5: IAM — Least-Privilege Roles
# iam.tf
# Inference instance role — can only read models from S3 and write to SQS
resource "aws_iam_role" "inference" {
name = "inference-instance-${terraform.workspace}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "ec2.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy" "inference_permissions" {
name = "inference-permissions"
role = aws_iam_role.inference.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "ReadModelWeights"
Effect = "Allow"
Action = ["s3:GetObject", "s3:ListBucket"]
Resource = [
"arn:aws:s3:::${var.model_bucket}",
"arn:aws:s3:::${var.model_bucket}/*"
]
},
{
Sid = "ConsumeInferenceQueue"
Effect = "Allow"
Action = [
"sqs:ReceiveMessage",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:ChangeMessageVisibility"
]
Resource = aws_sqs_queue.inference.arn
}
]
})
}
# Lambda role — can only update the one ASG it owns
resource "aws_iam_role" "scale_to_zero_lambda" {
name = "scale-to-zero-lambda-${terraform.workspace}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "lambda.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy" "scale_to_zero_permissions" {
name = "asg-scale-permissions"
role = aws_iam_role.scale_to_zero_lambda.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"autoscaling:UpdateAutoScalingGroup",
"autoscaling:SuspendProcesses",
"autoscaling:ResumeProcesses"
]
Resource = aws_autoscaling_group.inference.arn
},
{
Effect = "Allow"
Action = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
Resource = "arn:aws:logs:*:*:*"
}
]
})
}
Step 6: Bootstrap Script for GPU Instance Startup
The bootstrap.sh referenced in the launch template. This runs on every new instance:
#!/bin/bash
# scripts/bootstrap.sh
# Runs on instance startup — pulls model weights and starts inference server
set -euo pipefail
MODEL_S3_PATH="${model_s3_path}"
SQS_QUEUE_URL="${sqs_queue_url}"
WORKSPACE="${workspace}"
echo "[bootstrap] Starting inference node for workspace: $WORKSPACE"
# Pull model weights from S3 to local NVMe
# aws s3 sync is resumable — spot interruptions mid-download won't re-download from scratch
mkdir -p /opt/models
aws s3 sync "$MODEL_S3_PATH" /opt/models/ \
--no-progress \
--only-show-errors
echo "[bootstrap] Model weights synced"
# Install inference server (adjust for your stack — vLLM, TGI, Ollama serve, etc.)
pip install vllm==0.4.3 --quiet
# Start vLLM with the queue consumer
python3 /opt/inference/server.py \
--model /opt/models \
--sqs-queue-url "$SQS_QUEUE_URL" \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 32 &
echo "[bootstrap] Inference server started"
# Complete the ASG launch lifecycle hook — instance is ready to receive traffic
# (the ALB separately runs its own HTTP health check before marking the target healthy)
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/instance-id)
aws autoscaling complete-lifecycle-action \
--lifecycle-action-result CONTINUE \
--instance-id "$INSTANCE_ID" \
--lifecycle-hook-name inference-launch-hook \
--auto-scaling-group-name "inference-asg-$WORKSPACE"
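The drain-and-replace requirement from earlier deserves a concrete shape: EC2 publishes a two-minute spot interruption notice at a well-known IMDS path, and the server should stop pulling new queue messages the moment it appears, letting in-flight requests finish inside the window. A sketch of the watcher — polling interval and function names are mine:

```python
import json
import time
import urllib.error
import urllib.request

# EC2 posts the interruption notice here ~2 minutes before reclaiming the node.
# If IMDSv2 is enforced, fetch a token first (as in bootstrap.sh) and send it
# as the X-aws-ec2-metadata-token header.
NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_notice(status: int, body: str):
    """IMDS returns 404 until an interruption is scheduled; on 200 the body
    is JSON like {"action": "terminate", "time": "..."}."""
    if status != 200:
        return None
    return json.loads(body).get("action")

def fetch_notice():
    try:
        with urllib.request.urlopen(NOTICE_URL, timeout=2) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as e:
        return e.code, ""

def watch_and_drain(stop_consuming, interval: float = 5.0) -> None:
    """Poll for the notice; on interruption, stop taking new SQS messages."""
    while True:
        if parse_notice(*fetch_notice()):
            stop_consuming()  # e.g. set a flag the consumer loop checks
            return
        time.sleep(interval)
```

Un-deleted messages from a reclaimed instance simply reappear after the visibility timeout and are picked up by a replacement node, which is what makes the kill survivable.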
Verification
Apply the configuration and verify each layer:
# Apply to dev workspace first
terraform workspace select dev
terraform plan -var-file=dev.tfvars -out=dev.plan
terraform apply dev.plan
Check ASG is at zero (correct — waits for queue messages):
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names inference-asg-dev \
--query 'AutoScalingGroups[0].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity}'
Expected:
{ "Min": 0, "Max": 1, "Desired": 0 }
Trigger scale-out by sending test messages to the queue:
# Send 15 messages — above the target of 10 visible messages, so the policy scales out
for i in $(seq 1 15); do
aws sqs send-message \
--queue-url "$(terraform output -raw sqs_queue_url)" \
--message-body "{\"request_id\": \"test-$i\", \"prompt\": \"Hello world\"}"
done
# Watch ASG desired capacity climb (target-tracking alarms need a few minutes of datapoints)
watch -n 10 "aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names inference-asg-dev \
--query 'AutoScalingGroups[0].DesiredCapacity'"
Test the budget guard Lambda directly:
aws lambda invoke \
--function-name inference-scale-to-zero-dev \
--payload '{"Records":[{"Sns":{"Message":"Budget breach test"}}]}' \
--cli-binary-format raw-in-base64-out \
response.json
cat response.json
# {"statusCode": 200, "body": "Scaled inference-asg-dev to zero"}
# Verify the Launch process is suspended
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names inference-asg-dev \
--query 'AutoScalingGroups[0].SuspendedProcesses'
You should see Launch in the suspended processes list.
Resume after resolving the budget issue:
aws autoscaling resume-processes \
--auto-scaling-group-name inference-asg-dev \
--scaling-processes Launch
What You Learned
- Scale GPU ASGs on SQS queue depth, not CPU — CPU metrics are meaningless for GPU inference
- min_size = 0 + desired_capacity = 0 enables true scale-to-zero; you pay nothing when idle
- suspend_processes is more reliable than desired_capacity = 0 alone for budget enforcement, because scaling policies can fight a desired-capacity change
- AWS Budget alert lag (up to 8 hours) means you need a hard process suspension, not just a notification
- Spot instances with capacity-optimized allocation cut GPU costs 60–70% vs on-demand; instance-type diversification handles interruptions transparently
Limitations to plan for:
- This pattern is AWS-specific. GCP equivalent uses Managed Instance Groups + Pub/Sub + Cloud Budget API — same architecture, different Terraform resources.
- Budget alerts are eventually consistent. For stricter enforcement, run a scheduled Lambda every 15 minutes that checks Cost Explorer and suspends proactively.
- complete-lifecycle-action in the bootstrap script requires a lifecycle hook on the ASG. Add an aws_autoscaling_lifecycle_hook resource if you use the health check pattern above.
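The scheduled proactive check mentioned above can be a small Lambda run every 15 minutes. A sketch — it assumes Cost Explorer is enabled on the account and the Workspace cost allocation tag is active; function names are mine:

```python
from datetime import date

def month_to_date_spend(ce_response: dict) -> float:
    """Sum UnblendedCost amounts across a get_cost_and_usage response."""
    return sum(
        float(r["Total"]["UnblendedCost"]["Amount"])
        for r in ce_response["ResultsByTime"]
    )

def check_and_suspend(budget_usd: float, asg_name: str, workspace: str) -> bool:
    """Suspend the ASG's Launch process if month-to-date spend exceeds budget.

    Returns True if suspension was triggered."""
    import boto3  # lazy import keeps month_to_date_spend testable offline

    ce = boto3.client("ce")  # requires ce:GetCostAndUsage permission
    today = date.today()
    resp = ce.get_cost_and_usage(
        # End is exclusive; this covers the 1st of the month through yesterday
        TimePeriod={"Start": today.replace(day=1).isoformat(),
                    "End": today.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Tags": {"Key": "Workspace", "Values": [workspace]}},
    )
    if month_to_date_spend(resp) < budget_usd:
        return False
    boto3.client("autoscaling").suspend_processes(
        AutoScalingGroupName=asg_name, ScalingProcesses=["Launch"]
    )
    return True
```

Unlike the budget notification path, this closes the enforcement gap to your polling interval rather than up to 8 hours.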
Tested on Terraform 1.8.2, AWS provider 5.43.0, g5.xlarge (A10G 24GB), Amazon Linux 2 Deep Learning AMI, us-east-1