Terraform AI Infrastructure: GPU Autoscaling and Cost Guards 2026

Provision GPU instances for AI workloads with Terraform, auto-scale on inference demand, and enforce budget limits to prevent runaway cloud bills.

Problem: GPU Instances Are Expensive and Hard to Right-Size

You spun up a GPU instance for inference. It idles at 3% utilization 22 hours a day — costing $600/month — then falls over under load the other 2 hours.

Manual scaling doesn't work. Setting a fixed cluster size means you're either wasting money or dropping requests. You need infrastructure that scales with real demand and refuses to spend past a defined budget.
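The economics are worth making concrete. A quick back-of-envelope calculation (the hourly rate here is an assumed round number for illustration, not a quoted AWS price):

```python
# Illustrative only: assumed GPU hourly rate and a 2-busy-hours/day workload
HOURLY_RATE = 0.85        # assumed g5-class on-demand price, USD/hour
HOURS_PER_MONTH = 730
BUSY_HOURS_PER_DAY = 2

monthly_cost = HOURLY_RATE * HOURS_PER_MONTH          # paid for 24/7 uptime
useful_hours = BUSY_HOURS_PER_DAY * 30                # hours actually serving load
effective_rate = monthly_cost / useful_hours          # cost per *useful* GPU-hour

print(round(monthly_cost))        # → 620
print(round(effective_rate, 2))   # → 10.34
```

At 2 busy hours a day you pay roughly 12x the sticker price per useful GPU-hour, which is the gap autoscaling closes.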

You'll learn:

  • How to provision GPU instances on AWS (g5) and GCP (a2) using Terraform modules
  • Auto Scaling Group configuration tied to custom inference queue metrics
  • Budget alerts and hard-cutoff cost guards using AWS Budgets + Lambda and GCP Budget API
  • How to structure Terraform workspaces so dev, staging, and prod have different scale limits

Time: 40 min | Difficulty: Advanced


Why Standard Autoscaling Doesn't Work for AI Workloads

EC2 Auto Scaling and GCP Managed Instance Groups scale on CPU and memory by default. GPU inference workloads break those assumptions in three ways:

  • GPU utilization is binary. A model server at 5% GPU utilization and one at 95% both look idle to the CPU metric — until the 95% one starts queuing and crashing.
  • Cold starts are expensive. A g5.xlarge takes 4–6 minutes to boot, pull the model, and serve its first request. Standard reactive scaling is too slow.
  • Spot/preemptible interruptions corrupt in-flight requests. You need a drain-and-replace strategy, not a kill-and-replace one.

The solution: scale on inference queue depth (SQS or Pub/Sub message count), not on instance metrics. Add a budget guard that scales the group to zero if spend crosses a threshold.
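The queue-depth rule is easy to state precisely. A minimal sketch of the capacity math the scaling policy approximates (function name and defaults are illustrative, not part of any AWS API):

```python
import math

def desired_capacity(queue_depth: int, target_per_instance: int = 10,
                     min_size: int = 0, max_size: int = 10) -> int:
    """Backlog-per-instance scaling: run enough nodes that each one
    sees roughly target_per_instance pending messages."""
    if queue_depth <= 0:
        return min_size
    wanted = math.ceil(queue_depth / target_per_instance)
    return max(min_size, min(wanted, max_size))

print(desired_capacity(0))    # → 0   (scale-to-zero when idle)
print(desired_capacity(15))   # → 2
print(desired_capacity(500))  # → 10  (clamped at max_size)
```

The `max_size` clamp is the first cost guard; the budget guard described later is the second, independent one.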


Architecture Overview

Request traffic
      │
      ▼
  Load Balancer (ALB / Cloud Load Balancing)
      │
      ▼
  Inference Queue (SQS / Pub/Sub)  ◀── queue depth metric
      │                                        │
      ▼                                        ▼
  Model Server ASG / MIG               Scaling Policy
  (g5.xlarge / a2-highgpu-1g)                  │
      │                                        ▼
      ▼                              Budget Guard Lambda / CF
  S3 / GCS model cache              (scale-to-zero if over budget)

Terraform manages every layer. State lives in S3 + DynamoDB (AWS) or GCS (GCP) with locking.


Solution

Step 1: Set Up Terraform Workspaces and Remote State

Separate workspaces prevent a terraform apply in dev from touching prod GPU instances.

# Initialize with S3 backend (AWS example)
terraform init \
  -backend-config="bucket=my-tf-state-2026" \
  -backend-config="key=ai-infra/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=tf-state-lock"

# Create isolated workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod

Your backend.tf:

terraform {
  required_version = ">= 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }

  backend "s3" {
    # Values injected via -backend-config at init time
    # Never hardcode bucket names in version-controlled code
  }
}

Your variables.tf — workspace-aware GPU limits:

variable "gpu_instance_type" {
  description = "EC2 instance type for inference nodes"
  type        = string
  default     = "g5.xlarge"  # 24GB A10G; use g5.2xlarge for 70B models
}

variable "max_gpu_instances" {
  description = "Hard ceiling on GPU node count — enforced per workspace"
  type        = number
  # dev=1, staging=2, prod=10 — set via workspace-specific tfvars
}

variable "monthly_budget_usd" {
  description = "Hard budget limit in USD; triggers scale-to-zero at threshold"
  type        = number
}

variable "scale_in_cooldown_seconds" {
  description = "Wait this long before removing a node — lets in-flight requests drain"
  type        = number
  default     = 300  # 5 min; long enough for a p99 inference call to finish
}

Create per-workspace var files:

# dev.tfvars
max_gpu_instances    = 1
monthly_budget_usd   = 150

# staging.tfvars
max_gpu_instances    = 2
monthly_budget_usd   = 400

# prod.tfvars
max_gpu_instances    = 10
monthly_budget_usd   = 3000

Apply with:

terraform workspace select prod
terraform apply -var-file=prod.tfvars
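Because passing the wrong -var-file silently applies the wrong limits, a tiny pre-apply sanity check can pay for itself. A hypothetical guard script, assuming the flat key = value tfvars shown above:

```python
def parse_tfvars(text: str) -> dict:
    """Minimal parser for flat `key = value` numeric tfvars lines.
    (Real tfvars can hold strings, lists, and maps; this sketch does not.)"""
    out = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if "=" in line:
            key, _, value = line.partition("=")
            out[key.strip()] = float(value.strip())
    return out

dev = parse_tfvars("max_gpu_instances = 1\nmonthly_budget_usd = 150")
prod = parse_tfvars("max_gpu_instances = 10\nmonthly_budget_usd = 3000")

# Catch a fat-fingered promotion before `terraform apply` ever runs
assert dev["max_gpu_instances"] <= prod["max_gpu_instances"]
assert dev["monthly_budget_usd"] < prod["monthly_budget_usd"]
print("tfvars sanity check passed")
```

Run it in CI before the plan step so a dev file with prod-sized limits never reaches apply.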

Step 2: Provision the GPU Launch Template and ASG

# launch_template.tf

data "aws_ami" "gpu_base" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2) *"]
  }
}

resource "aws_launch_template" "inference" {
  name_prefix   = "inference-${terraform.workspace}-"
  image_id      = data.aws_ami.gpu_base.id
  instance_type = var.gpu_instance_type

  # Spot cuts g5.xlarge cost by roughly 70% vs on-demand. Spot purchasing is
  # configured in the ASG's mixed_instances_policy (asg.tf), not here: AWS
  # rejects a launch template that requests Spot via instance_market_options
  # when it is combined with a mixed instances policy.

  iam_instance_profile {
    name = aws_iam_instance_profile.inference.name
  }

  network_interfaces {
    associate_public_ip_address = false
    security_groups             = [aws_security_group.inference.id]
  }

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size           = 100   # GB; enough for a 70B Q4 model
      volume_type           = "gp3"
      iops                  = 4000
      delete_on_termination = true
    }
  }

  user_data = base64encode(templatefile("${path.module}/scripts/bootstrap.sh", {
    model_s3_path = var.model_s3_path
    sqs_queue_url = aws_sqs_queue.inference.url
    workspace     = terraform.workspace
  }))

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name      = "inference-${terraform.workspace}"
      Workspace = terraform.workspace
      ManagedBy = "terraform"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

The ASG with a mixed instances policy (spot across several instance-type pools; raise on_demand_percentage_above_base_capacity above 0 if you want an on-demand fallback):

# asg.tf

resource "aws_autoscaling_group" "inference" {
  name                = "inference-asg-${terraform.workspace}"
  vpc_zone_identifier = var.private_subnet_ids

  min_size         = 0  # allow scale-to-zero for cost guard
  max_size         = var.max_gpu_instances
  desired_capacity = 0  # start at zero; queue metric brings nodes up

  health_check_type         = "ELB"
  health_check_grace_period = 360  # 6 min — enough for model load

  # Retire the oldest instances first on scale-in (note: this only picks the
  # victim; graceful draining needs a terminating lifecycle hook)
  termination_policies = ["OldestInstance"]

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
    }
  }

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.inference.id
        version            = "$Latest"
      }

      # Spot pool diversification: prefer g5.xlarge, then g5.2xlarge.
      # With on_demand_percentage_above_base_capacity = 0 below there is
      # no on-demand fallback; add one by raising that percentage.
      override {
        instance_type     = "g5.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "g5.2xlarge"
        weighted_capacity = "2"
      }
    }

    instances_distribution {
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0    # 100% spot
      spot_allocation_strategy                 = "capacity-optimized"
    }
  }

  lifecycle {
    ignore_changes = [desired_capacity]  # let scaling policies own this
  }

  tag {
    key                 = "Workspace"
    value               = terraform.workspace
    propagate_at_launch = true
  }
}
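The 60–70% spot savings cited in this guide fall out of simple arithmetic. The prices below are assumptions for illustration; spot prices float by region and hour, so check current rates before relying on the number:

```python
# Assumed prices, USD/hour (verify against current us-east-1 pricing)
ON_DEMAND_HOURLY = 1.006   # assumed g5.xlarge on-demand rate
SPOT_HOURLY = 0.30         # assumed typical g5.xlarge spot rate

hours = 730  # one month
saving = 1 - SPOT_HOURLY / ON_DEMAND_HOURLY

print(f"on-demand: ${ON_DEMAND_HOURLY * hours:,.0f}/mo")
print(f"spot:      ${SPOT_HOURLY * hours:,.0f}/mo")
print(f"saving:    {saving:.0%}")   # → 70%
```

The saving compounds with scale-to-zero: spot discounts the hours you do run, and queue-driven scaling eliminates the hours you don't need.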

Step 3: SQS Queue and Custom CloudWatch Metric for Scaling

Scale on queue depth, not CPU. One message = one pending inference request.

# queue.tf

resource "aws_sqs_queue" "inference" {
  name                       = "inference-queue-${terraform.workspace}"
  visibility_timeout_seconds = 120  # longer than your p99 inference latency
  message_retention_seconds  = 3600

  # Prevent messages from piling up invisibly
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.inference_dlq.arn
    maxReceiveCount     = 3
  })
}

resource "aws_sqs_queue" "inference_dlq" {
  name                      = "inference-dlq-${terraform.workspace}"
  message_retention_seconds = 86400  # 24h; gives you time to inspect failures
}
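The comment on visibility_timeout_seconds deserves a worked example: if the timeout is shorter than processing time, SQS re-delivers the message mid-inference and you pay for duplicate GPU work. A toy model of that interaction (deliberately simplified: it only counts deliveries before the first attempt finishes):

```python
def deliveries(processing_seconds: int, visibility_timeout: int,
               max_receive_count: int = 3) -> int:
    """Count how many times one message is delivered before its first
    processing attempt completes (capped by the DLQ's maxReceiveCount)."""
    count = 1
    elapsed = visibility_timeout
    # Each time the timeout expires before processing completes, SQS makes
    # the message visible again and another worker receives it.
    while elapsed < processing_seconds and count < max_receive_count:
        count += 1
        elapsed += visibility_timeout
    return count

print(deliveries(90, 120))   # → 1: timeout > processing, delivered once
print(deliveries(150, 120))  # → 2: redelivered mid-inference, duplicate work
```

This is why the timeout is sized against p99 inference latency, not the average.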

The scaling policy uses ApproximateNumberOfMessagesVisible — the number of messages waiting for a worker:

# scaling_policy.tf

resource "aws_autoscaling_policy" "scale_out" {
  name                   = "inference-scale-out-${terraform.workspace}"
  autoscaling_group_name = aws_autoscaling_group.inference.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    customized_metric_specification {
      metric_name = "ApproximateNumberOfMessagesVisible"
      namespace   = "AWS/SQS"
      statistic   = "Average"

      metric_dimension {
        name  = "QueueName"
        value = aws_sqs_queue.inference.name
      }
    }

    # Target: keep ~10 visible messages queued. Note this AWS/SQS metric is
    # fleet-wide, not per-instance; for true backlog-per-instance scaling,
    # publish a custom metric of (queue depth / InService instances).
    # Tune based on your model's requests/sec throughput.
    target_value = 10

    # Do not scale in automatically — use a separate conservative scale-in policy
    disable_scale_in = true
  }
}

resource "aws_autoscaling_policy" "scale_in" {
  name                   = "inference-scale-in-${terraform.workspace}"
  autoscaling_group_name = aws_autoscaling_group.inference.name
  policy_type            = "SimpleScaling"  # cooldown is honored only by SimpleScaling
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = -1

  # Long cooldown prevents thrashing — GPU startup cost makes rapid cycling expensive
  cooldown = var.scale_in_cooldown_seconds
}

resource "aws_cloudwatch_metric_alarm" "queue_empty" {
  alarm_name          = "inference-queue-empty-${terraform.workspace}"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 3      # queue empty for 3 consecutive 1-min periods
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Average"
  threshold           = 1

  dimensions = {
    QueueName = aws_sqs_queue.inference.name
  }

  alarm_actions = [aws_autoscaling_policy.scale_in.arn]
}
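The alarm's evaluation_periods setting controls how patient scale-in is. A small simulation of when the queue-empty alarm would fire (illustrative logic only; CloudWatch's actual alarm evaluation has more states):

```python
def scale_in_events(depth_per_minute, evaluation_periods=3, threshold=1):
    """Yield the minutes at which the queue-empty alarm fires:
    `evaluation_periods` consecutive samples below `threshold`."""
    streak = 0
    for minute, depth in enumerate(depth_per_minute, start=1):
        streak = streak + 1 if depth < threshold else 0
        if streak == evaluation_periods:
            yield minute
            streak = 0  # alarm resets; the cooldown gates the next removal

# Queue drains at minute 4; first removal fires after three quiet minutes
samples = [12, 6, 2, 0, 0, 0, 0, 0, 0]
print(list(scale_in_events(samples)))  # → [6, 9]
```

Raising evaluation_periods trades idle cost for protection against scaling in during a brief lull.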

Step 4: Budget Guard — Scale to Zero When Spend Hits Threshold

This is the cost guard. When monthly spend hits 80% of budget, an alert fires. At 100%, a Lambda function forces desired_capacity = 0.

# budget_guard.tf

resource "aws_budgets_budget" "gpu_inference" {
  name         = "gpu-inference-${terraform.workspace}"
  budget_type  = "COST"
  limit_amount = var.monthly_budget_usd
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name = "TagKeyValue"
    # Filter format is "user:<tag-key>$<tag-value>"; "$$" renders a literal
    # "$" in HCL, and the trailing interpolation supplies the workspace name
    values = ["user:Workspace$$${terraform.workspace}"]
  }

  # Warning at 80%
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = var.alert_emails
  }

  # Hard cutoff at 100% — triggers Lambda via SNS
  notification {
    comparison_operator = "GREATER_THAN"
    threshold           = 100
    threshold_type      = "PERCENTAGE"
    notification_type   = "ACTUAL"
    subscriber_sns_arns = [aws_sns_topic.budget_breach.arn]
  }
}

resource "aws_sns_topic" "budget_breach" {
  name = "budget-breach-${terraform.workspace}"
}

resource "aws_sns_topic_subscription" "budget_breach_lambda" {
  topic_arn = aws_sns_topic.budget_breach.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.scale_to_zero.arn
}

The Lambda function that kills the GPU fleet:

# lambda_scale_to_zero.tf

data "archive_file" "scale_to_zero" {
  type        = "zip"
  output_path = "${path.module}/lambda/scale_to_zero.zip"
  source_dir  = "${path.module}/lambda/scale_to_zero"
}

resource "aws_lambda_function" "scale_to_zero" {
  function_name    = "inference-scale-to-zero-${terraform.workspace}"
  filename         = data.archive_file.scale_to_zero.output_path
  source_code_hash = data.archive_file.scale_to_zero.output_base64sha256
  handler          = "index.handler"
  runtime          = "python3.12"
  timeout          = 30

  environment {
    variables = {
      ASG_NAME  = aws_autoscaling_group.inference.name
      WORKSPACE = terraform.workspace
    }
  }

  role = aws_iam_role.scale_to_zero_lambda.arn
}

resource "aws_lambda_permission" "sns_invoke" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scale_to_zero.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.budget_breach.arn
}

The Lambda handler (lambda/scale_to_zero/index.py):

import boto3
import os
import json

asg_client = boto3.client("autoscaling")

def handler(event, context):
    asg_name = os.environ["ASG_NAME"]
    workspace = os.environ["WORKSPACE"]

    # Log the budget breach for audit trail
    print(json.dumps({
        "action": "scale_to_zero",
        "asg": asg_name,
        "workspace": workspace,
        "trigger": "budget_breach",
        "sns_message": event["Records"][0]["Sns"]["Message"]
    }))

    # Set min and desired to 0 — prevents ASG from launching new instances
    # Does NOT kill in-flight requests; existing instances finish their queue messages
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=0,
        DesiredCapacity=0,
    )

    # Suspend launch process — budget alerts can lag by up to 8 hours
    # Suspension ensures no new instances start even if metric spikes
    asg_client.suspend_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=["Launch"],
    )

    return {"statusCode": 200, "body": f"Scaled {asg_name} to zero"}

Important: AWS Budget alerts can lag up to 8 hours. The suspend_processes call is critical — it prevents the scaling policy from launching new instances even while the budget notification is in transit.
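The two notification thresholds reduce to a small decision function, which is worth unit-testing before wiring real money to it (a sketch; the function and return values are illustrative, not an AWS API):

```python
def budget_action(actual_spend: float, limit: float) -> str:
    """Mirror the two Budgets notifications: warn past 80%, cut off past 100%.
    GREATER_THAN is strict, so exactly 80% or 100% does not trigger."""
    pct = actual_spend / limit * 100
    if pct > 100:
        return "scale_to_zero"   # SNS → Lambda → desired=0 + suspend Launch
    if pct > 80:
        return "alert"           # email to var.alert_emails
    return "none"

print(budget_action(1500, 3000))  # → none
print(budget_action(2500, 3000))  # → alert
print(budget_action(3100, 3000))  # → scale_to_zero
```

The same function slots into the scheduled proactive check suggested under Limitations, driven by Cost Explorer data instead of a Budgets notification.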


Step 5: IAM — Least-Privilege Roles

# iam.tf

# Inference instance role — can only read models from S3 and write to SQS
resource "aws_iam_role" "inference" {
  name = "inference-instance-${terraform.workspace}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "inference_permissions" {
  name = "inference-permissions"
  role = aws_iam_role.inference.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "ReadModelWeights"
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:ListBucket"]
        Resource = [
          "arn:aws:s3:::${var.model_bucket}",
          "arn:aws:s3:::${var.model_bucket}/*"
        ]
      },
      {
        Sid    = "ConsumeInferenceQueue"
        Effect = "Allow"
        Action = [
          "sqs:ReceiveMessage",
          "sqs:DeleteMessage",
          "sqs:GetQueueAttributes",
          "sqs:ChangeMessageVisibility"
        ]
        Resource = aws_sqs_queue.inference.arn
      }
    ]
  })
}

# Lambda role — can only update the one ASG it owns
resource "aws_iam_role" "scale_to_zero_lambda" {
  name = "scale-to-zero-lambda-${terraform.workspace}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "scale_to_zero_permissions" {
  name = "asg-scale-permissions"
  role = aws_iam_role.scale_to_zero_lambda.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "autoscaling:UpdateAutoScalingGroup",
          "autoscaling:SuspendProcesses",
          "autoscaling:ResumeProcesses"
        ]
        Resource = aws_autoscaling_group.inference.arn
      },
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}

Step 6: Bootstrap Script for GPU Instance Startup

The bootstrap.sh referenced in the launch template. This runs on every new instance:

#!/bin/bash
# scripts/bootstrap.sh
# Runs on instance startup — pulls model weights and starts inference server

set -euo pipefail

MODEL_S3_PATH="${model_s3_path}"
SQS_QUEUE_URL="${sqs_queue_url}"
WORKSPACE="${workspace}"

echo "[bootstrap] Starting inference node for workspace: $WORKSPACE"

# Pull model weights from S3 to local NVMe
# aws s3 sync is resumable — spot interruptions mid-download won't re-download from scratch
mkdir -p /opt/models
aws s3 sync "$MODEL_S3_PATH" /opt/models/ \
  --no-progress \
  --only-show-errors

echo "[bootstrap] Model weights synced"

# Install inference server (adjust for your stack — vLLM, TGI, Ollama, etc.)
# Installing at boot adds minutes to every cold start; bake it into a custom AMI for prod
pip install vllm==0.4.3 --quiet

# Start vLLM with the queue consumer
python3 /opt/inference/server.py \
  --model /opt/models \
  --sqs-queue-url "$SQS_QUEUE_URL" \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 &

echo "[bootstrap] Inference server started"

# Complete the ASG launch lifecycle hook — instance is ready to receive traffic.
# Requires an aws_autoscaling_lifecycle_hook named "inference-launch-hook" on
# the ASG (see Limitations); the ALB health check is evaluated separately.
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

aws autoscaling complete-lifecycle-action \
  --lifecycle-action-result CONTINUE \
  --instance-id "$INSTANCE_ID" \
  --lifecycle-hook-name inference-launch-hook \
  --auto-scaling-group-name "inference-asg-$WORKSPACE"
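The script assumes a queue-consuming server at /opt/inference/server.py that this guide never shows. A minimal sketch of its consume loop, with the SQS and model calls injected as callables so the control flow is testable without AWS (the boto3 receive_message/delete_message wiring is left to you):

```python
import json

def consume(receive, delete, infer, publish, max_batches=None):
    """Core loop: poll for messages, run inference, delete on success.
    `receive`/`delete` wrap SQS calls, `infer` wraps the model server,
    `publish` delivers results. Failed messages are NOT deleted, so SQS
    redelivers them (and the DLQ catches repeat offenders)."""
    batches = 0
    while max_batches is None or batches < max_batches:
        batches += 1
        for msg in receive():                 # e.g. long poll, WaitTimeSeconds=20
            try:
                request = json.loads(msg["Body"])
                publish(request["request_id"], infer(request["prompt"]))
                delete(msg["ReceiptHandle"])  # only after a successful publish
            except Exception:
                pass  # leave the message in flight; redelivery handles retries

# Stubbed single pass, no AWS required:
inbox = [{"Body": '{"request_id": "r1", "prompt": "hi"}', "ReceiptHandle": "h1"}]
deleted, results = [], {}
consume(lambda: inbox, deleted.append,
        lambda p: p.upper(), results.__setitem__, max_batches=1)
print(deleted, results)   # → ['h1'] {'r1': 'HI'}
```

Deleting only after success is what makes the visibility timeout and DLQ settings from Step 3 do their job.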

Verification

Apply the configuration and verify each layer:

# Apply to dev workspace first
terraform workspace select dev
terraform plan -var-file=dev.tfvars -out=dev.plan
terraform apply dev.plan

Check ASG is at zero (correct — waits for queue messages):

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names inference-asg-dev \
  --query 'AutoScalingGroups[0].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity}'

Expected:

{ "Min": 0, "Max": 1, "Desired": 0 }

Trigger scale-out by sending test messages to the queue:

# Send 15 messages — should trigger scale-out (target is 10 messages/instance)
for i in $(seq 1 15); do
  aws sqs send-message \
    --queue-url "$(terraform output -raw sqs_queue_url)" \
    --message-body "{\"request_id\": \"test-$i\", \"prompt\": \"Hello world\"}"
done

# Watch ASG desired capacity climb — target tracking typically reacts within 1–3 minutes
watch -n 10 "aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names inference-asg-dev \
  --query 'AutoScalingGroups[0].DesiredCapacity'"

Test the budget guard Lambda directly:

aws lambda invoke \
  --function-name inference-scale-to-zero-dev \
  --payload '{"Records":[{"Sns":{"Message":"Budget breach test"}}]}' \
  --cli-binary-format raw-in-base64-out \
  response.json

cat response.json
# {"statusCode": 200, "body": "Scaled inference-asg-dev to zero"}

# Verify the Launch process is suspended
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names inference-asg-dev \
  --query 'AutoScalingGroups[0].SuspendedProcesses'

You should see: Launch in the suspended processes list.

Resume after resolving the budget issue:

aws autoscaling resume-processes \
  --auto-scaling-group-name inference-asg-dev \
  --scaling-processes Launch

What You Learned

  • Scale GPU ASGs on SQS queue depth, not CPU — CPU metrics are meaningless for GPU inference
  • min_size = 0 + desired_capacity = 0 enables true scale-to-zero; you pay nothing when idle
  • suspend_processes is more reliable than desired_capacity = 0 alone for budget enforcement, because scaling policies can fight a desired-capacity change
  • AWS Budget alert lag (up to 8 hours) means you need a hard process suspension, not just a notification
  • Spot instances with capacity-optimized allocation cut GPU costs 60–70% vs on-demand; the fallback chain handles interruptions transparently

Limitations to plan for:

  • This pattern is AWS-specific. GCP equivalent uses Managed Instance Groups + Pub/Sub + Cloud Budget API — same architecture, different Terraform resources.
  • Budget alerts are eventually consistent. For stricter enforcement, run a scheduled Lambda every 15 minutes that checks Cost Explorer and suspends proactively.
  • complete-lifecycle-action in the bootstrap script requires a lifecycle hook on the ASG. Add aws_autoscaling_lifecycle_hook resource if you use the health check pattern above.

Tested on Terraform 1.8.2, AWS provider 5.43.0, g5.xlarge (A10G 24GB), Amazon Linux 2 Deep Learning AMI, us-east-1