Problem: Gemini 2.0 in Dev Works — Production Breaks Under Load
You've tested Gemini 2.0 with direct API calls. Then you hit production: rate limits, cold starts, missing IAM permissions, and no fallback when the model endpoint throws a 503.
Deploying Gemini 2.0 at scale on Vertex AI means wiring together autoscaling, VPC service controls, dedicated endpoints, and observability — none of which is obvious from the quickstart docs.
You'll learn:
- Provision a Vertex AI dedicated endpoint for Gemini 2.0 Flash and Pro
- Configure autoscaling and traffic splitting between model versions
- Lock down access with Workload Identity and VPC Service Controls
- Stream responses with the Python SDK and handle retries correctly
Time: 30 min | Difficulty: Advanced
Why Direct API Calls Don't Scale
The generativelanguage.googleapis.com endpoint is rate-limited per project and shared across all callers. At scale you need:
- Dedicated throughput — Vertex AI endpoints give you reserved capacity
- Private networking — route traffic over VPC without leaving Google's backbone
- Per-service IAM — Workload Identity binds a K8s service account to a GCP service account, no JSON keys
- Traffic splitting — shift load between gemini-2.0-flash and gemini-2.0-pro without a redeploy
A Vertex AI dedicated endpoint gives you all four.
Solution
Step 1: Enable APIs and Set Project Variables
# Set your project once — used in every command below
export PROJECT_ID="your-project-id"
export REGION="us-central1" # Gemini 2.0 is available in us-central1, europe-west4, asia-northeast1
gcloud config set project $PROJECT_ID
# Enable required APIs
gcloud services enable \
aiplatform.googleapis.com \
compute.googleapis.com \
container.googleapis.com \
cloudresourcemanager.googleapis.com \
--project=$PROJECT_ID
Expected output:
Operation "operations/acf...." finished successfully.
If it fails:
PERMISSION_DENIED → You need roles/serviceusage.serviceUsageAdmin on the project
Step 2: Create a Dedicated Vertex AI Endpoint
# Dedicated endpoints give you reserved capacity and a private DNS name
gcloud ai endpoints create \
--region=$REGION \
--display-name="gemini-prod-endpoint" \
--network="projects/$PROJECT_ID/global/networks/default" \
--project=$PROJECT_ID
# Capture the endpoint ID for the next steps
export ENDPOINT_ID=$(gcloud ai endpoints list \
--region=$REGION \
--filter="displayName=gemini-prod-endpoint" \
--format="value(name)" | awk -F'/' '{print $NF}')
echo "Endpoint ID: $ENDPOINT_ID"
The --network flag routes traffic privately over VPC. Remove it only for public dev endpoints — never in production.
Step 3: Deploy Gemini 2.0 Flash as the Primary Model
# Deploy Flash as primary (95% traffic) — faster, cheaper, handles most requests
gcloud ai endpoints deploy-model $ENDPOINT_ID \
--region=$REGION \
--model="publishers/google/models/gemini-2.0-flash-001" \
--display-name="gemini-flash-primary" \
--traffic-split="0=95" \
--min-replica-count=2 \
--max-replica-count=20 \
--machine-type="n1-standard-4" \
--project=$PROJECT_ID
Expected: Deployment takes 3–5 minutes. You'll see deployedModelId in the output — save it.
export FLASH_MODEL_ID="<deployedModelId from output>"
If it fails:
Quota exceeded → Request a quota increase for custom-model-training-us-central1 in the Cloud Console
Model not found → Confirm Gemini 2.0 is available in your region: gcloud ai models list --region=$REGION --filter="publisher=google"
Step 4: Deploy Gemini 2.0 Pro for Complex Requests
# Pro handles the 5% of requests needing deeper reasoning
gcloud ai endpoints deploy-model $ENDPOINT_ID \
--region=$REGION \
--model="publishers/google/models/gemini-2.0-pro-exp-02-05" \
--display-name="gemini-pro-fallback" \
--traffic-split="$FLASH_MODEL_ID=95,0=5" \
--min-replica-count=1 \
--max-replica-count=5 \
--machine-type="n1-standard-8" \
--project=$PROJECT_ID
The traffic-split key 0 means "this new model". The existing Flash model ($FLASH_MODEL_ID) gets 95% and Pro gets 5%.
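Once both models are deployed, you can rebalance the split later without redeploying. A minimal sketch: the validation helper below is illustrative (not part of this guide's commands), the deployed-model IDs are placeholders, and the commented-out `Endpoint.update` call assumes the google-cloud-aiplatform SDK's traffic_split parameter.

```python
# Illustrative helper: sanity-check a proposed traffic split before applying it.
# Keys are deployedModelId values (placeholders here); values are percentages.

def validate_traffic_split(split: dict[str, int]) -> dict[str, int]:
    """Reject splits that don't sum to exactly 100 percent."""
    if any(v < 0 for v in split.values()):
        raise ValueError("Traffic percentages must be non-negative")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"Traffic split must sum to 100, got {total}")
    return split

# Example: shift Flash from 95% down to 80%, Pro up to 20%
new_split = validate_traffic_split({"flash-model-id": 80, "pro-model-id": 20})
print(new_split)

# Applying it requires GCP credentials — shown for illustration only:
# from google.cloud import aiplatform
# endpoint = aiplatform.Endpoint(endpoint_name=ENDPOINT_ID)
# endpoint.update(traffic_split=new_split)
```

Validating the split client-side fails fast instead of waiting for the API to reject a malformed update.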
Step 5: Configure Workload Identity for Your App
Never use a downloaded JSON service account key in production. Workload Identity binds your GKE pod's service account directly to a GCP service account.
# Create a dedicated GCP service account for Gemini calls
gcloud iam service-accounts create gemini-inference-sa \
--display-name="Gemini Inference Service Account" \
--project=$PROJECT_ID
# Grant it the Vertex AI User role (predict only — no admin)
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:gemini-inference-sa@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/aiplatform.user"
# Allow your Kubernetes service account to impersonate it
# Replace NAMESPACE and KSA_NAME with your actual values
export NAMESPACE="production"
export KSA_NAME="gemini-app"
gcloud iam service-accounts add-iam-policy-binding \
gemini-inference-sa@$PROJECT_ID.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="serviceAccount:$PROJECT_ID.svc.id.goog[$NAMESPACE/$KSA_NAME]"
Then annotate your Kubernetes service account:
# k8s/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: gemini-app
namespace: production
annotations:
# This is what links the K8s SA to the GCP SA
iam.gke.io/gcp-service-account: gemini-inference-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com
Step 6: Call the Endpoint with Streaming and Retry Logic
# requirements: google-cloud-aiplatform>=1.50.0, tenacity
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from google.api_core.exceptions import ResourceExhausted, ServiceUnavailable
import os
PROJECT_ID = os.environ["GCP_PROJECT_ID"]
REGION = os.environ.get("GCP_REGION", "us-central1")
ENDPOINT_ID = os.environ["VERTEX_ENDPOINT_ID"]
vertexai.init(project=PROJECT_ID, location=REGION)
# Retry on rate limits and transient 5xx — exponential backoff caps at 60s
@retry(
retry=retry_if_exception_type((ResourceExhausted, ServiceUnavailable)),
wait=wait_exponential(multiplier=1, min=2, max=60),
stop=stop_after_attempt(5),
)
def generate_streaming(prompt: str, max_tokens: int = 2048) -> str:
model = GenerativeModel(
# Use endpoint ID to route through your dedicated endpoint
model_name=f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{ENDPOINT_ID}",
)
config = GenerationConfig(
max_output_tokens=max_tokens,
temperature=0.2, # Lower = more deterministic; good for code and structured output
top_p=0.95,
)
full_response = []
# Stream chunks as they arrive — reduces time-to-first-token for users
for chunk in model.generate_content(
prompt,
generation_config=config,
stream=True,
):
if chunk.text:
full_response.append(chunk.text)
print(chunk.text, end="", flush=True)
return "".join(full_response)
if __name__ == "__main__":
result = generate_streaming(
"Explain the difference between Gemini 2.0 Flash and Pro in 3 bullet points."
)
    print(f"\n\nApproximate word count: {len(result.split())}")  # word count, not token count
Step 7: Set Up VPC Service Controls (Production Only)
VPC Service Controls prevent data exfiltration — Gemini API calls that don't originate from your VPC are rejected.
# Create an access policy (org-level, do once)
gcloud access-context-manager policies create \
--organization=$ORG_ID \
--title="prod-access-policy"
export POLICY_NAME=$(gcloud access-context-manager policies list \
--organization=$ORG_ID --format="value(name)")
# Create a service perimeter around your AI project
gcloud access-context-manager perimeters create gemini-prod-perimeter \
--policy=$POLICY_NAME \
--title="Gemini Production Perimeter" \
--resources="projects/$PROJECT_NUMBER" \
--restricted-services="aiplatform.googleapis.com" \
--access-levels=""
Important: Test the perimeter in dry-run mode first (via the gcloud access-context-manager perimeters dry-run command group) or you'll lock yourself out.
Verification
# Test a prediction directly from gcloud
gcloud ai endpoints predict $ENDPOINT_ID \
--region=$REGION \
--json-request='{
"instances": [{
"content": "What is 2 + 2? Reply with just the number."
}]
}' \
--project=$PROJECT_ID
You should see: A JSON response with "4" in under 2 seconds.
Check autoscaler activity:
# Watch replica count respond to load
gcloud ai operations list \
--region=$REGION \
--filter="metadata.endpointId=$ENDPOINT_ID" \
--sort-by="~createTime" \
--limit=10
Check traffic split is active:
gcloud ai endpoints describe $ENDPOINT_ID \
--region=$REGION \
--format="yaml(trafficSplit)"
You should see: Flash at 95%, Pro at 5%.
Production Considerations
Cost: Gemini 2.0 Flash on a dedicated endpoint with 2 minimum replicas runs roughly $0.35/hr at idle. Set min-replica-count=0 only for dev — cold starts on dedicated endpoints take 45–90 seconds and will time out user requests.
Quotas: The default online-prediction-requests-per-minute quota is 600 RPM per region. For sustained high throughput, request a quota increase before go-live — not after.
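As a quick sanity check before go-live, compare your expected peak request rate against the regional quota with some headroom. A back-of-envelope sketch — the 80% headroom threshold is an arbitrary illustration, not a Google recommendation:

```python
# Headroom check against the default 600 RPM online-prediction quota.
DEFAULT_QUOTA_RPM = 600

def needs_quota_increase(expected_peak_rpm: int,
                         quota_rpm: int = DEFAULT_QUOTA_RPM,
                         headroom: float = 0.8) -> bool:
    """Flag workloads that would push past `headroom` of the regional quota."""
    return expected_peak_rpm > quota_rpm * headroom

print(needs_quota_increase(400))  # False — fits within 80% of 600 RPM
print(needs_quota_increase(550))  # True — request an increase before go-live
```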
Observability: Enable Vertex AI model monitoring in the Cloud Console under your endpoint. It captures input/output distributions and will alert on input drift or unexpected output shifts.
Fallback strategy: If Vertex AI returns a 503, fall back to the generativelanguage.googleapis.com endpoint with a separate API key. Keep the fallback in a circuit breaker — don't let a Vertex outage cascade into unbounded calls to the public endpoint.
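The circuit-breaker pattern above can be sketched in a few dozen lines. This is a minimal illustration — the class, thresholds, and function names are all hypothetical, and production code would more likely use an established library (e.g. pybreaker) or a service mesh policy:

```python
# Minimal circuit breaker for the Vertex → public-endpoint fallback.
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; probe again after `reset_after` seconds."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Half-open: let one probe request through
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

def generate_with_fallback(prompt: str, vertex_call, fallback_call,
                           breaker: CircuitBreaker) -> str:
    """Try Vertex first; while the breaker is open, go straight to the fallback."""
    if breaker.allow():
        try:
            result = vertex_call(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback_call(prompt)
```

The key property: once the breaker opens, Vertex is not called at all until the reset window elapses, so an outage produces a bounded number of failed calls rather than a retry storm.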
What You Learned
- Dedicated endpoints give reserved throughput and private VPC routing — worth the idle cost in production
- Traffic splitting lets you A/B test Flash vs Pro without a code change
- Workload Identity is the correct credential pattern for GKE — no JSON keys
- Stream responses to cut perceived latency; retry only on ResourceExhausted and ServiceUnavailable
Limitation: Dedicated endpoints don't support gemini-2.0-flash-exp or experimental model IDs — pin to a stable version (-001 suffix) or the endpoint deploy will fail.
Tested on Vertex AI SDK 1.52.0, Python 3.12, GKE 1.29, us-central1 region