Problem: Gemini 2.0 in Dev Works — Production Breaks Under Load
You've tested Gemini 2.0 with direct API calls. Then you hit production: rate limits, cold starts, missing IAM permissions, and no fallback when the model endpoint throws a 503.
Deploying Gemini 2.0 at scale on Vertex AI means wiring together autoscaling, VPC service controls, dedicated endpoints, and observability — none of which is obvious from the quickstart docs.
You'll learn:
- Provision a Vertex AI dedicated endpoint for Gemini 2.0 Flash and Pro
- Configure autoscaling and traffic splitting between model versions
- Lock down access with Workload Identity and VPC Service Controls
- Stream responses with the Python SDK and handle retries correctly
Time: 30 min | Difficulty: Advanced
Why Direct API Calls Don't Scale
The generativelanguage.googleapis.com endpoint is rate-limited per project and shared across all callers. At scale you need:
- Dedicated throughput — Vertex AI endpoints give you reserved capacity
- Private networking — route traffic over VPC without leaving Google's backbone
- Per-service IAM — Workload Identity binds a K8s service account to a GCP service account, no JSON keys
- Traffic splitting — shift load between gemini-2.0-flash and gemini-2.0-pro without a redeploy
A Vertex AI dedicated endpoint gives you all four.
Solution
Step 1: Enable APIs and Set Project Variables
# Set your project once — used in every command below
export PROJECT_ID="your-project-id"
export REGION="us-central1" # Gemini 2.0 is available in us-central1, europe-west4, asia-northeast1
gcloud config set project $PROJECT_ID
# Enable required APIs
gcloud services enable \
aiplatform.googleapis.com \
compute.googleapis.com \
container.googleapis.com \
cloudresourcemanager.googleapis.com \
--project=$PROJECT_ID
Expected output:
Operation "operations/acf...." finished successfully.
If it fails:
PERMISSION_DENIED → You need roles/serviceusage.serviceUsageAdmin on the project
Step 2: Create a Dedicated Vertex AI Endpoint
# Dedicated endpoints give you reserved capacity and a private DNS name
gcloud ai endpoints create \
--region=$REGION \
--display-name="gemini-prod-endpoint" \
--network="projects/$PROJECT_ID/global/networks/default" \
--project=$PROJECT_ID
# Capture the endpoint ID for the next steps
export ENDPOINT_ID=$(gcloud ai endpoints list \
--region=$REGION \
--filter="displayName=gemini-prod-endpoint" \
--format="value(name)" | awk -F'/' '{print $NF}')
echo "Endpoint ID: $ENDPOINT_ID"
The --network flag routes traffic privately over VPC. Remove it only for public dev endpoints — never in production.
Step 3: Deploy Gemini 2.0 Flash as the Primary Model
# Deploy Flash as primary (95% traffic) — faster, cheaper, handles most requests
gcloud ai endpoints deploy-model $ENDPOINT_ID \
--region=$REGION \
--model="publishers/google/models/gemini-2.0-flash-001" \
--display-name="gemini-flash-primary" \
--traffic-split="0=95" \
--min-replica-count=2 \
--max-replica-count=20 \
--machine-type="n1-standard-4" \
--project=$PROJECT_ID
Expected: Deployment takes 3–5 minutes. You'll see deployedModelId in the output — save it.
export FLASH_MODEL_ID="<deployedModelId from output>"
If it fails:
Quota exceeded → Request a quota increase for custom-model-training-us-central1 in the Cloud Console
Model not found → Confirm Gemini 2.0 is available in your region: gcloud ai models list --region=$REGION --filter="publisher=google"
Step 4: Deploy Gemini 2.0 Pro for Complex Requests
# Pro handles the 5% of requests needing deeper reasoning
gcloud ai endpoints deploy-model $ENDPOINT_ID \
--region=$REGION \
--model="publishers/google/models/gemini-2.0-pro-exp-02-05" \
--display-name="gemini-pro-fallback" \
--traffic-split="$FLASH_MODEL_ID=95,0=5" \
--min-replica-count=1 \
--max-replica-count=5 \
--machine-type="n1-standard-8" \
--project=$PROJECT_ID
The traffic-split key 0 means "this new model". The existing Flash model ($FLASH_MODEL_ID) gets 95% and Pro gets 5%.
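Once both models are deployed, you can rebalance the split later without redeploying. A minimal sketch: the validation helper below is illustrative (not part of this guide's commands), the deployed-model IDs are placeholders, and the commented-out `Endpoint.update` call assumes the google-cloud-aiplatform SDK's traffic_split parameter.

```python
# Illustrative helper: sanity-check a proposed traffic split before applying it.
# Keys are deployedModelId values (placeholders here); values are percentages.

def validate_traffic_split(split: dict[str, int]) -> dict[str, int]:
    """Reject splits that don't sum to exactly 100 percent."""
    if any(v < 0 for v in split.values()):
        raise ValueError("Traffic percentages must be non-negative")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"Traffic split must sum to 100, got {total}")
    return split

# Example: shift Flash from 95% down to 80%, Pro up to 20%
new_split = validate_traffic_split({"flash-model-id": 80, "pro-model-id": 20})
print(new_split)

# Applying it requires GCP credentials — shown for illustration only:
# from google.cloud import aiplatform
# endpoint = aiplatform.Endpoint(endpoint_name=ENDPOINT_ID)
# endpoint.update(traffic_split=new_split)
```

Validating the split client-side fails fast instead of waiting for the API to reject a malformed update.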
Step 5: Configure Workload Identity for Your App
Never use a downloaded JSON service account key in production. Workload Identity binds your GKE pod's service account directly to a GCP service account.
# Create a dedicated GCP service account for Gemini calls
gcloud iam service-accounts create gemini-inference-sa \
--display-name="Gemini Inference Service Account" \
--project=$PROJECT_ID
# Grant it the Vertex AI User role (predict only — no admin)
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:gemini-inference-sa@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/aiplatform.user"
# Allow your Kubernetes service account to impersonate it
# Replace NAMESPACE and KSA_NAME with your actual values
export NAMESPACE="production"
export KSA_NAME="gemini-app"
gcloud iam service-accounts add-iam-policy-binding \
gemini-inference-sa@$PROJECT_ID.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="serviceAccount:$PROJECT_ID.svc.id.goog[$NAMESPACE/$KSA_NAME]"
Then annotate your Kubernetes service account:
# k8s/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: gemini-app
namespace: production
annotations:
# This is what links the K8s SA to the GCP SA
iam.gke.io/gcp-service-account: gemini-inference-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com
Step 6: Call the Endpoint with Streaming and Retry Logic
# requirements: google-cloud-aiplatform>=1.50.0, tenacity
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from google.api_core.exceptions import ResourceExhausted, ServiceUnavailable
import os
PROJECT_ID = os.environ["GCP_PROJECT_ID"]
REGION = os.environ.get("GCP_REGION", "us-central1")
ENDPOINT_ID = os.environ["VERTEX_ENDPOINT_ID"]
vertexai.init(project=PROJECT_ID, location=REGION)
# Retry on rate limits and transient 5xx — exponential backoff caps at 60s
@retry(
retry=retry_if_exception_type((ResourceExhausted, ServiceUnavailable)),
wait=wait_exponential(multiplier=1, min=2, max=60),
stop=stop_after_attempt(5),
)
def generate_streaming(prompt: str, max_tokens: int = 2048) -> str:
model = GenerativeModel(
# Use endpoint ID to route through your dedicated endpoint
model_name=f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{ENDPOINT_ID}",
)
config = GenerationConfig(
max_output_tokens=max_tokens,
temperature=0.2, # Lower = more deterministic; good for code and structured output
top_p=0.95,
)
full_response = []
# Stream chunks as they arrive — reduces time-to-first-token for users
for chunk in model.generate_content(
prompt,
generation_config=config,
stream=True,
):
if chunk.text:
full_response.append(chunk.text)
print(chunk.text, end="", flush=True)
return "".join(full_response)
if __name__ == "__main__":
result = generate_streaming(
"Explain the difference between Gemini 2.0 Flash and Pro in 3 bullet points."
)
    print(f"\n\nApproximate word count: {len(result.split())}")  # word count, not token count
Step 7: Set Up VPC Service Controls (Production Only)
VPC Service Controls prevent data exfiltration — Gemini API calls that don't originate from your VPC are rejected.
# Create an access policy (org-level, do once)
gcloud access-context-manager policies create \
--organization=$ORG_ID \
--title="prod-access-policy"
export POLICY_NAME=$(gcloud access-context-manager policies list \
--organization=$ORG_ID --format="value(name)")
# Create a service perimeter around your AI project
gcloud access-context-manager perimeters create gemini-prod-perimeter \
--policy=$POLICY_NAME \
--title="Gemini Production Perimeter" \
--resources="projects/$PROJECT_NUMBER" \
--restricted-services="aiplatform.googleapis.com" \
--access-levels=""
Important: Test the perimeter in dry-run mode first (via the gcloud access-context-manager perimeters dry-run command group) or you'll lock yourself out.
Verification
# Test a prediction directly from gcloud
gcloud ai endpoints predict $ENDPOINT_ID \
--region=$REGION \
--json-request='{
"instances": [{
"content": "What is 2 + 2? Reply with just the number."
}]
}' \
--project=$PROJECT_ID
You should see: A JSON response with "4" in under 2 seconds.
Check autoscaler activity:
# Watch replica count respond to load
gcloud ai operations list \
--region=$REGION \
--filter="metadata.endpointId=$ENDPOINT_ID" \
--sort-by="~createTime" \
--limit=10
Check traffic split is active:
gcloud ai endpoints describe $ENDPOINT_ID \
--region=$REGION \
--format="yaml(trafficSplit)"
You should see: Flash at 95%, Pro at 5%.
Production Considerations
Cost: Gemini 2.0 Flash on a dedicated endpoint with 2 minimum replicas runs roughly $0.35/hr at idle. Set min-replica-count=0 only for dev — cold starts on dedicated endpoints take 45–90 seconds and will time out user requests.
Quotas: The default online-prediction-requests-per-minute quota is 600 RPM per region. For sustained high throughput, request a quota increase before go-live — not after.
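As a quick sanity check before go-live, compare your expected peak request rate against the regional quota with some headroom. A back-of-envelope sketch — the 80% headroom threshold is an arbitrary illustration, not a Google recommendation:

```python
# Headroom check against the default 600 RPM online-prediction quota.
DEFAULT_QUOTA_RPM = 600

def needs_quota_increase(expected_peak_rpm: int,
                         quota_rpm: int = DEFAULT_QUOTA_RPM,
                         headroom: float = 0.8) -> bool:
    """Flag workloads that would push past `headroom` of the regional quota."""
    return expected_peak_rpm > quota_rpm * headroom

print(needs_quota_increase(400))  # False — fits within 80% of 600 RPM
print(needs_quota_increase(550))  # True — request an increase before go-live
```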
Observability: Enable Vertex AI model monitoring in the Cloud Console under your endpoint. It captures input/output distributions and will alert on input drift or unexpected output shifts.
Fallback strategy: If Vertex AI returns a 503, fall back to the generativelanguage.googleapis.com endpoint with a separate API key. Keep the fallback in a circuit breaker — don't let a Vertex outage cascade into unbounded calls to the public endpoint.
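The circuit-breaker pattern above can be sketched in a few dozen lines. This is a minimal illustration — the class, thresholds, and function names are all hypothetical, and production code would more likely use an established library (e.g. pybreaker) or a service mesh policy:

```python
# Minimal circuit breaker for the Vertex → public-endpoint fallback.
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; probe again after `reset_after` seconds."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Half-open: let one probe request through
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

def generate_with_fallback(prompt: str, vertex_call, fallback_call,
                           breaker: CircuitBreaker) -> str:
    """Try Vertex first; while the breaker is open, go straight to the fallback."""
    if breaker.allow():
        try:
            result = vertex_call(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback_call(prompt)
```

The key property: once the breaker opens, Vertex is not called at all until the reset window elapses, so an outage produces a bounded number of failed calls rather than a retry storm.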
What You Learned
- Dedicated endpoints give reserved throughput and private VPC routing — worth the idle cost in production
- Traffic splitting lets you A/B test Flash vs Pro without a code change
- Workload Identity is the correct credential pattern for GKE — no JSON keys
- Stream responses to cut perceived latency; retry only on ResourceExhausted and ServiceUnavailable
Limitation: Dedicated endpoints don't support gemini-2.0-flash-exp or experimental model IDs — pin to a stable version (-001 suffix) or the endpoint deploy will fail.
Tested on Vertex AI SDK 1.52.0, Python 3.12, GKE 1.29, us-central1 region