Problem: Adding AI Models Without Managing Infrastructure
You need to integrate Claude, Llama, or other foundation models into your backend API, but don't want to manage GPU instances, model hosting, or scaling complexity.
You'll learn how to:
- Call AWS Bedrock models from Node.js and Python backends
- Implement streaming responses for real-time UX
- Add error handling and cost controls
- Handle rate limits with retry logic
Time: 20 min | Level: Intermediate
Why Use AWS Bedrock
AWS Bedrock provides serverless access to foundation models from Anthropic, Meta, Amazon, and others through a single API. You pay only for tokens used—no infrastructure management required.
Common use cases:
- Chat interfaces in SaaS applications
- Document summarization pipelines
- Content generation APIs
- Code analysis tools
What you need:
- AWS account with Bedrock access (request model access in console)
- Node.js 20+ or Python 3.11+
- Basic understanding of async/await
Solution
Step 1: Enable Model Access
# Check if you have Bedrock access
aws bedrock list-foundation-models --region us-east-1
Expected: JSON list of available models (Claude 3.5 Sonnet, Llama 3.2, etc.)
If it fails:
- Error: "AccessDeniedException": Go to AWS Console → Bedrock → Model access → Request access for models you need
- No models listed: You're in a region without Bedrock (use us-east-1 or us-west-2)
Step 2: Install SDK and Configure Credentials
Node.js:
npm install @aws-sdk/client-bedrock-runtime
Python (ideally inside a virtual environment):
pip install boto3
Configure AWS credentials:
# Option 1: Environment variables (good for local development)
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
# Option 2: Use IAM role (for EC2/Lambda/ECS)
# No credentials needed - SDK auto-detects
Security note: Never hardcode credentials. Use environment variables or IAM roles.
Step 3: Basic Model Invocation (Node.js)
// bedrock-client.ts
import {
  BedrockRuntimeClient,
  InvokeModelCommand
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

async function invokeClaude(prompt: string): Promise<string> {
  const payload = {
    anthropic_version: "bedrock-2023-05-31",
    // Cost control: cap the output length
    max_tokens: 1024,
    messages: [
      { role: "user", content: prompt }
    ],
    // Sampling parameters (these are the defaults)
    temperature: 1.0,
    top_p: 0.999
  };

  const command = new InvokeModelCommand({
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify(payload)
  });

  try {
    const response = await client.send(command);
    const result = JSON.parse(
      new TextDecoder().decode(response.body)
    );
    // Claude can return multiple content blocks; take the first text block
    return result.content[0].text;
  } catch (error: any) {
    if (error.name === "ThrottlingException") {
      // Rate limit hit - implement exponential backoff (Step 6)
      throw new Error("Rate limit exceeded. Retry with backoff");
    }
    throw error;
  }
}

// Usage
const answer = await invokeClaude("Explain async/await in 2 sentences");
console.log(answer);
Why this works: Bedrock uses the standard Messages API format. The SDK handles authentication and request signing automatically.
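The request body is plain JSON, so you can build and inspect it independently of any SDK. A minimal sketch of the Messages-format payload (the helper name `build_claude_payload` is mine, not part of any library):

```python
import json

def build_claude_payload(prompt: str, max_tokens: int = 1024) -> str:
    """Build the Messages API request body Bedrock expects for Anthropic models."""
    payload = {
        "anthropic_version": "bedrock-2023-05-31",  # required by Bedrock
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(payload)

print(build_claude_payload("Explain async/await in 2 sentences"))
```

The same string is what gets passed as `body` in both the Node.js and Python examples.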
Step 4: Implement Streaming (Real-Time Responses)
// streaming-client.ts
import {
  BedrockRuntimeClient,
  InvokeModelWithResponseStreamCommand
} from "@aws-sdk/client-bedrock-runtime";

async function streamClaude(
  prompt: string,
  onChunk: (text: string) => void
): Promise<void> {
  const client = new BedrockRuntimeClient({ region: "us-east-1" });

  const payload = {
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: 2048,
    messages: [{ role: "user", content: prompt }]
  };

  const command = new InvokeModelWithResponseStreamCommand({
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify(payload)
  });

  const response = await client.send(command);
  if (!response.body) {
    throw new Error("No response stream received");
  }

  // Process the AWS event stream
  for await (const event of response.body) {
    if (event.chunk?.bytes) {
      const chunk = JSON.parse(
        new TextDecoder().decode(event.chunk.bytes)
      );

      // Extract text delta from content blocks
      if (chunk.type === "content_block_delta") {
        const text = chunk.delta?.text || "";
        onChunk(text); // Send to client immediately
      }

      // Handle errors in stream
      if (chunk.type === "error") {
        throw new Error(`Stream error: ${chunk.error.message}`);
      }
    }
  }
}

// Express.js route example
app.post("/api/chat", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  try {
    await streamClaude(req.body.prompt, (text) => {
      res.write(`data: ${JSON.stringify({ text })}\n\n`);
    });
  } catch (error) {
    // Surface stream failures to the client instead of leaving the request hanging
    res.write(`data: ${JSON.stringify({ error: "stream failed" })}\n\n`);
  }
  res.end();
});
Expected: Client receives text chunks as they're generated, creating a typewriter effect.
Performance tip: Streaming dramatically reduces perceived latency: users see the first tokens within a few hundred milliseconds instead of waiting seconds for the complete response.
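The `data: {...}` frames the route writes follow the Server-Sent Events wire format, so any client can reassemble the text by splitting on blank lines. A minimal parsing sketch (`parse_sse_frames` is a hypothetical helper, not part of any SDK):

```python
import json

def parse_sse_frames(raw: str):
    """Yield the JSON payload of each `data:` frame in an SSE body."""
    for frame in raw.split("\n\n"):
        frame = frame.strip()
        if frame.startswith("data: "):
            yield json.loads(frame[len("data: "):])

# Two frames as the /api/chat route above would emit them
body = 'data: {"text": "Hel"}\n\ndata: {"text": "lo"}\n\n'
print("".join(frame["text"] for frame in parse_sse_frames(body)))  # Hello
```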
Step 5: Python Implementation
# bedrock_client.py
import boto3
import json
from typing import Generator

class BedrockClient:
    def __init__(self):
        self.client = boto3.client(
            service_name="bedrock-runtime",
            region_name="us-east-1"
        )

    def invoke_claude(self, prompt: str) -> str:
        """Non-streaming invocation"""
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}]
        }
        response = self.client.invoke_model(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            contentType="application/json",
            accept="application/json",
            body=json.dumps(payload)
        )
        result = json.loads(response["body"].read())
        return result["content"][0]["text"]

    def stream_claude(self, prompt: str) -> Generator[str, None, None]:
        """Streaming invocation"""
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2048,
            "messages": [{"role": "user", "content": prompt}]
        }
        response = self.client.invoke_model_with_response_stream(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            contentType="application/json",
            accept="application/json",
            body=json.dumps(payload)
        )
        stream = response.get("body")
        if not stream:
            raise ValueError("No response stream")
        for event in stream:
            chunk = event.get("chunk")
            if chunk:
                data = json.loads(chunk.get("bytes").decode())
                if data["type"] == "content_block_delta":
                    yield data["delta"].get("text", "")
                elif data["type"] == "error":
                    raise Exception(f"Stream error: {data['error']}")

# Usage
client = BedrockClient()

# Non-streaming
answer = client.invoke_claude("What is Python's GIL?")
print(answer)

# Streaming
for chunk in client.stream_claude("Explain generators"):
    print(chunk, end="", flush=True)
Why Python: Simpler for data science teams. boto3 handles auth automatically via ~/.aws/credentials or IAM roles.
Step 6: Add Error Handling and Retries
// robust-client.ts
import { setTimeout } from "timers/promises";
// invokeClaude is the function from Step 3 (export it from bedrock-client.ts)

async function invokeWithRetry(
  prompt: string,
  maxRetries = 3
): Promise<string> {
  let lastError: Error;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await invokeClaude(prompt);
    } catch (error: any) {
      lastError = error;

      // Don't retry client errors
      if (error.$metadata?.httpStatusCode === 400) {
        throw new Error(`Invalid request: ${error.message}`);
      }

      // Exponential backoff for rate limits
      if (error.name === "ThrottlingException") {
        const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
        console.log(`Rate limited. Retrying in ${delay}ms...`);
        await setTimeout(delay);
        continue;
      }

      // Retry server errors after a short delay
      if (error.$metadata?.httpStatusCode >= 500) {
        await setTimeout(1000);
        continue;
      }

      throw error; // Unknown error
    }
  }

  throw lastError!;
}
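The same backoff pattern carries over to the Python client, here simplified to handle throttling only. A sketch (the `invoke_with_retry` helper and `base_delay` knob are my own names; a real boto3 throttle surfaces as a botocore `ClientError` whose `response["Error"]["Code"]` is `"ThrottlingException"`):

```python
import time

def invoke_with_retry(invoke, prompt, max_retries=3, base_delay=1.0):
    """Retry invoke(prompt), backing off exponentially on throttling."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return invoke(prompt)
        except Exception as e:
            last_error = e
            # botocore ClientError carries the code in e.response["Error"]["Code"]
            code = getattr(e, "response", {}).get("Error", {}).get("Code", "")
            if code != "ThrottlingException":
                raise  # don't retry client errors
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s with defaults
    raise last_error
```

Usage: `invoke_with_retry(client.invoke_claude, "Say hi")` with the `BedrockClient` from Step 5.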
If it fails:
- ValidationException: Check your payload format matches model requirements
- ResourceNotFoundException: Model ID is wrong or not enabled in your region
- ServiceQuotaExceededException: You hit your account limits - request increase in AWS Console
Step 7: Cost Control and Monitoring
// cost-tracking.ts
// Reuses `client` and InvokeModelCommand from Step 3

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  estimatedCost: number;
}

async function invokeWithCostTracking(
  prompt: string
): Promise<{ response: string; usage: TokenUsage }> {
  const startTime = Date.now();

  const command = new InvokeModelCommand({
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({
      anthropic_version: "bedrock-2023-05-31",
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }]
    })
  });

  const response = await client.send(command);
  const result = JSON.parse(new TextDecoder().decode(response.body));

  // Extract usage from the response
  const usage: TokenUsage = {
    inputTokens: result.usage.input_tokens,
    outputTokens: result.usage.output_tokens,
    // Claude 3.5 Sonnet pricing: $3/1M input, $15/1M output
    estimatedCost: (
      (result.usage.input_tokens / 1_000_000 * 3) +
      (result.usage.output_tokens / 1_000_000 * 15)
    )
  };

  // Log to CloudWatch or your monitoring system
  console.log({
    model: "claude-3-5-sonnet",
    latency: Date.now() - startTime,
    ...usage
  });

  return {
    response: result.content[0].text,
    usage
  };
}
Cost optimization tips:
- Set max_tokens to the minimum needed (each model has a different ceiling)
- Use cheaper models for simple tasks (Haiku for classification, Sonnet for reasoning)
- Use prompt caching for repeated system prompts where the model supports it (support has been rolling out to Bedrock; check current docs)
- Monitor usage with AWS Cost Explorer → Filter by "Bedrock"
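Centralizing prices in one lookup keeps cost estimates consistent across services. A sketch (the Sonnet figures come from Step 7 above; the Haiku row is a placeholder to verify against the Bedrock pricing page):

```python
# USD per 1M tokens: (input, output). Sonnet figures are from Step 7;
# the Haiku row is a placeholder -- confirm on the Bedrock pricing page.
PRICES = {
    "anthropic.claude-3-5-sonnet-20241022-v2:0": (3.00, 15.00),
    "anthropic.claude-3-5-haiku-20241022-v1:0": (0.80, 4.00),
}

def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost from the usage block Bedrock returns."""
    in_price, out_price = PRICES[model_id]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price
```

Feed it `result["usage"]["input_tokens"]` and `result["usage"]["output_tokens"]` from the response.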
Verification
Test non-streaming (export invokeClaude from bedrock-client.ts and compile the TypeScript first):
node -e "import('./bedrock-client.js').then(m => m.invokeClaude('Say hi')).then(console.log)"
Test streaming:
curl -X POST http://localhost:3000/api/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "Count to 5 slowly"}'
You should see:
- Non-streaming: Complete response after 2-3 seconds
- Streaming: Chunks arriving every 100-300ms
Production Checklist
Security:
- Use IAM roles instead of access keys in production
- Add request validation (max prompt length, rate limiting per user)
- Enable CloudTrail logging for Bedrock API calls
- Use VPC endpoints for private network access
Performance:
- Implement connection pooling for high-traffic APIs
- Add circuit breakers for downstream failures
- Cache responses for identical requests (Redis + hash of prompt)
- Set timeouts (30s for non-streaming, 60s for streaming)
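The response-cache idea from the checklist can be sketched with an in-memory dict standing in for Redis (the class and method names here are mine):

```python
import hashlib

class ResponseCache:
    """In-memory stand-in for the Redis cache suggested above."""

    def __init__(self):
        self._store = {}

    def _key(self, model_id: str, prompt: str) -> str:
        # Hash model + prompt so identical requests share one entry
        return hashlib.sha256(f"{model_id}\x00{prompt}".encode()).hexdigest()

    def get_or_invoke(self, model_id: str, prompt: str, invoke):
        key = self._key(model_id, prompt)
        if key not in self._store:
            self._store[key] = invoke(prompt)  # only called on a cache miss
        return self._store[key]
```

In production you would swap the dict for Redis with a TTL, since model outputs can go stale.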
Cost Management:
- Set CloudWatch alarms for monthly spend thresholds
- Implement per-user rate limits
- Use cheaper models (Haiku) for simple tasks
- Log token usage to track expensive prompts
What You Learned
- AWS Bedrock provides serverless access to multiple foundation models through one API
- Streaming significantly improves perceived performance for end users
- Proper error handling and retries are critical for production reliability
- Token usage tracking prevents unexpected costs
Limitations:
- Bedrock adds ~50-100ms latency vs. direct API calls (AWS network overhead)
- Model selection varies by region (us-east-1 has most options)
- Prompt caching support lags the direct Anthropic API (verify availability for your model and region)
When NOT to use Bedrock:
- You need cutting-edge models on day one (Bedrock availability can lag direct APIs by weeks)
- You're already heavily invested in another cloud (GCP Vertex, Azure OpenAI)
- You need fine-tuned models (limited support currently)
Alternative Models Available
// Other models you can use with the same code pattern
const modelIds = {
  // Anthropic
  claudeOpus: "anthropic.claude-3-opus-20240229-v1:0",
  claudeSonnet: "anthropic.claude-3-5-sonnet-20241022-v2:0",
  claudeHaiku: "anthropic.claude-3-5-haiku-20241022-v1:0",
  // Meta
  llama32_90B: "meta.llama3-2-90b-instruct-v1:0",
  llama32_11B: "meta.llama3-2-11b-instruct-v1:0",
  // Amazon
  titanText: "amazon.titan-text-premier-v1:0",
  // Mistral
  mistralLarge: "mistral.mistral-large-2407-v1:0"
};
Model selection guide:
- Complex reasoning, coding: Claude 3.5 Sonnet or Claude 3 Opus
- Fast responses, simple tasks: Claude 3.5 Haiku
- Open source preference: Llama 3.2 90B
- Budget-conscious: Titan Text (cheapest)
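The guide above can be encoded as a small routing table so call sites never hard-code model IDs (the task names and the Sonnet fallback are illustrative choices):

```python
# Illustrative task categories mapped to the model IDs listed above
MODEL_BY_TASK = {
    "reasoning": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "classification": "anthropic.claude-3-5-haiku-20241022-v1:0",
    "cheap_bulk": "amazon.titan-text-premier-v1:0",
}

def pick_model(task: str) -> str:
    # Fall back to Sonnet for unrecognized task types
    return MODEL_BY_TASK.get(task, MODEL_BY_TASK["reasoning"])
```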
Common Issues
Problem: "ModelNotReadyException"
Solution: Model is still loading (cold start). Retry after 10 seconds.
Problem: "ThrottlingException" on every request
Solution: You hit account limits. Request quota increase: AWS Console → Service Quotas → Bedrock → Tokens per minute
Problem: Streaming stops mid-response
Solution: Response exceeded max_tokens. Increase limit or implement continuation logic.
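Continuation logic can be sketched as: when stop_reason is "max_tokens", resend the conversation with the partial answer as an assistant turn, so the model picks up where it stopped. Here `invoke_with_continuation` is a hypothetical helper and `invoke` stands in for a real Bedrock call:

```python
def invoke_with_continuation(invoke, prompt, max_rounds=3):
    """Stitch together a reply that was cut off at max_tokens.

    invoke(messages) stands in for a real Bedrock call and must return a
    Claude-shaped dict: {"content": [{"text": ...}], "stop_reason": ...}.
    """
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        result = invoke(messages)
        parts.append(result["content"][0]["text"])
        if result["stop_reason"] != "max_tokens":
            break
        # Resend with the partial answer as an assistant turn so the
        # model continues from where it stopped
        messages = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "".join(parts)},
        ]
    return "".join(parts)
```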
Problem: High latency (5+ seconds)
Solution: Use streaming or switch to smaller model (Haiku). Check if you're in a region far from Bedrock endpoints.
Tested on AWS Bedrock (Feb 2026), Node.js 22.x, Python 3.12, Claude 3.5 Sonnet v2