Problem: Adding AI Models Without Managing Infrastructure
You need to integrate Claude, Llama, or other foundation models into your backend API, but don't want to manage GPU instances, model hosting, or scaling complexity.
You'll learn how to:
- Call AWS Bedrock models from Node.js and Python backends
- Implement streaming responses for real-time UX
- Add error handling and cost controls
- Handle rate limits with retry logic
Time: 20 min | Level: Intermediate
Why Use AWS Bedrock
AWS Bedrock provides serverless access to foundation models from Anthropic, Meta, Amazon, and others through a single API. You pay only for tokens used—no infrastructure management required.
Common use cases:
- Chat interfaces in SaaS applications
- Document summarization pipelines
- Content generation APIs
- Code analysis tools
What you need:
- AWS account with Bedrock access (request model access in console)
- Node.js 20+ or Python 3.11+
- Basic understanding of async/await
Solution
Step 1: Enable Model Access
# Check if you have Bedrock access
aws bedrock list-foundation-models --region us-east-1
Expected: JSON list of available models (Claude 3.5 Sonnet, Llama 3.2, etc.)
If it fails:
- Error: "AccessDeniedException": Go to AWS Console → Bedrock → Model access → Request access for models you need
- No models listed: You're in a region without Bedrock (use us-east-1 or us-west-2)
Step 2: Install SDK and Configure Credentials
Node.js:
npm install @aws-sdk/client-bedrock-runtime
Python (ideally inside a virtual environment):
pip install boto3
Configure AWS credentials:
# Option 1: Environment variables (good for local development)
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
# Option 2: Use IAM role (for EC2/Lambda/ECS)
# No credentials needed - SDK auto-detects
Security note: Never hardcode credentials. Use environment variables or IAM roles.
Step 3: Basic Model Invocation (Node.js)
// bedrock-client.ts
import {
  BedrockRuntimeClient,
  InvokeModelCommand
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

async function invokeClaude(prompt: string): Promise<string> {
  const payload = {
    anthropic_version: "bedrock-2023-05-31",
    // Cost control: cap the output length
    max_tokens: 1024,
    messages: [
      { role: "user", content: prompt }
    ],
    // Sampling parameters (these are the defaults)
    temperature: 1.0,
    top_p: 0.999
  };

  const command = new InvokeModelCommand({
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify(payload)
  });

  try {
    const response = await client.send(command);
    const result = JSON.parse(
      new TextDecoder().decode(response.body)
    );
    // Claude can return multiple content blocks; take the first text block
    return result.content[0].text;
  } catch (error: any) {
    if (error.name === "ThrottlingException") {
      // Rate limit hit - implement exponential backoff (Step 6)
      throw new Error("Rate limit exceeded. Retry with backoff");
    }
    throw error;
  }
}

// Usage
const answer = await invokeClaude("Explain async/await in 2 sentences");
console.log(answer);
Why this works: Bedrock uses the standard Messages API format. The SDK handles authentication and request signing automatically.
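The request body is plain JSON, so you can build and inspect it independently of any SDK. A minimal sketch of the Messages-format payload (the helper name `build_claude_payload` is mine, not part of any library):

```python
import json

def build_claude_payload(prompt: str, max_tokens: int = 1024) -> str:
    """Build the Messages API request body Bedrock expects for Anthropic models."""
    payload = {
        "anthropic_version": "bedrock-2023-05-31",  # required by Bedrock
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(payload)

print(build_claude_payload("Explain async/await in 2 sentences"))
```

The same string is what gets passed as `body` in both the Node.js and Python examples.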
Step 4: Implement Streaming (Real-Time Responses)
// streaming-client.ts
import {
  BedrockRuntimeClient,
  InvokeModelWithResponseStreamCommand
} from "@aws-sdk/client-bedrock-runtime";

async function streamClaude(
  prompt: string,
  onChunk: (text: string) => void
): Promise<void> {
  const client = new BedrockRuntimeClient({ region: "us-east-1" });

  const payload = {
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: 2048,
    messages: [{ role: "user", content: prompt }]
  };

  const command = new InvokeModelWithResponseStreamCommand({
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify(payload)
  });

  const response = await client.send(command);
  if (!response.body) {
    throw new Error("No response stream received");
  }

  // Process the AWS event stream
  for await (const event of response.body) {
    if (event.chunk?.bytes) {
      const chunk = JSON.parse(
        new TextDecoder().decode(event.chunk.bytes)
      );

      // Extract text delta from content blocks
      if (chunk.type === "content_block_delta") {
        const text = chunk.delta?.text || "";
        onChunk(text); // Send to client immediately
      }

      // Handle errors in stream
      if (chunk.type === "error") {
        throw new Error(`Stream error: ${chunk.error.message}`);
      }
    }
  }
}

// Express.js route example
app.post("/api/chat", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  try {
    await streamClaude(req.body.prompt, (text) => {
      res.write(`data: ${JSON.stringify({ text })}\n\n`);
    });
  } catch (error) {
    // Surface stream failures to the client instead of leaving the request hanging
    res.write(`data: ${JSON.stringify({ error: "stream failed" })}\n\n`);
  }
  res.end();
});
Expected: Client receives text chunks as they're generated, creating a typewriter effect.
Performance tip: Streaming dramatically reduces perceived latency: users see the first tokens within a few hundred milliseconds instead of waiting seconds for the complete response.
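The `data: {...}` frames the route writes follow the Server-Sent Events wire format, so any client can reassemble the text by splitting on blank lines. A minimal parsing sketch (`parse_sse_frames` is a hypothetical helper, not part of any SDK):

```python
import json

def parse_sse_frames(raw: str):
    """Yield the JSON payload of each `data:` frame in an SSE body."""
    for frame in raw.split("\n\n"):
        frame = frame.strip()
        if frame.startswith("data: "):
            yield json.loads(frame[len("data: "):])

# Two frames as the /api/chat route above would emit them
body = 'data: {"text": "Hel"}\n\ndata: {"text": "lo"}\n\n'
print("".join(frame["text"] for frame in parse_sse_frames(body)))  # Hello
```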
Step 5: Python Implementation
# bedrock_client.py
import boto3
import json
from typing import Generator

class BedrockClient:
    def __init__(self):
        self.client = boto3.client(
            service_name="bedrock-runtime",
            region_name="us-east-1"
        )

    def invoke_claude(self, prompt: str) -> str:
        """Non-streaming invocation"""
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}]
        }
        response = self.client.invoke_model(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            contentType="application/json",
            accept="application/json",
            body=json.dumps(payload)
        )
        result = json.loads(response["body"].read())
        return result["content"][0]["text"]

    def stream_claude(self, prompt: str) -> Generator[str, None, None]:
        """Streaming invocation"""
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2048,
            "messages": [{"role": "user", "content": prompt}]
        }
        response = self.client.invoke_model_with_response_stream(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            contentType="application/json",
            accept="application/json",
            body=json.dumps(payload)
        )
        stream = response.get("body")
        if not stream:
            raise ValueError("No response stream")
        for event in stream:
            chunk = event.get("chunk")
            if chunk:
                data = json.loads(chunk.get("bytes").decode())
                if data["type"] == "content_block_delta":
                    yield data["delta"].get("text", "")
                elif data["type"] == "error":
                    raise Exception(f"Stream error: {data['error']}")

# Usage
client = BedrockClient()

# Non-streaming
answer = client.invoke_claude("What is Python's GIL?")
print(answer)

# Streaming
for chunk in client.stream_claude("Explain generators"):
    print(chunk, end="", flush=True)
Why Python: Simpler for data science teams. boto3 handles auth automatically via ~/.aws/credentials or IAM roles.
Step 6: Add Error Handling and Retries
// robust-client.ts
import { setTimeout } from "timers/promises";
// invokeClaude is the function from Step 3 (export it from bedrock-client.ts)

async function invokeWithRetry(
  prompt: string,
  maxRetries = 3
): Promise<string> {
  let lastError: Error;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await invokeClaude(prompt);
    } catch (error: any) {
      lastError = error;

      // Don't retry client errors
      if (error.$metadata?.httpStatusCode === 400) {
        throw new Error(`Invalid request: ${error.message}`);
      }

      // Exponential backoff for rate limits
      if (error.name === "ThrottlingException") {
        const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
        console.log(`Rate limited. Retrying in ${delay}ms...`);
        await setTimeout(delay);
        continue;
      }

      // Retry server errors after a short delay
      if (error.$metadata?.httpStatusCode >= 500) {
        await setTimeout(1000);
        continue;
      }

      throw error; // Unknown error
    }
  }

  throw lastError!;
}
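The same backoff pattern carries over to the Python client, here simplified to handle throttling only. A sketch (the `invoke_with_retry` helper and `base_delay` knob are my own names; a real boto3 throttle surfaces as a botocore `ClientError` whose `response["Error"]["Code"]` is `"ThrottlingException"`):

```python
import time

def invoke_with_retry(invoke, prompt, max_retries=3, base_delay=1.0):
    """Retry invoke(prompt), backing off exponentially on throttling."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return invoke(prompt)
        except Exception as e:
            last_error = e
            # botocore ClientError carries the code in e.response["Error"]["Code"]
            code = getattr(e, "response", {}).get("Error", {}).get("Code", "")
            if code != "ThrottlingException":
                raise  # don't retry client errors
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s with defaults
    raise last_error
```

Usage: `invoke_with_retry(client.invoke_claude, "Say hi")` with the `BedrockClient` from Step 5.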
If it fails:
- ValidationException: Check your payload format matches model requirements
- ResourceNotFoundException: Model ID is wrong or not enabled in your region
- ServiceQuotaExceededException: You hit your account limits - request increase in AWS Console
Step 7: Cost Control and Monitoring
// cost-tracking.ts
// Reuses `client` and InvokeModelCommand from Step 3

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  estimatedCost: number;
}

async function invokeWithCostTracking(
  prompt: string
): Promise<{ response: string; usage: TokenUsage }> {
  const startTime = Date.now();

  const command = new InvokeModelCommand({
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({
      anthropic_version: "bedrock-2023-05-31",
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }]
    })
  });

  const response = await client.send(command);
  const result = JSON.parse(new TextDecoder().decode(response.body));

  // Extract usage from the response
  const usage: TokenUsage = {
    inputTokens: result.usage.input_tokens,
    outputTokens: result.usage.output_tokens,
    // Claude 3.5 Sonnet pricing: $3/1M input, $15/1M output
    estimatedCost: (
      (result.usage.input_tokens / 1_000_000 * 3) +
      (result.usage.output_tokens / 1_000_000 * 15)
    )
  };

  // Log to CloudWatch or your monitoring system
  console.log({
    model: "claude-3-5-sonnet",
    latency: Date.now() - startTime,
    ...usage
  });

  return {
    response: result.content[0].text,
    usage
  };
}
Cost optimization tips:
- Set max_tokens to the minimum needed (each model has a different ceiling)
- Use cheaper models for simple tasks (Haiku for classification, Sonnet for reasoning)
- Use prompt caching for repeated system prompts where the model supports it (support has been rolling out to Bedrock; check current docs)
- Monitor usage with AWS Cost Explorer → Filter by "Bedrock"
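Centralizing prices in one lookup keeps cost estimates consistent across services. A sketch (the Sonnet figures come from Step 7 above; the Haiku row is a placeholder to verify against the Bedrock pricing page):

```python
# USD per 1M tokens: (input, output). Sonnet figures are from Step 7;
# the Haiku row is a placeholder -- confirm on the Bedrock pricing page.
PRICES = {
    "anthropic.claude-3-5-sonnet-20241022-v2:0": (3.00, 15.00),
    "anthropic.claude-3-5-haiku-20241022-v1:0": (0.80, 4.00),
}

def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost from the usage block Bedrock returns."""
    in_price, out_price = PRICES[model_id]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price
```

Feed it `result["usage"]["input_tokens"]` and `result["usage"]["output_tokens"]` from the response.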
Verification
Test non-streaming (export invokeClaude from bedrock-client.ts and compile the TypeScript first):
node -e "import('./bedrock-client.js').then(m => m.invokeClaude('Say hi')).then(console.log)"
Test streaming:
curl -X POST http://localhost:3000/api/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "Count to 5 slowly"}'
You should see:
- Non-streaming: Complete response after 2-3 seconds
- Streaming: Chunks arriving every 100-300ms
Production Checklist
Security:
- Use IAM roles instead of access keys in production
- Add request validation (max prompt length, rate limiting per user)
- Enable CloudTrail logging for Bedrock API calls
- Use VPC endpoints for private network access
Performance:
- Implement connection pooling for high-traffic APIs
- Add circuit breakers for downstream failures
- Cache responses for identical requests (Redis + hash of prompt)
- Set timeouts (30s for non-streaming, 60s for streaming)
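The response-cache idea from the checklist can be sketched with an in-memory dict standing in for Redis (the class and method names here are mine):

```python
import hashlib

class ResponseCache:
    """In-memory stand-in for the Redis cache suggested above."""

    def __init__(self):
        self._store = {}

    def _key(self, model_id: str, prompt: str) -> str:
        # Hash model + prompt so identical requests share one entry
        return hashlib.sha256(f"{model_id}\x00{prompt}".encode()).hexdigest()

    def get_or_invoke(self, model_id: str, prompt: str, invoke):
        key = self._key(model_id, prompt)
        if key not in self._store:
            self._store[key] = invoke(prompt)  # only called on a cache miss
        return self._store[key]
```

In production you would swap the dict for Redis with a TTL, since model outputs can go stale.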
Cost Management:
- Set CloudWatch alarms for monthly spend thresholds
- Implement per-user rate limits
- Use cheaper models (Haiku) for simple tasks
- Log token usage to track expensive prompts
What You Learned
- AWS Bedrock provides serverless access to multiple foundation models through one API
- Streaming significantly improves perceived performance for end users
- Proper error handling and retries are critical for production reliability
- Token usage tracking prevents unexpected costs
Limitations:
- Bedrock adds ~50-100ms latency vs. direct API calls (AWS network overhead)
- Model selection varies by region (us-east-1 has most options)
- Prompt caching support lags the direct Anthropic API (verify availability for your model and region)
When NOT to use Bedrock:
- You need cutting-edge models on day one (Bedrock availability can lag direct APIs by weeks)
- You're already heavily invested in another cloud (GCP Vertex, Azure OpenAI)
- You need fine-tuned models (limited support currently)
Alternative Models Available
// Other models you can use with the same code pattern
const modelIds = {
  // Anthropic
  claudeOpus: "anthropic.claude-3-opus-20240229-v1:0",
  claudeSonnet: "anthropic.claude-3-5-sonnet-20241022-v2:0",
  claudeHaiku: "anthropic.claude-3-5-haiku-20241022-v1:0",
  // Meta
  llama32_90B: "meta.llama3-2-90b-instruct-v1:0",
  llama32_11B: "meta.llama3-2-11b-instruct-v1:0",
  // Amazon
  titanText: "amazon.titan-text-premier-v1:0",
  // Mistral
  mistralLarge: "mistral.mistral-large-2407-v1:0"
};
Model selection guide:
- Complex reasoning, coding: Claude 3.5 Sonnet or Claude 3 Opus
- Fast responses, simple tasks: Claude 3.5 Haiku
- Open source preference: Llama 3.2 90B
- Budget-conscious: Titan Text (cheapest)
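The guide above can be encoded as a small routing table so call sites never hard-code model IDs (the task names and the Sonnet fallback are illustrative choices):

```python
# Illustrative task categories mapped to the model IDs listed above
MODEL_BY_TASK = {
    "reasoning": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "classification": "anthropic.claude-3-5-haiku-20241022-v1:0",
    "cheap_bulk": "amazon.titan-text-premier-v1:0",
}

def pick_model(task: str) -> str:
    # Fall back to Sonnet for unrecognized task types
    return MODEL_BY_TASK.get(task, MODEL_BY_TASK["reasoning"])
```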
Common Issues
Problem: "ModelNotReadyException"
Solution: Model is still loading (cold start). Retry after 10 seconds.
Problem: "ThrottlingException" on every request
Solution: You hit account limits. Request quota increase: AWS Console → Service Quotas → Bedrock → Tokens per minute
Problem: Streaming stops mid-response
Solution: Response exceeded max_tokens. Increase limit or implement continuation logic.
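Continuation logic can be sketched as: when stop_reason is "max_tokens", resend the conversation with the partial answer as an assistant turn, so the model picks up where it stopped. Here `invoke_with_continuation` is a hypothetical helper and `invoke` stands in for a real Bedrock call:

```python
def invoke_with_continuation(invoke, prompt, max_rounds=3):
    """Stitch together a reply that was cut off at max_tokens.

    invoke(messages) stands in for a real Bedrock call and must return a
    Claude-shaped dict: {"content": [{"text": ...}], "stop_reason": ...}.
    """
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        result = invoke(messages)
        parts.append(result["content"][0]["text"])
        if result["stop_reason"] != "max_tokens":
            break
        # Resend with the partial answer as an assistant turn so the
        # model continues from where it stopped
        messages = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "".join(parts)},
        ]
    return "".join(parts)
```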
Problem: High latency (5+ seconds)
Solution: Use streaming or switch to smaller model (Haiku). Check if you're in a region far from Bedrock endpoints.
Tested on AWS Bedrock (Feb 2026), Node.js 22.x, Python 3.12, Claude 3.5 Sonnet v2