Problem: Running Open-Source Models Without Managing GPUs
Replicate API lets you deploy and call open-source models — Llama 3.3, SDXL, Whisper, and 50,000+ others — without provisioning a single GPU. You hit an endpoint, you get a result. No CUDA driver hell.
The catch: the docs scatter Python, Node.js, and webhook examples across three pages, and cold-start behavior surprises developers on the free tier. This guide consolidates everything.
You'll learn:
- Authenticate and make your first Replicate API call in under 5 minutes
- Run text, image, and audio models with Python and Node.js
- Handle async predictions and webhooks for production workloads
- Control cost with model versions and USD pricing tiers
Time: 20 min | Difficulty: Intermediate
Why Replicate Exists
Self-hosting a 70B parameter model costs $2–4/hour on a dedicated A100. Most apps don't need dedicated uptime — they need burst inference on demand. Replicate solves this with serverless GPU execution: you pay per second of compute, billed in USD, and the infrastructure scales to zero when idle.
Every model on Replicate runs inside a Cog container — a Docker-based packaging format that pins model weights, Python version, and CUDA version together. This means the model version you call today behaves identically in six months.
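For context, a Cog package is driven by a small YAML manifest plus a predictor class. An illustrative cog.yaml is below; the keys (build.gpu, build.cuda, python_version, python_packages, predict) are Cog's own, but the version numbers here are placeholders, not a recipe for any specific model:

```yaml
build:
  gpu: true
  cuda: "12.1"            # pinned CUDA version
  python_version: "3.11"  # pinned interpreter
  python_packages:
    - "torch==2.1.0"      # pinned dependencies travel with the model
predict: "predict.py:Predictor"
```

Because all of this ships inside the container image, rebuilding the same version later reproduces the same environment.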
Replicate's prediction pipeline: your request routes to a warm or cold Cog container; results stream back or post to a webhook.
Prerequisites
- Python 3.11+ or Node.js 20+
- A Replicate account — free tier includes $0.005 in compute credit on signup
- A REPLICATE_API_TOKEN from replicate.com/account/api-tokens
Solution
Step 1: Install the Client
Python (using uv — recommended):
uv add replicate
Python (pip fallback):
pip install replicate
Node.js:
npm install replicate
Set your token as an environment variable — never hardcode it:
export REPLICATE_API_TOKEN=r8_your_token_here
Expected output: No output. Confirm with echo $REPLICATE_API_TOKEN.
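If you want the missing-token case to fail loudly in your own code rather than as a 401 on the first API call, a small guard works. This is a hypothetical helper, not part of the Replicate client; it relies only on the r8_ prefix shown above:

```python
import os


def require_token() -> str:
    """Fail fast if the token is absent or malformed; otherwise the
    problem only surfaces later as a 401 on the first API call."""
    token = os.environ.get("REPLICATE_API_TOKEN", "")
    if not token.startswith("r8_"):
        raise RuntimeError("REPLICATE_API_TOKEN is missing or malformed")
    return token
```

Call require_token() once at startup so misconfigured deploys die immediately.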
Step 2: Run Your First Model (Llama 3.3 70B)
The replicate.run() call is synchronous and blocks until the prediction completes. Use it for scripts and quick tests.
Python:
import replicate
output = replicate.run(
# Unpinned model name: resolves to the latest version (Step 3 shows how to pin a SHA)
"meta/meta-llama-3-70b-instruct",
input={
"prompt": "Explain transformer attention in one paragraph.",
"max_new_tokens": 256,
"temperature": 0.7,
}
)
# output is a generator — join chunks as they stream
print("".join(output))
Node.js:
import Replicate from "replicate";
const replicate = new Replicate();
const output = await replicate.run(
"meta/meta-llama-3-70b-instruct",
{
input: {
prompt: "Explain transformer attention in one paragraph.",
max_new_tokens: 256,
temperature: 0.7,
},
}
);
// Node client returns an async iterator
for await (const chunk of output) {
process.stdout.write(chunk);
}
Expected output: Streamed text in your terminal within 2–8 seconds on a warm container.
If it fails:
- 401 Unauthorized → REPLICATE_API_TOKEN not exported in the current shell session. Run export again.
- 422 Unprocessable Entity → Invalid input key. Check the model's input schema at replicate.com/meta/meta-llama-3-70b-instruct/api.
- "Model is booting" in the response → Cold start. The free tier can take 30–90 seconds on first call. Retry after the boot completes.
Step 3: Pin a Model Version
Model names like "meta/meta-llama-3-70b-instruct" always resolve to the latest version. In production, pin the version SHA to lock behavior:
output = replicate.run(
# SHA from the model's "Versions" tab on replicate.com
"meta/meta-llama-3-70b-instruct:dp-cf05cc8e6e5b3c0bc6e1b6c9f5e4a2d8",
input={"prompt": "..."}
)
Find any model's version SHA under the Versions tab on its Replicate page.
Step 4: Run an Image Generation Model (SDXL)
Image models return a list of file URLs, not text. The URL expires after 1 hour — download immediately if you need to persist it.
import replicate
import httpx
from pathlib import Path
output = replicate.run(
"stability-ai/sdxl:7762fd07cf82c948538e41f63f77d685e02b063e37291ef63919ea9f8f6e9b5",
input={
"prompt": "A neon-lit Tokyo alley at midnight, cinematic, 8k",
"negative_prompt": "blurry, low quality, watermark",
"width": 1024,
"height": 1024,
"num_inference_steps": 30, # 20–30 steps balances quality and cost
"guidance_scale": 7.5,
}
)
# output[0] is the image URL
image_url = output[0]
image_bytes = httpx.get(image_url).content
Path("output.png").write_bytes(image_bytes)
print(f"Saved to output.png — source URL expires in 1hr: {image_url}")
Expected output: output.png written to disk, a ~2–4 MB PNG.
Step 5: Run an Audio Model (Whisper)
File inputs use the replicate.files.create() helper or accept a public URL directly:
import replicate
# Option A — pass a public URL (fastest)
output = replicate.run(
"openai/whisper:4d50797290df275329f202e48c76360b3f22b08d28c196cbc54600319435f8d",
input={
"audio": "https://upload.wikimedia.org/wikipedia/commons/7/7c/Agwanta.ogg",
"model": "large-v3",
"language": "en",
"transcription": "plain text",
}
)
print(output["transcription"])
# Option B — upload a local file
with open("my_audio.mp3", "rb") as f:
audio_file = replicate.files.create(f)
output = replicate.run(
"openai/whisper:4d50797290df275329f202e48c76360b3f22b08d28c196cbc54600319435f8d",
input={"audio": audio_file.urls["get"], "model": "large-v3"}
)
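A third option for small files: Replicate's HTTP API also accepts base64 data URIs as file inputs (intended for inputs up to a few hundred KB; check the current limit in Replicate's docs). The helper below is a stdlib-only sketch:

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(path: str) -> str:
    """Inline a small file as a data URI; for anything larger,
    use replicate.files.create() as shown in Option B above."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{encoded}"
```

Pass the result directly as the "audio" input value.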
Step 6: Async Predictions for Production
replicate.run() blocks the calling thread — bad for web servers. Use predictions.create() instead and poll or handle via webhook.
Create and poll:
import replicate
import time
prediction = replicate.predictions.create(
model="meta/meta-llama-3-70b-instruct",
input={"prompt": "Summarize the Replicate API in 3 bullets."},
)
print(f"Prediction ID: {prediction.id} — Status: {prediction.status}")
# Poll until terminal state
while prediction.status not in ("succeeded", "failed", "canceled"):
time.sleep(1)
prediction.reload()
if prediction.status == "succeeded":
print("".join(prediction.output))
else:
print(f"Failed: {prediction.error}")
Webhook (recommended for production):
prediction = replicate.predictions.create(
model="meta/meta-llama-3-70b-instruct",
input={"prompt": "Write a FastAPI health check endpoint."},
# Replicate POSTs the completed prediction JSON to this URL
webhook="https://your-api.us-east-1.example.com/webhooks/replicate",
webhook_events_filter=["completed"], # only fire on terminal state
)
print(f"Queued: {prediction.id}")
Your webhook handler receives a JSON body with status, output, error, and metrics (including predict_time in seconds for cost calculation).
Step 7: Build a Minimal FastAPI Endpoint
Wrap Replicate in a FastAPI route for a production-ready inference microservice:
from fastapi import FastAPI, BackgroundTasks, HTTPException
from pydantic import BaseModel
import replicate
app = FastAPI()
class InferRequest(BaseModel):
prompt: str
max_tokens: int = 512
class InferResponse(BaseModel):
prediction_id: str
status: str
@app.post("/infer", response_model=InferResponse)
async def create_inference(req: InferRequest):
try:
prediction = replicate.predictions.create(
model="meta/meta-llama-3-70b-instruct",
input={
"prompt": req.prompt,
"max_new_tokens": req.max_tokens,
},
webhook="https://your-api.us-east-1.example.com/webhooks/replicate",
webhook_events_filter=["completed"],
)
return InferResponse(prediction_id=prediction.id, status=prediction.status)
except replicate.exceptions.ReplicateError as e:
raise HTTPException(status_code=502, detail=str(e))
Run with: uvicorn main:app --host 0.0.0.0 --port 8000
Verification
Test your setup end-to-end with this one-liner:
python -c "
import replicate
out = replicate.run('meta/meta-llama-3-70b-instruct', input={'prompt': 'Say OK', 'max_new_tokens': 5})
print(''.join(out))
"
You should see: OK or similar within 5–10 seconds on a warm container.
Check your prediction history and cost at replicate.com/account.
Pricing Reference (USD)
| Model tier | GPU | Cost per second |
|---|---|---|
| Small (Llama 3 8B, Whisper) | Nvidia T4 | ~$0.000225/sec |
| Medium (Llama 3 70B, SDXL) | Nvidia A40 | ~$0.000725/sec |
| Large (Llama 3.1 405B) | Nvidia A100 | ~$0.001150/sec |
A typical Llama 3 70B response (2–4 seconds of compute) costs $0.001–$0.003 USD. SDXL image generation at 30 steps runs 8 seconds → $0.006 USD per image. The free tier includes $0.005 credit. Paid plans start at pay-as-you-go with no monthly minimum.
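The arithmetic above is just predict_time multiplied by the GPU rate, so you can turn the metrics.predict_time a webhook reports into a dollar figure directly. A small estimator using the table's rates:

```python
RATES_USD_PER_SEC = {  # per-second GPU rates from the pricing table
    "t4": 0.000225,
    "a40": 0.000725,
    "a100": 0.001150,
}


def estimate_cost(predict_time_sec: float, gpu: str = "a40") -> float:
    """Estimate a prediction's cost in USD from its predict_time metric."""
    return round(predict_time_sec * RATES_USD_PER_SEC[gpu], 6)
```

For example, estimate_cost(8, "a40") reproduces the ~$0.006 SDXL figure above.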
US developers: Replicate infrastructure runs in AWS us-east-1 and us-west-2 by default. Latency from the US East Coast is typically 60–120ms to first token on warm containers.
What You Learned
- replicate.run() is synchronous and ideal for scripts; use predictions.create() with webhooks for web services
- Pin version SHAs in production — model names without SHAs silently upgrade
- Image and audio models return URLs, not bytes — download within 1 hour before they expire
- Cold starts on free tier can take 30–90 seconds; dedicated deployments at $0.000725/sec eliminate this
- Replicate's Cog format guarantees reproducible model behavior across Python and CUDA versions
Tested on Replicate Python client v0.31, Python 3.12, Node.js 22 LTS, macOS Sequoia & Ubuntu 24.04
FAQ
Q: Does Replicate work without a credit card?
A: Yes. The free tier gives $0.005 in compute on signup with no card required. Add a card to increase rate limits and access larger models.
Q: What is the difference between replicate.run() and predictions.create()?
A: run() blocks synchronously until the prediction finishes — simple but ties up your thread. predictions.create() returns immediately with a prediction ID and lets you poll or receive a webhook, which is the right pattern for API servers.
Q: How much VRAM does Replicate use for Llama 3 70B?
A: You don't manage VRAM — Replicate provisions the right GPU (A40, 48GB VRAM) automatically based on the model's Cog config. You pay per second of GPU time.
Q: Can I run a private or fine-tuned model on Replicate?
A: Yes. Push a Cog container with cog push and your model appears as a private model in your account. Fine-tuned versions of supported base models (like Llama 3) can be created via Replicate's training API — fine-tune runs start at $0.00115/sec on A100.
Q: Does Replicate support streaming output?
A: Yes. Both replicate.run() and predictions.create() support streaming. The Python client returns a generator; iterate it to print tokens as they arrive without waiting for the full response.