Problem: AI Inference Has a Server Problem
Every time your app calls an AI model, it sends data to a server, waits for a round trip, pays per token, and leaks user data. WebGPU changes this. You can now run real transformer models directly in Chrome — GPU-accelerated, private, offline-capable, and free after the initial download.
You'll learn:
- How WebGPU exposes GPU compute to JavaScript
- How to load and run a transformer model in the browser with transformers.js
- How to benchmark WebGPU vs CPU inference so you can decide when it's worth it
Time: 20 min | Level: Intermediate
Why This Works Now
WebGPU shipped in Chrome 113 (2023) and has matured significantly. It gives JavaScript direct access to the GPU's compute shaders — the same primitives that make CUDA fast — without a native binary.
What changed to make inference viable:
- Chrome's WebGPU implementation now supports the full WGSL compute shader spec
- transformers.js v3 added a WebGPU execution provider, replacing the slower WASM fallback
- Quantized models (INT4, INT8) fit in VRAM without filling the GPU memory budget
- SharedArrayBuffer + OffscreenCanvas let you run inference in a Web Worker without blocking the main thread
Common symptoms that push people to this:
- API costs scaling uncomfortably with usage
- Latency spikes from cold-start serverless functions
- GDPR/HIPAA constraints that make sending user content to third-party APIs painful
Solution
Step 1: Check WebGPU Availability
Not every device will have WebGPU. Always gate on it.
// Check before loading any model assets
async function getWebGPUDevice(): Promise<GPUDevice | null> {
  if (!navigator.gpu) {
    // WebGPU not supported — fall back to WASM/CPU
    return null;
  }
  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: "high-performance", // Request discrete GPU if available
  });
  if (!adapter) return null;
  return adapter.requestDevice();
}
Expected: Returns a GPUDevice on Chrome 113+ with WebGPU-capable hardware. Returns null where WebGPU is unavailable: Firefox and Safari support is still partial across platforms, and some GPUs or drivers fail the adapter request.
If it fails:
- navigator.gpu is undefined: Browser doesn't support WebGPU. Serve a WASM fallback.
- requestAdapter() returns null: GPU driver doesn't expose WebGPU. Common on Linux with older Mesa drivers.
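To make the gate actionable, the probe can collapse into a device choice. A minimal sketch, assuming the two device values used later in this guide; `pickDevice` and `GPULike` are illustrative names, and the GPU handle is passed in so the logic stays testable:

```typescript
// Minimal shape of navigator.gpu that the decision needs
type GPULike = { requestAdapter(): Promise<object | null> } | undefined;

// Map WebGPU availability onto a transformers.js device option.
// Pass navigator.gpu in the browser; undefined means no WebGPU at all.
async function pickDevice(gpu: GPULike): Promise<"webgpu" | "wasm"> {
  if (!gpu) return "wasm"; // navigator.gpu missing entirely
  const adapter = await gpu.requestAdapter();
  return adapter ? "webgpu" : "wasm"; // null adapter: driver exposes no WebGPU
}

// In the browser: const device = await pickDevice(navigator.gpu);
```

Keeping the decision in one place means every later pipeline call gets the same fallback behavior.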
Step 2: Install and Configure transformers.js
npm install @huggingface/transformers
transformers.js v3 wraps ONNX Runtime Web and automatically routes to WebGPU when available.
import { pipeline, env } from "@huggingface/transformers";
// Tell transformers.js to use WebGPU backend
env.backends.onnx.wasm.numThreads = 1; // Disable WASM threads when using WebGPU
// Models are cached by the browser (Cache API) so repeat loads are fast
env.useBrowserCache = true; // Default in browsers; env.cacheDir applies to Node only
Step 3: Load a Model in a Web Worker
Running inference on the main thread freezes the UI. Always use a Worker.
// inference.worker.ts
import { pipeline, env } from "@huggingface/transformers";
type InferenceRequest = {
  text: string;
  taskId: string;
};

let classifier: Awaited<ReturnType<typeof pipeline>> | null = null;

async function loadModel() {
  // "device: webgpu" is the key flag — falls back to wasm if unavailable
  classifier = await pipeline(
    "text-classification",
    "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
    { device: "webgpu", dtype: "q8" } // q8 = INT8 quantization (~4x smaller than fp32)
  );
}

self.onmessage = async (event: MessageEvent<InferenceRequest>) => {
  if (!classifier) await loadModel();
  const result = await classifier!(event.data.text);
  self.postMessage({ taskId: event.data.taskId, result });
};
// main.ts — spawn the worker
const worker = new Worker(new URL("./inference.worker.ts", import.meta.url), {
  type: "module",
});

function classify(text: string): Promise<unknown> {
  return new Promise((resolve) => {
    const taskId = crypto.randomUUID();
    const handler = (e: MessageEvent) => {
      if (e.data.taskId === taskId) {
        worker.removeEventListener("message", handler);
        resolve(e.data.result);
      }
    };
    worker.addEventListener("message", handler);
    worker.postMessage({ text, taskId });
  });
}
Expected: First call takes 2-8 seconds for model download + compile (WGSL shader compilation). Subsequent calls on the same session are fast — 10-50ms for a distilbert classification.
If it fails:
- SharedArrayBuffer error: Your server must send Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers. The multithreaded WASM fallback requires them.
- Out of memory: Switch to a smaller model or use dtype: "q4" (INT4). Most consumer GPUs have 4-8GB of VRAM, often shared with the display.
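The WebGPU-vs-CPU comparison promised up top only needs a timing loop. A sketch, assuming the classify helper above; `benchmark` and the sample sentence are illustrative:

```typescript
// Time repeated classifications and report median/min/max latency in ms.
// Run once with device: "webgpu" and once with device: "wasm" to compare.
async function benchmark(
  classify: (text: string) => Promise<unknown>,
  runs = 20
): Promise<{ median: number; min: number; max: number }> {
  await classify("warm-up"); // absorb model load + shader compile up front
  const times: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await classify("The movie was surprisingly good.");
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return { median: times[Math.floor(runs / 2)], min: times[0], max: times[runs - 1] };
}
```

Report the median rather than the mean: GC pauses and tab throttling produce outliers that skew averages.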
Step 4: Handle the Cold Start
The first inference includes shader compilation, which is slow. Show progress to the user.
// transformers.js fires progress events during model load
classifier = await pipeline(
  "text-classification",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
  {
    device: "webgpu",
    dtype: "q8",
    progress_callback: (progress) => {
      // Post progress back to main thread for a loading bar
      self.postMessage({
        type: "progress",
        status: progress.status, // "download", "progress", "done"
        loaded: progress.loaded,
        total: progress.total,
      });
    },
  }
);
Why this matters: Without feedback, users think the page is frozen during the 2-8s compile. A progress bar sets the right expectation.
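On the main thread, those messages can drive the loading bar. A sketch; `ProgressMsg` and `toPercent` are illustrative names, and the field names match what the worker posts above:

```typescript
// Shape of the progress messages posted by the worker above
type ProgressMsg = { type: "progress"; status: string; loaded?: number; total?: number };

// Convert a progress message to a 0-100 percentage, or null if not computable
function toPercent(msg: ProgressMsg): number | null {
  if (msg.status !== "progress" || !msg.loaded || !msg.total) return null;
  return Math.round((msg.loaded / msg.total) * 100);
}

// worker.addEventListener("message", (e) => {
//   if (e.data.type === "progress") {
//     const pct = toPercent(e.data);
//     if (pct !== null) loadingBar.value = pct; // loadingBar: your <progress> element
//   }
// });
```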
Verification
Open Chrome DevTools → Performance → record while running your first inference.
// If you're using Vite, the COOP/COEP headers go in vite.config.ts
export default {
  server: {
    headers: {
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    },
  },
};
You should see: In the Performance panel, GPU activity during inference. If you only see CPU activity, WebGPU didn't activate — check the console for the transformers.js backend warning.
To confirm WebGPU is active programmatically:
// A quick proxy check: if no adapter is available, transformers.js cannot be on WebGPU
const adapter = await navigator.gpu?.requestAdapter();
console.log(adapter ? "WebGPU adapter available" : "No WebGPU adapter: running on WASM/CPU");
What You Learned
- WebGPU exposes GPU compute to JavaScript via navigator.gpu; gate on it before loading models
- transformers.js v3 routes to WebGPU automatically with device: "webgpu"
- Always run inference in a Web Worker to avoid blocking the UI
- COOP/COEP headers are required for SharedArrayBuffer, which the multithreaded WASM fallback needs
- INT8 quantization (dtype: "q8") is the sweet spot: 4x smaller than fp32 with minimal accuracy loss
When NOT to use this:
- Models larger than ~500MB quantized — download time kills the UX
- Users on integrated graphics or mobile GPUs — WebGPU adapter requests often fail
- Tasks needing GPT-4-class reasoning — no model that size runs in a browser today
Limitation: WebGPU shader compilation is per-session. There's no persistent shader cache across page loads yet in Chrome. Each cold start recompiles. This will improve as the spec matures.
Tested on Chrome 132, transformers.js 3.3.x, TypeScript 5.7, Vite 6 — macOS (M2) and Windows 11 (RTX 4070)