WebGPU AI: Running Inference Directly in Chrome Without Servers

Run AI models in the browser using WebGPU and transformers.js — no backend, no API keys, full GPU acceleration in Chrome.

Problem: AI Inference Has a Server Problem

Every time your app calls an AI model, it sends data to a server, waits for a round trip, pays per token, and leaks user data. WebGPU changes this. You can now run real transformer models directly in Chrome — GPU-accelerated, private, offline-capable, and free after the initial download.

You'll learn:

  • How WebGPU exposes GPU compute to JavaScript
  • How to load and run a transformer model in the browser with transformers.js
  • How to benchmark WebGPU vs CPU inference so you can decide when it's worth it

Time: 20 min | Level: Intermediate


Why This Works Now

WebGPU shipped in Chrome 113 (2023) and has matured significantly. It gives JavaScript direct access to the GPU's compute shaders — the same primitives that make CUDA fast — without a native binary.

What changed to make inference viable:

  • Chrome's WebGPU implementation now supports the full WGSL compute shader spec
  • transformers.js v3 added a WebGPU execution provider, replacing the slower WASM fallback
  • Quantized models (INT4, INT8) fit in VRAM without filling the GPU memory budget
  • SharedArrayBuffer + OffscreenCanvas let you run inference in a Web Worker without blocking the main thread
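To see what "direct access to compute shaders" means concretely, here's a minimal WGSL kernel that doubles an array — a sketch of the raw WebGPU API that libraries like transformers.js drive for you (the workgroup size and dispatch math are illustrative, and `DOUBLE_WGSL` is just a name chosen here):

```typescript
// WGSL: the shading language WebGPU compiles for the GPU.
const DOUBLE_WGSL = /* wgsl */ `
  @group(0) @binding(0) var<storage, read_write> data: array<f32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    data[id.x] = data[id.x] * 2.0; // One GPU thread per array element
  }
`;

// Usage sketch (browser only — requires the GPUDevice from Step 1):
// const module = device.createShaderModule({ code: DOUBLE_WGSL });
// const computePipeline = device.createComputePipeline({
//   layout: "auto",
//   compute: { module, entryPoint: "main" },
// });
// ...then bind a storage buffer, dispatchWorkgroups(n / 64), and read results back.

console.log(DOUBLE_WGSL.includes("@compute")); // true
```

A matrix multiply in a transformer is this same pattern at scale: a WGSL kernel dispatched over thousands of invocations.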

Common symptoms that push people to this:

  • API costs scaling uncomfortably with usage
  • Latency spikes from cold-start serverless functions
  • GDPR/HIPAA constraints that make sending user content to third-party APIs painful

Solution

Step 1: Check WebGPU Availability

Not every device will have WebGPU. Always gate on it.

// Check before loading any model assets
async function getWebGPUDevice(): Promise<GPUDevice | null> {
  if (!navigator.gpu) {
    // WebGPU not supported — fall back to WASM/CPU
    return null;
  }

  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: "high-performance", // Request discrete GPU if available
  });

  if (!adapter) return null;

  return adapter.requestDevice();
}

Expected: Returns a GPUDevice on Chrome 113+ with WebGPU-capable hardware. Returns null where WebGPU isn't available — older Firefox and Safari releases (both browsers have been shipping WebGPU support incrementally), and hardware or drivers where the adapter request fails.

If it fails:

  • navigator.gpu is undefined: Browser doesn't support WebGPU. Serve a WASM fallback.
  • requestAdapter() returns null: GPU driver doesn't expose WebGPU. Common on Linux with older Mesa drivers.
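Both failure modes collapse into the same decision: which device string to hand to transformers.js. A sketch — `pickDevice` is a hypothetical helper, but the `"webgpu"`/`"wasm"` strings match the library's device option:

```typescript
type InferenceDevice = "webgpu" | "wasm";

// Hypothetical helper: choose the transformers.js device string.
// forceCpu lets users opt out of the GPU path (e.g. to save battery).
function pickDevice(hasWebGPU: boolean, forceCpu = false): InferenceDevice {
  if (forceCpu || !hasWebGPU) return "wasm"; // Portable WASM/CPU fallback
  return "webgpu";
}

// Usage sketch in the browser:
// const device = pickDevice((await getWebGPUDevice()) !== null);
// const classifier = await pipeline(task, model, { device });

console.log(pickDevice(true));  // "webgpu"
console.log(pickDevice(false)); // "wasm"
```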

Step 2: Install and Configure transformers.js

npm install @huggingface/transformers

transformers.js v3 wraps ONNX Runtime Web and automatically routes to WebGPU when available.

import { pipeline, env } from "@huggingface/transformers";

// Tell transformers.js to use WebGPU backend
env.backends.onnx.wasm.numThreads = 1; // Disable WASM threads when using WebGPU

// Model files are cached via the browser Cache API, so repeat loads are fast
env.useBrowserCache = true; // true is the default; set false to always re-download
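If you'd rather serve model files from your own origin — for offline use, or to avoid fetching from the Hugging Face Hub at runtime — transformers.js exposes env flags for that. A sketch, assuming you've copied the model's ONNX files under `/models/` on your server:

```typescript
import { env } from "@huggingface/transformers";

// Serve model files from your own origin instead of the Hugging Face Hub
env.allowRemoteModels = false; // Never fetch from the Hub CDN
env.allowLocalModels = true;   // Resolve model IDs against localModelPath
env.localModelPath = "/models/";
```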

Step 3: Load a Model in a Web Worker

Running inference on the main thread freezes the UI. Always use a Worker.

// inference.worker.ts
import { pipeline, env } from "@huggingface/transformers";

type InferenceRequest = {
  text: string;
  taskId: string;
};

let classifier: Awaited<ReturnType<typeof pipeline>> | null = null;

async function loadModel() {
  // "device: webgpu" is the key flag — falls back to wasm if unavailable
  classifier = await pipeline(
    "text-classification",
    "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
    { device: "webgpu", dtype: "q8" } // q8 = INT8 quantization (~4x smaller than fp32)
  );
}

self.onmessage = async (event: MessageEvent<InferenceRequest>) => {
  if (!classifier) await loadModel();

  // Non-null assertion: loadModel() assigns classifier before this line runs
  const result = await classifier!(event.data.text);

  self.postMessage({ taskId: event.data.taskId, result });
};

// main.ts — spawn the worker
const worker = new Worker(new URL("./inference.worker.ts", import.meta.url), {
  type: "module",
});

function classify(text: string): Promise<unknown> {
  return new Promise((resolve) => {
    const taskId = crypto.randomUUID();

    const handler = (e: MessageEvent) => {
      if (e.data.taskId === taskId) {
        worker.removeEventListener("message", handler);
        resolve(e.data.result);
      }
    };

    worker.addEventListener("message", handler);
    worker.postMessage({ text, taskId });
  });
}

Expected: First call takes 2-8 seconds for model download + compile (WGSL shader compilation). Subsequent calls on the same session are fast — 10-50ms for a distilbert classification.

If it fails:

  • SharedArrayBuffer error: Your server must send Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers. It's the multithreaded WASM fallback (via SharedArrayBuffer) that requires cross-origin isolation, not WebGPU itself — but you'll hit this as soon as any user's browser falls back.
  • Out of memory: Switch to a smaller model or use dtype: "q4" (INT4). Most consumer GPUs have 4-8GB VRAM shared with the display.
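One way to handle the out-of-memory case inside the worker is a quantization fallback chain. A sketch — `DTYPE_CHAIN` and `nextDtype` are hypothetical names, not transformers.js APIs; only the `dtype` values are the library's:

```typescript
// Hypothetical fallback chain: try higher-precision quantizations first.
const DTYPE_CHAIN = ["q8", "q4"] as const;
type Dtype = (typeof DTYPE_CHAIN)[number];

function nextDtype(current: Dtype): Dtype | null {
  const i = DTYPE_CHAIN.indexOf(current);
  return i >= 0 && i + 1 < DTYPE_CHAIN.length ? DTYPE_CHAIN[i + 1] : null;
}

// Usage sketch inside loadModel():
// let dtype: Dtype | null = "q8";
// while (dtype) {
//   try {
//     classifier = await pipeline(task, model, { device: "webgpu", dtype });
//     break;
//   } catch {
//     dtype = nextDtype(dtype); // Out of memory? Retry smaller, then give up
//   }
// }

console.log(nextDtype("q8")); // "q4"
console.log(nextDtype("q4")); // null
```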

Step 4: Handle the Cold Start

The first inference includes shader compilation, which is slow. Show progress to the user.

// transformers.js fires progress events during model load
classifier = await pipeline(
  "text-classification",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
  {
    device: "webgpu",
    dtype: "q8",
    progress_callback: (progress) => {
      // Post progress back to main thread for a loading bar
      self.postMessage({
        type: "progress",
        status: progress.status, // "download", "progress", "done"
        loaded: progress.loaded,
        total: progress.total,
      });
    },
  }
);

Why this matters: Without feedback, users think the page is frozen during the 2-8s compile. A progress bar sets the right expectation.
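On the main thread, those progress messages can drive the loading bar. A sketch — the message shape matches the postMessage call above, and `percent` is a hypothetical helper:

```typescript
// Convert loaded/total bytes into a clamped percentage for a progress bar.
function percent(loaded: number, total: number): number {
  if (total <= 0) return 0; // total can be 0 before the download size is known
  return Math.min(100, Math.round((loaded / total) * 100));
}

// Usage sketch on the main thread:
// worker.addEventListener("message", (e) => {
//   if (e.data.type === "progress" && e.data.status === "progress") {
//     progressBar.value = percent(e.data.loaded, e.data.total);
//   }
// });

console.log(percent(50, 200)); // 25
console.log(percent(0, 0));    // 0
```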


Verification

Open Chrome DevTools → Performance → record while running your first inference.

If you're using Vite, set the COOP/COEP headers in vite.config.ts:

// vite.config.ts
export default {
  server: {
    headers: {
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    },
  },
};

You should see: In the Performance panel, GPU activity during inference. If you only see CPU activity, WebGPU didn't activate — check the console for the transformers.js backend warning.

To confirm the backend programmatically, inspect the ONNX configuration after the pipeline loads:

// Log the configured ONNX backends after load
console.log(env.backends.onnx); // Inspect which execution provider is configured

If the console also shows a transformers.js warning about falling back to WASM, WebGPU did not activate.
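To actually compare WebGPU against the WASM/CPU backend — the benchmark promised at the top — time repeated warm runs and compare medians. A sketch: `median` and `bench` are hypothetical helpers, and the run count is illustrative:

```typescript
// Median is more robust than mean against GC pauses and one-off outliers.
function median(samples: number[]): number {
  const s = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Usage sketch: build one pipeline per device and time warm runs only.
// async function bench(device: "webgpu" | "wasm", runs = 20): Promise<number> {
//   const clf = await pipeline("text-classification", MODEL_ID, { device, dtype: "q8" });
//   await clf("warm-up"); // Exclude download + shader compile from timing
//   const times: number[] = [];
//   for (let i = 0; i < runs; i++) {
//     const t0 = performance.now();
//     await clf("The movie was surprisingly good.");
//     times.push(performance.now() - t0);
//   }
//   return median(times);
// }
// console.log("webgpu:", await bench("webgpu"), "ms");
// console.log("wasm:  ", await bench("wasm"), "ms");

console.log(median([12, 10, 11])); // 11
console.log(median([10, 20]));     // 15
```

If the WebGPU median isn't meaningfully below the WASM one on your target hardware, the extra complexity may not be worth it.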

[Screenshot: Chrome DevTools Performance panel during inference — the GPU timeline shows activity; if it's flat, you're on CPU]


What You Learned

  • WebGPU exposes GPU compute to JavaScript via navigator.gpu — gate on it before loading models
  • transformers.js v3 routes to WebGPU automatically with device: "webgpu"
  • Always run inference in a Web Worker to avoid blocking the UI
  • COOP/COEP headers are required for SharedArrayBuffer, which Workers need
  • INT8 quantization (dtype: "q8") is the sweet spot — 4x smaller than fp32 with minimal accuracy loss

When NOT to use this:

  • Models larger than ~500MB quantized — download time kills the UX
  • Users on integrated graphics or mobile GPUs — WebGPU adapter requests often fail
  • Tasks needing GPT-4-class reasoning — no model that size runs in a browser today

Limitation: WebGPU shader compilation is per-session. There's no persistent shader cache across page loads yet in Chrome. Each cold start recompiles. This will improve as the spec matures.


Tested on Chrome 132, transformers.js 3.3.x, TypeScript 5.7, Vite 6 — macOS (M2) and Windows 11 (RTX 4070)