Problem: Running ML Models Without Server Costs
You need to run image classification or text generation in your web app, but sending data to a server is slow, expensive, and raises privacy concerns.
You'll learn:
- How WebGPU accelerates ML inference 20-50x vs CPU
- Setting up ONNX Runtime Web with WebGPU backend
- Running a real image classifier entirely in-browser
Time: 30 min | Level: Intermediate
Why This Happens
Traditional browser ML uses WebGL (2011 tech) or CPU-only JavaScript. WebGPU (standardized 2023, widely available 2024+) gives direct GPU access with compute shaders, making real-time ML feasible.
Common symptoms:
- TensorFlow.js models run too slow for production
- Can't afford inference API costs at scale
- Privacy-sensitive data can't leave the browser
- Mobile users have terrible ML performance
Reality check: WebGPU gives 20-50x speedup on simple models, but you still can't run LLaMA 70B in a browser. Think image classification, small vision transformers, or quantized language models under 1GB.
Solution
Step 1: Verify WebGPU Support
// Check browser compatibility
async function checkWebGPU() {
  if (!navigator.gpu) {
    console.error('WebGPU not supported');
    return false;
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.error('No GPU adapter found');
    return false;
  }
  console.log('WebGPU ready:', adapter.info);
  return true;
}
checkWebGPU();
Expected: Console shows GPU adapter info (vendor, architecture)
If it fails:
- Chrome < 113: Update browser (WebGPU stable since Chrome 113)
- Firefox: Enable dom.webgpu.enabled in about:config (experimental)
- Safari: Available in Safari 18+ (macOS Sonoma+)
Current support (Feb 2026): Chrome/Edge 113+, Safari 18+, Firefox behind flag
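The result of this check can drive configuration later in the app by mapping directly onto the execution-provider list you pass to ONNX Runtime. A minimal sketch; pickProviders is a made-up helper, not part of any API:

```javascript
// Choose ONNX Runtime execution providers from a WebGPU availability flag.
// pickProviders is a hypothetical helper, not part of onnxruntime-web.
function pickProviders(hasWebGPU) {
  // Keep 'wasm' as the CPU fallback; put 'webgpu' first when available
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

// Usage in the browser (sketch):
// const providers = pickProviders(await checkWebGPU());
// const session = await ort.InferenceSession.create('./model.onnx', {
//   executionProviders: providers
// });

console.log(pickProviders(true));  // ['webgpu', 'wasm']
console.log(pickProviders(false)); // ['wasm']
```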
Step 2: Install ONNX Runtime Web
npm install onnxruntime-web
Why ONNX: An open model format you can export to from PyTorch (built-in), TensorFlow (via tf2onnx), and scikit-learn (via skl2onnx). ONNX Runtime Web has the best WebGPU integration as of 2026.
Alternatives:
- TensorFlow.js (good WebGPU support, larger bundle size)
- Transformers.js (great for NLP, limited vision models)
Step 3: Convert Your Model to ONNX
# Example: Export PyTorch ResNet to ONNX
import torch
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # 'pretrained=True' is deprecated
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    opset_version=17,  # Use a recent opset for WebGPU op coverage
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
Why opset 17: WebGPU backend supports ops from ONNX opset 13+, but 17 includes optimizations for attention mechanisms
Model size warning: Keep under 100MB for reasonable load times. Use quantization for larger models.
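A quick back-of-envelope check for that budget: FP32 weights cost 4 bytes per parameter, so ResNet-50's roughly 25.6M parameters land right around the 100MB line, and INT8 quantization cuts that by 4x. The helper below is just illustrative arithmetic, not part of any library:

```javascript
// Rough model download size: parameter count x bytes per weight.
// Illustrative helper only.
function modelSizeMB(numParams, bytesPerWeight = 4) {
  return (numParams * bytesPerWeight) / (1024 * 1024);
}

const resnet50Params = 25.6e6; // ~25.6M parameters
console.log(modelSizeMB(resnet50Params).toFixed(0));    // "98"  MB in FP32
console.log(modelSizeMB(resnet50Params, 1).toFixed(0)); // "24"  MB quantized to INT8
```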
Step 4: Load and Run Inference
import * as ort from 'onnxruntime-web/webgpu'; // WebGPU-enabled bundle

// Point ORT at its WASM binaries, used when it falls back to the CPU backend
ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/';

async function runInference(imageData) {
  // For clarity the session is created here; in production, create it once
  // at startup and reuse it for every inference (see Step 5)
  const session = await ort.InferenceSession.create('./resnet50.onnx', {
    executionProviders: ['webgpu', 'wasm'], // Fallback order
    graphOptimizationLevel: 'all'
  });
  // Preprocess image to tensor
  const tensor = preprocessImage(imageData);
  // Run inference
  const feeds = { input: tensor };
  const results = await session.run(feeds);
  // Get predictions
  const output = results.output.data; // Float32Array of logits
  return getTopPredictions(output, 5);
}

function preprocessImage(imageData) {
  // Resize to 224x224, normalize to ImageNet stats
  const canvas = document.createElement('canvas');
  canvas.width = 224;
  canvas.height = 224;
  const ctx = canvas.getContext('2d');
  ctx.drawImage(imageData, 0, 0, 224, 224);
  const pixels = ctx.getImageData(0, 0, 224, 224).data;
  const float32Data = new Float32Array(3 * 224 * 224);
  // Convert interleaved RGBA to planar RGB, normalize with ImageNet mean/std
  for (let i = 0; i < 224 * 224; i++) {
    float32Data[i] = (pixels[i * 4] / 255 - 0.485) / 0.229; // R
    float32Data[224 * 224 + i] = (pixels[i * 4 + 1] / 255 - 0.456) / 0.224; // G
    float32Data[2 * 224 * 224 + i] = (pixels[i * 4 + 2] / 255 - 0.406) / 0.225; // B
  }
  return new ort.Tensor('float32', float32Data, [1, 3, 224, 224]);
}

function getTopPredictions(logits, k = 5) {
  // Apply softmax and get top-k
  const probabilities = softmax(logits);
  const indices = Array.from(probabilities.keys())
    .sort((a, b) => probabilities[b] - probabilities[a])
    .slice(0, k);
  return indices.map(i => ({
    class: IMAGENET_CLASSES[i],
    probability: probabilities[i]
  }));
}

function softmax(arr) {
  const max = Math.max(...arr); // Subtract max for numerical stability
  const exp = arr.map(x => Math.exp(x - max));
  const sum = exp.reduce((a, b) => a + b);
  return exp.map(x => x / sum);
}
Why this works: ONNX Runtime detects WebGPU, compiles compute shaders for each operation, runs on GPU. Falls back to WebAssembly CPU if WebGPU unavailable.
Performance: ResNet-50 inference ~15-30ms on modern GPUs (M1/M2, RTX 30-series) vs 300-500ms on CPU
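To put those latencies in frame-rate terms (relevant if you classify video frames), per-inference time converts to a throughput ceiling as 1000/ms, ignoring pre/post-processing:

```javascript
// Convert per-inference latency into a rough frames-per-second ceiling
const fps = ms => 1000 / ms;

console.log(fps(30).toFixed(1));  // "33.3" — WebGPU keeps up with 30fps video
console.log(fps(400).toFixed(1)); // "2.5" — CPU fallback is preview-quality only
```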
Step 5: Optimize for Production
// Cache the session globally, load once at app start.
// Caching the promise (not the resolved session) means two concurrent
// callers share a single model load instead of creating it twice.
let sessionPromise = null;

function initModel() {
  if (!sessionPromise) {
    sessionPromise = ort.InferenceSession.create('./resnet50.onnx', {
      executionProviders: ['webgpu', 'wasm'],
      graphOptimizationLevel: 'all',
      executionMode: 'parallel', // Run independent graph nodes in parallel
      logSeverityLevel: 3 // Only errors in production
    });
  }
  return sessionPromise;
}

// Warm up model (first inference is slower due to shader compilation)
async function warmup() {
  const session = await initModel();
  const dummyTensor = new ort.Tensor('float32', new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]);
  await session.run({ input: dummyTensor });
  console.log('Model warmed up');
}

// Call on page load
warmup();
Why warm up: First inference compiles GPU shaders (50-200ms). Subsequent runs use cached shaders.
Bundle optimization:
// Use CDN for ONNX Runtime WASM files (saves ~8MB from bundle)
ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.0/dist/';
Verification
Test WebGPU Performance
async function benchmark() {
  const session = await initModel();
  const testTensor = new ort.Tensor('float32', new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]);
  const runs = 100;
  const start = performance.now();
  for (let i = 0; i < runs; i++) {
    await session.run({ input: testTensor });
  }
  const elapsed = performance.now() - start;
  console.log(`Average inference time: ${elapsed / runs}ms`);
}
benchmark();
You should see:
- WebGPU: 15-30ms per inference (desktop), 30-60ms (mobile)
- CPU fallback: 300-500ms per inference
If slow:
- Check DevTools → Performance → GPU usage during inference
- Ensure model loaded from cache (Network tab shows "(disk cache)" or "(memory cache)")
- Try smaller model (MobileNetV3 is 5x faster, 2% less accurate)
Real-World Example: Image Classifier UI
<!DOCTYPE html>
<html>
<head>
  <title>WebGPU Image Classifier</title>
</head>
<body>
  <input type="file" id="imageInput" accept="image/*">
  <canvas id="preview" width="224" height="224"></canvas>
  <div id="results"></div>
  <script type="module">
    import * as ort from 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.0/dist/esm/ort.webgpu.min.js'; // WebGPU-enabled ESM bundle

    let session = null;

    // Initialize model
    async function init() {
      ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.0/dist/';
      session = await ort.InferenceSession.create('./resnet50.onnx', {
        executionProviders: ['webgpu', 'wasm']
      });
      console.log('Ready');
    }

    // Handle image upload
    document.getElementById('imageInput').onchange = async (e) => {
      if (!session) return; // Model still loading
      const file = e.target.files[0];
      const img = await loadImage(file);
      // Show preview
      const canvas = document.getElementById('preview');
      const ctx = canvas.getContext('2d');
      ctx.drawImage(img, 0, 0, 224, 224);
      // Classify
      const tensor = preprocessImage(img);
      const results = await session.run({ input: tensor });
      const predictions = getTopPredictions(results.output.data, 5);
      // Display results
      document.getElementById('results').innerHTML = predictions
        .map(p => `<div>${p.class}: ${(p.probability * 100).toFixed(1)}%</div>`)
        .join('');
    };

    function loadImage(file) {
      return new Promise((resolve) => {
        const img = new Image();
        img.onload = () => {
          URL.revokeObjectURL(img.src); // Free the blob once decoded
          resolve(img);
        };
        img.src = URL.createObjectURL(file);
      });
    }

    // (Include preprocessImage and getTopPredictions from Step 4)
    init();
  </script>
</body>
</html>
Test it: Upload a photo, should see predictions in <50ms (WebGPU) or <500ms (CPU fallback)
What You Learned
- WebGPU enables real-time ML inference in browsers (20-50x faster than CPU)
- ONNX Runtime Web supports most PyTorch/TensorFlow models with minimal conversion
- Always provide CPU fallback (WebGPU support still growing)
- First inference is slow (shader compilation), cache sessions
Limitations:
- Model size: Keep under 100MB for reasonable UX
- Browser support: Chrome/Edge 113+ and Safari 18+ only; Firefox behind a flag
- Memory: Large models can crash mobile browsers
- Not for training: Inference only, use PyTorch/JAX for training
When NOT to use this:
- LLMs over 1-2B parameters (use server or quantization)
- Models requiring custom ops not in ONNX
- Legacy browser support required
Production Checklist
- Add loading spinner during model initialization
- Show CPU fallback warning if WebGPU unavailable
- Implement progressive loading for large models
- Add error handling for OOM crashes
- Test on mobile Safari and Chrome
- Monitor bundle size (<500KB for ONNX Runtime)
- Add Web Worker for non-blocking inference
- Implement model caching with Service Worker
Resources
Model Hubs:
- ONNX Model Zoo - Pre-converted models
- Hugging Face ONNX - NLP/Vision models
Performance:
- Quantize to INT8: 4x smaller, 2-3x faster, <1% accuracy loss
- Use MobileNet/EfficientNet for mobile
- Split large models into chunks for progressive loading
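Chunked loading boils down to fetching the model as several byte ranges and stitching them back into one buffer before handing it to ort.InferenceSession.create (which accepts model bytes as well as a URL). A minimal sketch of the stitching step; the chunk URLs and part naming are hypothetical:

```javascript
// Reassemble model chunks downloaded separately (e.g. model.onnx.part0..N).
// Pure helper; chunk URLs/naming are hypothetical.
function concatChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset); // Copy each chunk at its running offset
    offset += c.length;
  }
  return out;
}

// Usage in the browser (sketch):
// const buffers = await Promise.all(urls.map(u => fetch(u).then(r => r.arrayBuffer())));
// const modelBytes = concatChunks(buffers.map(b => new Uint8Array(b)));
// const session = await ort.InferenceSession.create(modelBytes, {
//   executionProviders: ['webgpu', 'wasm']
// });

console.log(concatChunks([new Uint8Array([1, 2]), new Uint8Array([3])])); // bytes [1, 2, 3]
```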
Tested on Chrome 121, Safari 18, ONNX Runtime Web 1.17, macOS M2 & Windows RTX 4070