Problem: Your CPU Runs Hot and Slow During Local LLM Inference
You're running Mistral, Phi-3, or LLaMA locally and your CPU hits 90°C, throttles, and still crawls along at a handful of tokens per second. Your laptop fan sounds like a jet engine.
Meanwhile, your chip has a dedicated Neural Processing Unit (NPU) sitting completely idle.
You'll learn:
- How to detect and validate your integrated NPU
- Which runtime to use for your hardware (Intel, AMD, Qualcomm, Apple)
- How to route inference through the NPU with ONNX Runtime or OpenVINO
- What model formats and quantization levels actually work in 2026
Time: 20 min | Level: Intermediate
Why This Happens
Most LLM runtimes — llama.cpp, Ollama, LM Studio — default to CPU or GPU. They don't auto-detect NPUs because NPU drivers, runtimes, and model formats vary wildly by vendor.
Your NPU (Intel AI Boost, Qualcomm Hexagon, Apple Neural Engine, AMD XDNA) is a dedicated low-power matrix accelerator. It's built for exactly the ops transformers lean on: matrix multiplication, attention, softmax. It runs them at a fraction of the CPU's wattage.
Common symptoms of CPU-only inference:
- Token generation under 10 tokens/sec on modern hardware
- CPU cores pinned at 100% during inference
- Laptop battery drains in under 2 hours
- Windows Task Manager shows NPU at 0% utilization
This is the problem. NPU at 0%, CPU screaming — we're going to fix this.
Before You Start: Check Your NPU
Not all integrated NPUs are worth routing inference to. Check yours first.
# Windows — check for NPU in Device Manager programmatically
powershell -Command "Get-PnpDevice | Where-Object {$_.FriendlyName -like '*NPU*' -or $_.FriendlyName -like '*Neural*' -or $_.FriendlyName -like '*VPU*'}"
# Linux
lspci | grep -i "neural\|npu\|vpu"
# Or check for Intel VPU
ls /dev/accel*
What you need to proceed:
- Intel Core Ultra (Meteor Lake / Arrow Lake / Lunar Lake) — Intel NPU with AI Boost, 10–48 TOPS
- Qualcomm Snapdragon X Elite/Plus — Hexagon NPU, 45 TOPS
- Apple Silicon (M1+) — Apple Neural Engine (use Core ML path, not this guide)
- AMD Ryzen AI (Phoenix, Strix) — AMD XDNA NPU, 10–50 TOPS
If you have an older chip with no NPU, skip to the iGPU fallback section.
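Once you know what the OS reports, it helps to map the device name to the right toolchain before installing anything. A minimal sketch; the helper names and keyword lists below are this guide's illustration, not part of any vendor SDK:

```python
import glob
import platform

def npu_device_nodes():
    """List Linux 'accel' device nodes -- drivers like intel_vpu expose
    the NPU as /dev/accel/accel0. Returns [] on other platforms."""
    if platform.system() != "Linux":
        return []
    return sorted(glob.glob("/dev/accel/accel*"))

def suggest_runtime(device_name: str) -> str:
    """Map a reported device/brand string to the runtime used in this guide."""
    n = device_name.lower()
    if any(k in n for k in ("qualcomm", "hexagon", "snapdragon")):
        return "QNN"
    if any(k in n for k in ("intel", "ai boost", "vpu")):
        return "OpenVINO"
    if any(k in n for k in ("amd", "xdna", "ryzen ai")):
        return "ONNX Runtime (Vitis AI EP)"
    if any(k in n for k in ("apple", "neural engine")):
        return "Core ML"
    return "unknown -- use the iGPU fallback"

print(suggest_runtime("Intel(R) AI Boost"))  # OpenVINO
```

Feed it whatever string Device Manager or lspci printed in the previous step.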
Solution A: Intel NPU with OpenVINO + ONNX Runtime
This path works on Intel Core Ultra chips (Meteor Lake and later; earlier Core generations have no NPU). OpenVINO is Intel's inference SDK and has the most mature NPU support.
Step 1: Install Dependencies
pip install openvino==2024.6.0
pip install onnxruntime-openvino
pip install optimum[openvino] # Hugging Face integration
Verify the NPU is visible to OpenVINO:
from openvino import Core
core = Core()
print(core.available_devices)
# Should print: ['CPU', 'GPU', 'NPU']
If NPU is missing: Install the latest Intel NPU driver from intel.com/npu-driver. Reboot, then re-run.
Step 2: Export Your Model to OpenVINO IR
Use Optimum to export a Hugging Face model directly to OpenVINO's Intermediate Representation (IR) format. INT4 quantization is the sweet spot — it fits most 3B–7B models in NPU memory.
# Export Phi-3.5-mini to OpenVINO IR with INT4 weight quantization
optimum-cli export openvino \
  --model microsoft/Phi-3.5-mini-instruct \
  --weight-format int4 \
  --trust-remote-code \
  ./phi35-mini-ov-int4
This takes 5–10 minutes. You'll get a folder with .xml and .bin files.
If you see OOM during export: Switch to --weight-format int8. INT4 requires more RAM during the conversion process (not during inference).
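If you're exporting several models, or retrying at int8 after an OOM, scripting the CLI invocation keeps the runs reproducible. A small sketch; `export_cmd` is a convenience helper for this guide, not part of Optimum:

```python
import shlex

def export_cmd(model_id: str, out_dir: str, weight_format: str = "int4") -> str:
    """Build the optimum-cli export command shown above as a single string."""
    return shlex.join([
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--weight-format", weight_format,  # drop to "int8" if export OOMs
        "--trust-remote-code",
        out_dir,
    ])

print(export_cmd("microsoft/Phi-3.5-mini-instruct", "./phi35-mini-ov-int4"))
```

Pass the result to `subprocess.run(..., shell=True)` or paste it into a terminal.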
Step 3: Run Inference on the NPU
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
model_path = "./phi35-mini-ov-int4"
# Load with NPU as the device
# HINT: "NPU" routes compute; "CPU" is the fallback for unsupported ops
model = OVModelForCausalLM.from_pretrained(
    model_path,
    device="NPU",  # Route to Intel NPU
    ov_config={
        "PERFORMANCE_HINT": "LATENCY",  # Optimize for fast first token
        "NUM_STREAMS": "1",             # Single stream = lower latency
        "CACHE_DIR": "./ov_cache",      # Cache compiled model, avoids recompiling
    },
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Explain transformer attention in one paragraph:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Expected first run: Slow (30–60 seconds) while OpenVINO compiles the model for your NPU. Subsequent runs use the cache and start in under 3 seconds.
NPU should spike to 60–90% during token generation. If it stays at 0%, the model fell back to CPU — check your ov_config.
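To see the cold-compile vs. warm-cache difference concretely, wrap the load in a timer. A generic sketch; the `timed` decorator is this guide's helper, not an OpenVINO API:

```python
import time
from functools import wraps

def timed(label: str):
    """Decorator that prints the wall-clock time of a call and returns its result."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            t0 = time.perf_counter()
            result = fn(*args, **kwargs)
            print(f"{label}: {time.perf_counter() - t0:.2f}s")
            return result
        return wrapper
    return deco

# Usage (hypothetical): time the NPU model load. Expect tens of seconds
# on the first run, a few seconds once ./ov_cache is populated.
# load = timed("NPU model load")(OVModelForCausalLM.from_pretrained)
# model = load(model_path, device="NPU", ov_config={"CACHE_DIR": "./ov_cache"})
```

Run the script twice: the second load time is your cache working.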
Step 4: Verify NPU Is Actually Being Used
# OpenVINO will silently fall back to CPU for unsupported ops.
# The compiled model reports which devices actually execute it.
# (Attribute layout varies by optimum-intel version; in recent releases
# model.request is the decoder's InferRequest.)
compiled = model.request.get_compiled_model()
print(compiled.get_property("EXECUTION_DEVICES"))
# Should show: NPU, or NPU,CPU (heterogeneous — normal for some ops)
If you see only CPU, the model config or driver is the issue. Try adding "ENABLE_CPU_FALLBACK": "YES" explicitly — some driver versions need it stated.
Solution B: iGPU Fallback with DirectML
No NPU, or NPU doesn't support your model? Use DirectML to run on the integrated GPU. This works on any DirectX 12 GPU — Intel Iris Xe, AMD Radeon 890M, Nvidia.
Step 1: Install ONNX Runtime with DirectML
pip install onnxruntime-directml
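Before loading a model, check that the DirectML EP actually shipped with your wheel. A small sketch; `pick_provider` is illustrative, not an ONNX Runtime API — in practice you'd feed it `onnxruntime.get_available_providers()`:

```python
def pick_provider(available, preferred=("DmlExecutionProvider", "CPUExecutionProvider")):
    """Return the first preferred execution provider that is installed."""
    for p in preferred:
        if p in available:
            return p
    return "CPUExecutionProvider"

# Usage (assumes onnxruntime-directml is installed):
# import onnxruntime as ort
# provider = pick_provider(ort.get_available_providers())
print(pick_provider(["DmlExecutionProvider", "CPUExecutionProvider"]))
```

If this resolves to the CPU provider, you likely installed plain `onnxruntime` instead of `onnxruntime-directml`.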
Step 2: Export to ONNX
# Export to ONNX with optimum
optimum-cli export onnx \
  --model microsoft/Phi-3.5-mini-instruct \
  --device cpu \
  --trust-remote-code \
  ./phi35-mini-onnx
Step 3: Run with DirectML EP
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# DirectML execution provider routes to the iGPU automatically
model = ORTModelForCausalLM.from_pretrained(
    "./phi35-mini-onnx",
    provider="DmlExecutionProvider",  # DirectML — picks best available GPU
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
inputs = tokenizer("Hello, world:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Solution C: Qualcomm Snapdragon X with QNN
Snapdragon X Elite/Plus has a Hexagon NPU that requires Qualcomm's QNN runtime.
# Install QNN execution provider (Windows ARM64 only)
pip install onnxruntime-qnn
import onnxruntime as ort

# QNN EP — routes to Hexagon NPU
session_options = ort.SessionOptions()
providers = [
    ("QNNExecutionProvider", {
        "backend_path": "QnnHtp.dll",      # HTP = Hexagon Tensor Processor
        "htp_performance_mode": "burst",   # Max performance mode
        "qnn_context_cache_enable": "1",   # Cache compiled graph (option names vary by ORT version)
        "qnn_context_cache_path": "./qnn_cache.bin",
    }),
    "CPUExecutionProvider",  # Fallback
]

sess = ort.InferenceSession("model.onnx", sess_options=session_options, providers=providers)
QNN model requirement: Models must be quantized to INT8 or INT4 before running on HTP. Use Qualcomm's AI Hub (aihub.qualcomm.com) to pre-convert popular models — they publish pre-optimized Phi, Mistral, and LLaMA variants.
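The silent-fallback gotcha applies here too: if the QNN EP fails to initialize, ONNX Runtime quietly runs everything on the CPU EP. `InferenceSession.get_providers()` returns the providers the session actually activated, in priority order, so a quick check (the helper name is ours) looks like:

```python
def qnn_active(active_providers) -> bool:
    """True when the session's top-priority provider is the QNN EP,
    i.e. inference is actually hitting the Hexagon NPU."""
    return bool(active_providers) and active_providers[0] == "QNNExecutionProvider"

# Usage with the session from the snippet above:
# assert qnn_active(sess.get_providers()), "QNN EP failed to load -- fell back to CPU"
print(qnn_active(["QNNExecutionProvider", "CPUExecutionProvider"]))  # True
```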
Verification
Run this benchmark to confirm you got a speedup:
import time
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
prompt = "List 10 programming languages and their primary use cases:"
inputs = tokenizer(prompt, return_tensors="pt")
# Warm-up pass (uses cache)
_ = model.generate(**inputs, max_new_tokens=10)
# Timed run
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start
tokens_generated = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"Tokens/sec: {tokens_generated / elapsed:.1f}")
print(f"Time to 100 tokens: {elapsed:.2f}s")
Baseline targets on Intel Core Ultra 7 155H with Phi-3.5-mini INT4:
- CPU only: ~8–12 tok/s
- NPU (OpenVINO): ~18–28 tok/s
- iGPU DirectML (Iris Xe): ~15–22 tok/s
Token throughput comparison. Results vary by model size and quantization level.
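As a sanity check on your own numbers: taking the midpoints of the ranges above, the NPU path works out to roughly a 2x gain over CPU. A trivial calculation:

```python
def speedup(baseline_tps: float, accel_tps: float) -> float:
    """Relative throughput of the accelerated path vs. the CPU baseline."""
    return accel_tps / baseline_tps

# Midpoints of the ranges quoted above
cpu_tps, npu_tps, igpu_tps = 10.0, 23.0, 18.5
print(f"NPU vs CPU:  {speedup(cpu_tps, npu_tps):.1f}x")   # 2.3x
print(f"iGPU vs CPU: {speedup(cpu_tps, igpu_tps):.1f}x")  # 1.9x
```

If your measured speedup is below ~1.5x, suspect silent CPU fallback before blaming the hardware.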
What You Learned
- NPUs need vendor-specific runtimes — there's no universal driver
- OpenVINO + INT4 is the most production-ready path for Intel NPUs in 2026
- First-run compile time is normal; always enable model caching
- DirectML is the safest fallback — it works on any DirectX 12 device
- Silent CPU fallback is the #1 gotcha — always verify execution device after loading
Model size limits to know:
- Intel NPU: 3B–7B INT4 models comfortably fit; 13B+ will partially spill to CPU
- Qualcomm HTP: 7B INT4 is the practical ceiling without chunked inference
- iGPU DirectML: Constrained by shared VRAM (typically 4–8GB) — use INT4 for anything above 3B
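These limits follow directly from weight size: parameters times bits/8, plus headroom for KV cache and activations. A back-of-envelope sketch; the 1.2x overhead factor is a rule of thumb, not a vendor spec:

```python
def weight_footprint_gib(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """Approximate memory to hold quantized weights plus runtime headroom."""
    return params_billions * 1e9 * bits / 8 / 2**30 * overhead

for params, bits in [(3, 4), (7, 4), (13, 4), (7, 8)]:
    print(f"{params}B @ INT{bits}: ~{weight_footprint_gib(params, bits):.1f} GiB")
```

A 13B INT4 model lands around 7 GiB, which is why it spills past NPU memory, and why INT8 at 7B already strains a 8 GB shared-VRAM iGPU.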
When NOT to use this:
- Batch inference or server workloads — dedicated GPU is better
- Models above 13B — NPU memory limits force painful tiling
- If your workflow needs streaming with complex sampling — NPU EPs have limited sampler support vs llama.cpp
Tested on Intel Core Ultra 7 155H (Meteor Lake), Qualcomm Snapdragon X Elite, Windows 11 24H2, Ubuntu 24.04. OpenVINO 2024.6, ONNX Runtime 1.20, Optimum 1.21.