Problem: Your CPU Runs Hot and Slow During Local LLM Inference
You're running Mistral, Phi-3, or LLaMA locally and your CPU hits 90°C, throttles, and still crawls along at a handful of tokens per second. Your laptop fan sounds like a jet engine.
Meanwhile, your chip has a dedicated Neural Processing Unit (NPU) sitting completely idle.
You'll learn:
- How to detect and validate your integrated NPU
- Which runtime to use for your hardware (Intel, AMD, Qualcomm, Apple)
- How to route inference through the NPU with ONNX Runtime or OpenVINO
- What model formats and quantization levels actually work in 2026
Time: 20 min | Level: Intermediate
Why This Happens
Most LLM runtimes — llama.cpp, Ollama, LM Studio — default to CPU or GPU. They don't auto-detect NPUs because NPU drivers, runtimes, and model formats vary wildly by vendor.
Your NPU (Intel AI Boost, Qualcomm Hexagon, Apple Neural Engine, AMD XDNA) is a dedicated low-power matrix accelerator. It's built for exactly the ops transformers lean on: matrix multiplication, attention, softmax. It runs them at a fraction of the CPU's wattage.
Common symptoms of CPU-only inference:
- Token generation under 10 tokens/sec on modern hardware
- CPU cores pinned at 100% during inference
- Laptop battery drains in under 2 hours
- Windows Task Manager shows NPU at 0% utilization
This is the problem. NPU at 0%, CPU screaming — we're going to fix this.
Before You Start: Check Your NPU
Not all integrated NPUs are worth routing inference to. Check yours first.
# Windows — check for NPU in Device Manager programmatically
powershell -Command "Get-PnpDevice | Where-Object {$_.FriendlyName -like '*NPU*' -or $_.FriendlyName -like '*Neural*' -or $_.FriendlyName -like '*VPU*'}"
# Linux
lspci | grep -i "neural\|npu\|vpu"
# Or check for Intel VPU
ls /dev/accel*
What you need to proceed:
- Intel Core Ultra (Meteor Lake / Arrow Lake / Lunar Lake) — Intel NPU with AI Boost, 10–48 TOPS
- Qualcomm Snapdragon X Elite/Plus — Hexagon NPU, 45 TOPS
- Apple Silicon (M1+) — Apple Neural Engine (use Core ML path, not this guide)
- AMD Ryzen AI (Phoenix, Strix) — AMD XDNA NPU, 10–50 TOPS
If you have an older chip with no NPU, skip to the iGPU fallback section.
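Once you know what the OS reports, it helps to map the device name to the right toolchain before installing anything. A minimal sketch; the helper names and keyword lists below are this guide's illustration, not part of any vendor SDK:

```python
import glob
import platform

def npu_device_nodes():
    """List Linux 'accel' device nodes -- drivers like intel_vpu expose
    the NPU as /dev/accel/accel0. Returns [] on other platforms."""
    if platform.system() != "Linux":
        return []
    return sorted(glob.glob("/dev/accel/accel*"))

def suggest_runtime(device_name: str) -> str:
    """Map a reported device/brand string to the runtime used in this guide."""
    n = device_name.lower()
    if any(k in n for k in ("qualcomm", "hexagon", "snapdragon")):
        return "QNN"
    if any(k in n for k in ("intel", "ai boost", "vpu")):
        return "OpenVINO"
    if any(k in n for k in ("amd", "xdna", "ryzen ai")):
        return "ONNX Runtime (Vitis AI EP)"
    if any(k in n for k in ("apple", "neural engine")):
        return "Core ML"
    return "unknown -- use the iGPU fallback"

print(suggest_runtime("Intel(R) AI Boost"))  # OpenVINO
```

Feed it whatever string Device Manager or lspci printed in the previous step.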
Solution A: Intel NPU with OpenVINO + ONNX Runtime
This path works on Intel Core Ultra chips (Meteor Lake and later; earlier Core generations have no NPU). OpenVINO is Intel's inference SDK and has the most mature NPU support.
Step 1: Install Dependencies
pip install openvino==2024.6.0
pip install onnxruntime-openvino
pip install optimum[openvino] # Hugging Face integration
Verify the NPU is visible to OpenVINO:
from openvino import Core
core = Core()
print(core.available_devices)
# Should print: ['CPU', 'GPU', 'NPU']
If NPU is missing: Install the latest Intel NPU driver from intel.com/npu-driver. Reboot, then re-run.
Step 2: Export Your Model to OpenVINO IR
Use Optimum to export a Hugging Face model directly to OpenVINO's Intermediate Representation (IR) format. INT4 quantization is the sweet spot — it fits most 3B–7B models in NPU memory.
# Export Phi-3.5-mini to OpenVINO IR with INT4 weight quantization
optimum-cli export openvino \
  --model microsoft/Phi-3.5-mini-instruct \
  --weight-format int4 \
  --trust-remote-code \
  ./phi35-mini-ov-int4
This takes 5–10 minutes. You'll get a folder with .xml and .bin files.
If you see OOM during export: Switch to --weight-format int8. INT4 requires more RAM during the conversion process (not during inference).
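If you're exporting several models, or retrying at int8 after an OOM, scripting the CLI invocation keeps the runs reproducible. A small sketch; `export_cmd` is a convenience helper for this guide, not part of Optimum:

```python
import shlex

def export_cmd(model_id: str, out_dir: str, weight_format: str = "int4") -> str:
    """Build the optimum-cli export command shown above as a single string."""
    return shlex.join([
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--weight-format", weight_format,  # drop to "int8" if export OOMs
        "--trust-remote-code",
        out_dir,
    ])

print(export_cmd("microsoft/Phi-3.5-mini-instruct", "./phi35-mini-ov-int4"))
```

Pass the result to `subprocess.run(..., shell=True)` or paste it into a terminal.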
Step 3: Run Inference on the NPU
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
model_path = "./phi35-mini-ov-int4"
# Load with NPU as the device
# HINT: "NPU" routes compute; "CPU" is the fallback for unsupported ops
model = OVModelForCausalLM.from_pretrained(
    model_path,
    device="NPU",  # Route to Intel NPU
    ov_config={
        "PERFORMANCE_HINT": "LATENCY",  # Optimize for fast first token
        "NUM_STREAMS": "1",             # Single stream = lower latency
        "CACHE_DIR": "./ov_cache",      # Cache compiled model, avoids recompiling
    },
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Explain transformer attention in one paragraph:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Expected first run: Slow (30–60 seconds) while OpenVINO compiles the model for your NPU. Subsequent runs use the cache and start in under 3 seconds.
NPU should spike to 60–90% during token generation. If it stays at 0%, the model fell back to CPU — check your ov_config.
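To see the cold-compile vs. warm-cache difference concretely, wrap the load in a timer. A generic sketch; the `timed` decorator is this guide's helper, not an OpenVINO API:

```python
import time
from functools import wraps

def timed(label: str):
    """Decorator that prints the wall-clock time of a call and returns its result."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            t0 = time.perf_counter()
            result = fn(*args, **kwargs)
            print(f"{label}: {time.perf_counter() - t0:.2f}s")
            return result
        return wrapper
    return deco

# Usage (hypothetical): time the NPU model load. Expect tens of seconds
# on the first run, a few seconds once ./ov_cache is populated.
# load = timed("NPU model load")(OVModelForCausalLM.from_pretrained)
# model = load(model_path, device="NPU", ov_config={"CACHE_DIR": "./ov_cache"})
```

Run the script twice: the second load time is your cache working.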
Step 4: Verify NPU Is Actually Being Used
# OpenVINO will silently fall back to CPU for unsupported ops.
# The compiled model reports which devices actually execute it.
# (Attribute layout varies by optimum-intel version; in recent releases
# model.request is the decoder's InferRequest.)
compiled = model.request.get_compiled_model()
print(compiled.get_property("EXECUTION_DEVICES"))
# Should show: NPU, or NPU,CPU (heterogeneous — normal for some ops)
If you see only CPU, the model config or driver is the issue. Try adding "ENABLE_CPU_FALLBACK": "YES" explicitly — some driver versions need it stated.
Solution B: iGPU Fallback with DirectML
No NPU, or NPU doesn't support your model? Use DirectML to run on the integrated GPU. This works on any DirectX 12 GPU — Intel Iris Xe, AMD Radeon 890M, Nvidia.
Step 1: Install ONNX Runtime with DirectML
pip install onnxruntime-directml
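Before loading a model, check that the DirectML EP actually shipped with your wheel. A small sketch; `pick_provider` is illustrative, not an ONNX Runtime API — in practice you'd feed it `onnxruntime.get_available_providers()`:

```python
def pick_provider(available, preferred=("DmlExecutionProvider", "CPUExecutionProvider")):
    """Return the first preferred execution provider that is installed."""
    for p in preferred:
        if p in available:
            return p
    return "CPUExecutionProvider"

# Usage (assumes onnxruntime-directml is installed):
# import onnxruntime as ort
# provider = pick_provider(ort.get_available_providers())
print(pick_provider(["DmlExecutionProvider", "CPUExecutionProvider"]))
```

If this resolves to the CPU provider, you likely installed plain `onnxruntime` instead of `onnxruntime-directml`.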
Step 2: Export to ONNX
# Export to ONNX with optimum
optimum-cli export onnx \
  --model microsoft/Phi-3.5-mini-instruct \
  --device cpu \
  --trust-remote-code \
  ./phi35-mini-onnx
Step 3: Run with DirectML EP
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# DirectML execution provider routes to the iGPU automatically
model = ORTModelForCausalLM.from_pretrained(
    "./phi35-mini-onnx",
    provider="DmlExecutionProvider",  # DirectML — picks best available GPU
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
inputs = tokenizer("Hello, world:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Solution C: Qualcomm Snapdragon X with QNN
Snapdragon X Elite/Plus has a Hexagon NPU that requires Qualcomm's QNN runtime.
# Install QNN execution provider (Windows ARM64 only)
pip install onnxruntime-qnn
import onnxruntime as ort

# QNN EP — routes to Hexagon NPU
session_options = ort.SessionOptions()
providers = [
    ("QNNExecutionProvider", {
        "backend_path": "QnnHtp.dll",      # HTP = Hexagon Tensor Processor
        "htp_performance_mode": "burst",   # Max performance mode
        "qnn_context_cache_enable": "1",   # Cache compiled graph (option names vary by ORT version)
        "qnn_context_cache_path": "./qnn_cache.bin",
    }),
    "CPUExecutionProvider",  # Fallback
]

sess = ort.InferenceSession("model.onnx", sess_options=session_options, providers=providers)
QNN model requirement: Models must be quantized to INT8 or INT4 before running on HTP. Use Qualcomm's AI Hub (aihub.qualcomm.com) to pre-convert popular models — they publish pre-optimized Phi, Mistral, and LLaMA variants.
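The silent-fallback gotcha applies here too: if the QNN EP fails to initialize, ONNX Runtime quietly runs everything on the CPU EP. `InferenceSession.get_providers()` returns the providers the session actually activated, in priority order, so a quick check (the helper name is ours) looks like:

```python
def qnn_active(active_providers) -> bool:
    """True when the session's top-priority provider is the QNN EP,
    i.e. inference is actually hitting the Hexagon NPU."""
    return bool(active_providers) and active_providers[0] == "QNNExecutionProvider"

# Usage with the session from the snippet above:
# assert qnn_active(sess.get_providers()), "QNN EP failed to load -- fell back to CPU"
print(qnn_active(["QNNExecutionProvider", "CPUExecutionProvider"]))  # True
```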
Verification
Run this benchmark to confirm you got a speedup:
import time
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
prompt = "List 10 programming languages and their primary use cases:"
inputs = tokenizer(prompt, return_tensors="pt")
# Warm-up pass (uses cache)
_ = model.generate(**inputs, max_new_tokens=10)
# Timed run
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start
tokens_generated = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"Tokens/sec: {tokens_generated / elapsed:.1f}")
print(f"Time to 100 tokens: {elapsed:.2f}s")
Baseline targets on Intel Core Ultra 7 155H with Phi-3.5-mini INT4:
- CPU only: ~8–12 tok/s
- NPU (OpenVINO): ~18–28 tok/s
- iGPU DirectML (Iris Xe): ~15–22 tok/s
Token throughput comparison. Results vary by model size and quantization level.
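As a sanity check on your own numbers: taking the midpoints of the ranges above, the NPU path works out to roughly a 2x gain over CPU. A trivial calculation:

```python
def speedup(baseline_tps: float, accel_tps: float) -> float:
    """Relative throughput of the accelerated path vs. the CPU baseline."""
    return accel_tps / baseline_tps

# Midpoints of the ranges quoted above
cpu_tps, npu_tps, igpu_tps = 10.0, 23.0, 18.5
print(f"NPU vs CPU:  {speedup(cpu_tps, npu_tps):.1f}x")   # 2.3x
print(f"iGPU vs CPU: {speedup(cpu_tps, igpu_tps):.1f}x")  # 1.9x
```

If your measured speedup is below ~1.5x, suspect silent CPU fallback before blaming the hardware.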
What You Learned
- NPUs need vendor-specific runtimes — there's no universal driver
- OpenVINO + INT4 is the most production-ready path for Intel NPUs in 2026
- First-run compile time is normal; always enable model caching
- DirectML is the safest fallback — it works on any DirectX 12 device
- Silent CPU fallback is the #1 gotcha — always verify execution device after loading
Model size limits to know:
- Intel NPU: 3B–7B INT4 models comfortably fit; 13B+ will partially spill to CPU
- Qualcomm HTP: 7B INT4 is the practical ceiling without chunked inference
- iGPU DirectML: Constrained by shared VRAM (typically 4–8GB) — use INT4 for anything above 3B
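These limits follow directly from weight size: parameters times bits/8, plus headroom for KV cache and activations. A back-of-envelope sketch; the 1.2x overhead factor is a rule of thumb, not a vendor spec:

```python
def weight_footprint_gib(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """Approximate memory to hold quantized weights plus runtime headroom."""
    return params_billions * 1e9 * bits / 8 / 2**30 * overhead

for params, bits in [(3, 4), (7, 4), (13, 4), (7, 8)]:
    print(f"{params}B @ INT{bits}: ~{weight_footprint_gib(params, bits):.1f} GiB")
```

A 13B INT4 model lands around 7 GiB, which is why it spills past NPU memory, and why INT8 at 7B already strains a 8 GB shared-VRAM iGPU.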
When NOT to use this:
- Batch inference or server workloads — dedicated GPU is better
- Models above 13B — NPU memory limits force painful tiling
- If your workflow needs streaming with complex sampling — NPU EPs have limited sampler support vs llama.cpp
Tested on Intel Core Ultra 7 155H (Meteor Lake), Qualcomm Snapdragon X Elite, Windows 11 24H2, Ubuntu 24.04. OpenVINO 2024.6, ONNX Runtime 1.20, Optimum 1.21.