# Problem: Your AI Model Runs on CPU Instead of the NPU
You've got a Snapdragon X Elite or X Plus machine. It has a 45 TOPS NPU sitting idle while your model grinds away on the CPU at 10x the latency and power draw.
Qualcomm's toolchain is fragmented across multiple SDKs, and the docs assume you already know which one to use. This guide cuts through that.
You'll learn:
- Which Qualcomm SDK to use and how to install it
- How to convert a PyTorch or ONNX model to run on the NPU
- How to run inference and verify it's actually hitting the NPU
Time: 45 min | Level: Intermediate
## Why This Happens
Snapdragon X devices expose three compute units — CPU, GPU, and NPU (Hexagon) — but the OS doesn't route AI workloads to the NPU automatically. You need to compile your model into a format the Hexagon NPU understands: .dlc (Deep Learning Container).
The Qualcomm AI Engine Direct SDK (QAIRT) handles this. It replaced the older QNN and SNPE SDKs in 2024, though you'll still see both names in older tutorials.
Common symptoms:
- Task Manager shows 0% NPU utilization during inference
- Model runs slower than expected on an "AI PC"
- `onnxruntime` without the QNN execution provider defaults to CPU
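If your pipeline goes through ONNX Runtime, that last symptom is easy to confirm from Python. The QNN execution provider ships in the `onnxruntime-qnn` package; a stock `onnxruntime` install lists only CPU-class providers. This sketch assumes nothing beyond that (the `try`/`except` keeps it runnable even without the package installed):

```python
# Check whether ONNX Runtime can see the QNN execution provider.
# Assumption: the QNN EP comes from the onnxruntime-qnn package on
# Windows ARM64; a stock onnxruntime install reports CPU-only providers.
try:
    import onnxruntime as ort
    providers = ort.get_available_providers()
except ImportError:
    providers = []  # onnxruntime not installed at all

print(providers)
if "QNNExecutionProvider" not in providers:
    print("No QNN EP found: inference will silently fall back to CPU")
```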
## Solution
### Step 1: Install the Qualcomm AI Engine Direct SDK
Download QAIRT from Qualcomm's developer portal. You'll need a free account.
```shell
# Verify you're on a Snapdragon X device first
wmic cpu get name
# Should return: Snapdragon X Elite or Snapdragon X Plus
```
Run the installer, then set up your environment:
```powershell
# Windows (PowerShell)
$env:QAIRT_SDK_ROOT = "C:\Qualcomm\AIStack\QAIRT\2.27.0"
$env:Path += ";$env:QAIRT_SDK_ROOT\bin"

# Verify installation
qairt-version
# Expected: QAIRT SDK version 2.27.x
```
Expected: Version string printed with no errors.
If it fails:
- "qairt-version not recognized": `$env:` settings apply only to the current session. Open a fresh PowerShell window, re-run the setup commands above, and check that `$env:QAIRT_SDK_ROOT\bin` actually exists.
- Installer blocked by antivirus: Temporarily disable real-time protection — the SDK's unsigned binaries trigger false positives.
### Step 2: Convert Your Model to DLC Format
QAIRT's `qairt-converter` accepts PyTorch, TensorFlow, and ONNX models; ONNX is the most reliable path.
First, export your PyTorch model to ONNX:
```python
import torch
import torch.onnx

model = YourModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # Match your model's input shape

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,  # QAIRT works best with opset 17
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}},
)
```
Then convert to DLC:
```shell
qairt-converter \
  --input_network model.onnx \
  --output_path model.dlc \
  --input_dim input 1,3,224,224

# Expected output:
# [INFO] Conversion complete: model.dlc
# [INFO] Model size: 23.4 MB
```
*Successful conversion — note the op compatibility summary at the bottom*
If it fails:
- "Unsupported op: LayerNorm": Add the `--use_cpu_for_unsupported_ops` flag. Unsupported ops fall back to CPU automatically; everything else runs on the NPU.
- Shape mismatch errors: Double-check that `--input_dim` matches exactly what your model expects, including batch size.
### Step 3: Quantize for Maximum NPU Performance
The Hexagon NPU is optimized for INT8 operations. FP32 models run on the NPU but at reduced throughput. Quantization typically gives 3-4x speedup with minimal accuracy loss.
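What the quantizer does, in miniature: it picks a scale from the range observed in the calibration data and maps FP32 values onto 8-bit integers. Below is a minimal sketch of symmetric per-tensor INT8 quantization; QAIRT's actual scheme may differ (per-channel scales, asymmetric offsets), so treat this as illustration only:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8: map the observed FP32 range onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0  # this range estimate is what calibration provides
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.array([0.5, -1.2, 3.4, -0.05], dtype=np.float32)
q, scale = quantize_int8(x)
roundtrip_error = np.abs(q.astype(np.float32) * scale - x).max()
print(q, scale, roundtrip_error)  # round-trip error is bounded by about scale/2
```

This is also why random calibration data hurts: a scale fitted to the wrong range clips or crushes the real activations.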
You need a small calibration dataset — 100-500 representative samples:
```python
# prepare_calibration_data.py
import numpy as np

# Generate or load real samples from your dataset
calibration_inputs = []
for i in range(200):
    # Use real data if possible — random data produces poor quantization
    sample = load_real_sample(i)
    calibration_inputs.append(sample)

# Save as raw binary files QAIRT expects
for i, sample in enumerate(calibration_inputs):
    sample.astype(np.float32).tofile(f"cal_data/input_{i}.raw")
```
Run quantization:
```shell
qairt-quantizer \
  --input_dlc model.dlc \
  --output_dlc model_quantized.dlc \
  --input_list cal_data_list.txt \
  --act_bitwidth 8 \
  --weights_bitwidth 8

# cal_data_list.txt lists one .raw file path per line
```
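Since `cal_data_list.txt` is just one `.raw` path per line, a small helper can generate it from the calibration directory written above (the `cal_data/input_*.raw` layout matches that script):

```python
from pathlib import Path

# Collect the calibration files written by prepare_calibration_data.py
raw_files = sorted(Path("cal_data").glob("input_*.raw"))

# One path per line, which is the format the quantizer's --input_list expects
with open("cal_data_list.txt", "w") as f:
    for path in raw_files:
        f.write(f"{path}\n")

print(f"Wrote {len(raw_files)} paths to cal_data_list.txt")
```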
If accuracy drops more than 2%: switch to `--act_bitwidth 16` so activations run at 16-bit while weights stay at 8-bit. This keeps most of the speedup while recovering accuracy.
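To judge whether you've crossed that 2% line without running a full evaluation, compare the quantized model's outputs against the FP32 model's on the same inputs. Signal-to-quantization-noise ratio is a cheap proxy; the ~30 dB rule of thumb below is a common heuristic, not a QAIRT-specific threshold:

```python
import numpy as np

def sqnr_db(reference: np.ndarray, test: np.ndarray) -> float:
    """Signal-to-quantization-noise ratio in dB; higher means closer agreement.
    Roughly 30 dB or more usually indicates negligible accuracy impact."""
    noise = reference - test
    return float(10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2)))

# Hypothetical softmax outputs from the FP32 and INT8 models on one sample
fp32_out = np.array([0.90, 0.05, 0.05])
int8_out = np.array([0.88, 0.06, 0.06])
print(f"SQNR: {sqnr_db(fp32_out, int8_out):.1f} dB")
```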
### Step 4: Run Inference on the NPU
```python
# infer.py
import numpy as np
import qairt_runtime as qrt

# Load the quantized model
session = qrt.InferenceSession(
    "model_quantized.dlc",
    runtime=qrt.Runtime.DSP,             # DSP = Hexagon NPU
    perf_profile=qrt.PerfProfile.BURST,  # Max clock for latency-sensitive tasks
)

# Prepare input — must match the shape from conversion
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
outputs = session.run(
    input_names=["input"],
    input_values=[input_data],
)

print(f"Output shape: {outputs[0].shape}")
print(f"Latency: {session.last_inference_time_ms:.1f} ms")
```
*Task Manager Neural Processor view — should spike during inference*
## Verification
Run the script and check two things: output correctness and actual NPU utilization.
```shell
python infer.py
```
You should see:
```
Output shape: (1, 1000)
Latency: 4.2 ms
```
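The `(1, 1000)` shape suggests an ImageNet-style classifier head. Turning those raw outputs into a prediction is plain NumPy; this is a generic post-processing sketch, independent of QAIRT:

```python
import numpy as np

def top1(logits: np.ndarray) -> tuple[int, float]:
    """Numerically stable softmax + argmax over a (1, num_classes) output."""
    z = logits - logits.max(axis=1, keepdims=True)  # shift for stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    idx = int(probs.argmax(axis=1)[0])
    return idx, float(probs[0, idx])

# Stand-in for the model output; in infer.py this would be outputs[0]
logits = np.random.randn(1, 1000).astype(np.float32)
cls, conf = top1(logits)
print(f"Predicted class {cls} (p={conf:.3f})")
```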
Open Task Manager → Performance → Neural Processor. You should see utilization spike during inference runs.
For a more precise check:
```shell
# Log detailed runtime info
qairt-profile \
  --model model_quantized.dlc \
  --runtime DSP \
  --iterations 100 \
  --output profile_report.html
```
Open `profile_report.html` to see per-layer execution — layers on the NPU are marked `HTP`, layers falling back to CPU are marked `CPU`.
*Green rows run on the NPU (HTP), yellow fall back to CPU — aim for 90%+ NPU coverage*
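If you extract the per-layer runtime labels from the report (how you parse the HTML is up to you), checking the 90% target is one line of arithmetic. A sketch with hypothetical labels:

```python
def npu_coverage(layer_runtimes: list[str]) -> float:
    """Fraction of layers dispatched to the Hexagon NPU (labeled 'HTP' in the report)."""
    return sum(r == "HTP" for r in layer_runtimes) / len(layer_runtimes)

# Hypothetical labels: 47 layers on the NPU, 3 falling back to CPU
labels = ["HTP"] * 47 + ["CPU"] * 3
coverage = npu_coverage(labels)
print(f"NPU coverage: {coverage:.0%}")
if coverage < 0.9:
    print("Heavy CPU fallback: check for unsupported ops before quantizing further")
```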
## What You Learned
- Snapdragon X NPU requires explicit model compilation to `.dlc` — no automatic routing
- QAIRT SDK is the current unified toolchain; ignore older SNPE/QNN tutorials for new projects
- INT8 quantization is essential for full NPU throughput — FP32 works but leaves performance on the table
- The `--use_cpu_for_unsupported_ops` flag handles ops the NPU can't run without failing the whole model
Limitations: QAIRT currently runs only on Windows ARM64 and Linux ARM64. macOS is not supported. The DSP runtime requires a Snapdragon X Elite or X Plus — older Snapdragon 8cx Gen 3 devices use a different runtime path.
When NOT to use this: For models that are mostly transformer attention layers, check NPU coverage with the profiler before committing to quantization — attention ops have mixed NPU support and you may see heavy CPU fallback that erases the latency gains.
Tested on Snapdragon X Elite (X1E-80-100), Windows 11 ARM 24H2, QAIRT SDK 2.27.0, Python 3.12