Compiling llama.cpp from source gives you full control over which acceleration backend runs your models — CPU-only for portability, CUDA for NVIDIA GPUs, or Metal for Apple Silicon. Pre-built binaries often lag behind releases by days and can miss hardware-specific tuning flags.
You'll learn:
- Build llama.cpp on Ubuntu 24.04, Windows 11, and macOS (M1–M4)
- Enable CUDA on NVIDIA cards and Metal on Apple Silicon
- Verify each backend is actually being used, not silently falling back to CPU
Time: 25 min | Difficulty: Intermediate
Why Build From Source?
Pre-built releases ship with conservative defaults. CUDA builds require matching your local CUDA toolkit version. Metal builds for Apple Silicon need the right macOS SDK. Getting either wrong at runtime means silent CPU fallback — quantized inference at 3 tokens/sec instead of 60.
Symptoms of a misconfigured build:
- `ggml_cuda_init: no CUDA devices found` in logs despite a working GPU
- `llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx (4096)` with no GPU layers loaded
- `metal: GPU not supported` on an M-series Mac running a non-Metal binary
- Token generation speed under 5 t/s on hardware that should do 40+
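One way to triage these symptoms is to save a run's stderr and classify it mechanically. A minimal sketch — the grep targets match the init lines shown above, and `backend_from_log` is a hypothetical helper, not part of llama.cpp:

```shell
# Hypothetical helper: given a saved run log, report which backend initialized.
backend_from_log() {
  if grep -q "ggml_cuda_init" "$1"; then echo cuda
  elif grep -q "ggml_metal_init" "$1"; then echo metal
  else echo cpu
  fi
}
# Usage:
#   ./build/bin/llama-cli -m model.gguf -p "hi" -n 1 2>&1 | tee run.log >/dev/null
#   backend_from_log run.log
```

If this prints `cpu` on a machine with a working GPU, the binary was built without the matching backend flag.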
Backend Overview
llama.cpp build paths: each CMake flag activates a different GGML compute backend at link time
| Backend | Flag | Hardware | Platform |
|---|---|---|---|
| CPU (default) | (none needed) | Any x86-64 / ARM | Linux, Windows, macOS |
| CUDA | -DGGML_CUDA=ON | NVIDIA RTX / Tesla / A-series | Linux, Windows |
| Metal | -DGGML_METAL=ON | Apple M1 / M2 / M3 / M4 | macOS 13+ |
| Vulkan | -DGGML_VULKAN=ON | AMD / Intel Arc | Linux, Windows |
| ROCm | -DGGML_HIPBLAS=ON | AMD RX 7000-series | Linux |
This guide covers CPU, CUDA, and Metal. Vulkan and ROCm follow the same CMake pattern — swap the flag.
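Since every backend follows the same configure pattern, one convenient setup is a small helper that maps a backend name to its flag and builds into a per-backend directory. A sketch under that assumption — `backend_flag` is a hypothetical helper, not part of llama.cpp:

```shell
# Hypothetical helper: map a backend name to its CMake flag (empty = CPU default)
backend_flag() {
  case "$1" in
    cpu)    echo "" ;;
    cuda)   echo "-DGGML_CUDA=ON" ;;
    metal)  echo "-DGGML_METAL=ON" ;;
    vulkan) echo "-DGGML_VULKAN=ON" ;;
    rocm)   echo "-DGGML_HIPBLAS=ON" ;;
    *)      echo "unknown backend: $1" >&2; return 1 ;;
  esac
}
# Usage: one build directory per backend, so CMake caches never mix
#   b=cuda
#   cmake -B "build-$b" -DCMAKE_BUILD_TYPE=Release $(backend_flag "$b")
#   cmake --build "build-$b" --config Release -j"$(nproc)"
```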
Prerequisites
All Platforms
- `git` 2.40+
- `cmake` 3.21+
- C++17-capable compiler: GCC 12+ (Linux), MSVC 2022+ (Windows), Clang 15+ (macOS)
CUDA (Linux / Windows)
- NVIDIA driver 525+ — check with `nvidia-smi`
- CUDA Toolkit 12.x — download from developer.nvidia.com (free, no account needed for the toolkit)
- `nvcc` on `PATH` — verify with `nvcc --version`
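These checks can be scripted so a missing tool fails fast before a long configure run. A small sketch in plain POSIX shell — `check_tool` is a hypothetical helper:

```shell
# Report ok/missing for each required command without aborting the shell
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}
for t in git cmake nvcc nvidia-smi; do check_tool "$t"; done
```

Any `missing:` line means the matching prerequisite above still needs installing.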
Metal (macOS)
- macOS 13 Ventura or newer
- Xcode Command Line Tools: `xcode-select --install`
- No extra GPU SDK needed — Metal is bundled with macOS
Step 1: Clone the Repository
# Always clone the main branch — tagged releases lag behind CUDA/Metal fixes
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Check the latest commit date:
git log --oneline -5
Expected output: Five recent commits with dates within the last few days. If you see months-old hashes, you have a stale mirror — re-clone from the canonical URL above.
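If you want to script the freshness check, compare the newest commit's timestamp against the current time. A sketch — `commit_age_days` is a hypothetical helper, and the 30-day threshold is an arbitrary choice:

```shell
# Hypothetical freshness check: age of a commit in whole days
commit_age_days() {  # args: commit_epoch now_epoch
  echo $(( ($2 - $1) / 86400 ))
}
# Usage inside the clone:
#   age=$(commit_age_days "$(git log -1 --format=%ct)" "$(date +%s)")
#   [ "$age" -le 30 ] || echo "warning: tip is $age days old — possible stale mirror"
```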
Step 2: CPU-Only Build (All Platforms)
CPU build compiles everywhere with zero extra dependencies. Use this to verify your toolchain before adding GPU flags.
Linux / macOS
cmake -B build \
-DCMAKE_BUILD_TYPE=Release # Release applies -O3; Debug is 4–5× slower
cmake --build build --config Release -j$(nproc)   # macOS has no nproc; use -j$(sysctl -n hw.logicalcpu)
Windows (PowerShell)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $env:NUMBER_OF_PROCESSORS
Expected output (last lines):
[100%] Linking CXX executable llama-cli
[100%] Built target llama-cli
Binaries land in build/bin/. Run a quick sanity check:
./build/bin/llama-cli --version
Step 3: CUDA Build (NVIDIA GPUs — Linux / Windows)
Verify CUDA Toolkit First
nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver, release 12.4
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# NVIDIA GeForce RTX 4090, 551.23, 24564 MiB
If nvcc is missing, install the CUDA Toolkit. On Ubuntu 24:
# CUDA 12.4 on Ubuntu 24 — match your driver's max supported CUDA version
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-4
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
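To confirm the install left a usable toolkit on `PATH`, you can extract the release number from `nvcc --version` and compare it by eye against the CUDA version `nvidia-smi` reports in its banner. A sketch — `nvcc_release` is a hypothetical helper that reads stdin:

```shell
# Pull the toolkit release (e.g. "12.4") out of nvcc's version banner
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
# Usage:
#   nvcc --version | nvcc_release   # e.g. 12.4
#   nvidia-smi | head -n 4          # banner shows the driver's max supported CUDA version
```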
Build with CUDA
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON # Main flag — activates cuBLAS matrix multiply kernels
cmake --build build --config Release -j$(nproc)
Optional CUDA tuning flags:
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DGGML_CUDA_F16=ON \
-DCMAKE_CUDA_ARCHITECTURES="89"
# GGML_CUDA_F16: half-precision — 15–20% faster on Ampere+ (RTX 30/40-series)
# CMAKE_CUDA_ARCHITECTURES: sm_89 = RTX 40-series; skips JIT compilation for older archs
CUDA architecture codes: 75 = Turing (RTX 20xx) · 86 = Ampere (RTX 30xx) · 89 = Ada Lovelace (RTX 40xx) · 90 = Hopper (H100).
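Rather than memorizing the table, you can ask the driver for the card's compute capability and strip the dot. This assumes `compute_cap` is a supported `--query-gpu` field, which holds on reasonably recent drivers; `cap_to_arch` is a hypothetical helper:

```shell
# Query the compute capability (prints e.g. "8.9" on an RTX 4090):
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
cap_to_arch() { echo "$1" | tr -d '.'; }   # "8.9" -> "89" for CMAKE_CUDA_ARCHITECTURES
```

Usage: `cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(cap_to_arch 8.9)"`.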
Expected build output includes:
-- CUDA found
-- cuBLAS found
-- Using CUDA architectures: 89
Verify GPU is Being Used
./build/bin/llama-cli \
-m path/to/model.gguf \
-p "Hello" \
-n 10 \
--n-gpu-layers 99 # Offload all layers to GPU; reduce if OOM
Look for this in the output:
llm_load_tensors: offloaded 32/32 layers to GPU
llm_load_tensors: VRAM used: 7842 MiB
If you see offloaded 0/32, the binary was linked without CUDA — rebuild from a clean build/ directory:
rm -rf build && cmake -B build -DGGML_CUDA=ON ...
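The offload count can also be checked mechanically from a saved run log, which is handy in a setup script or CI. A sketch — `offloaded_layers` is a hypothetical helper:

```shell
# Extract the offloaded-layer count from a saved llama.cpp run log
offloaded_layers() {
  sed -n 's/.*offloaded \([0-9][0-9]*\)\/[0-9][0-9]* layers.*/\1/p' "$1"
}
# Usage: fail fast if nothing landed on the GPU
#   [ "$(offloaded_layers run.log)" -gt 0 ] || echo "no GPU layers — rebuild with -DGGML_CUDA=ON"
```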
Step 4: Metal Build (Apple Silicon — macOS)
Metal is enabled by default when building on Apple Silicon with Clang. Passing the flag explicitly is still good practice: it documents intent, and it makes the configure step fail loudly if the Metal framework can't be found instead of silently producing a CPU-only binary.
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_METAL=ON # Explicitly enable the Metal backend (the default on Apple Silicon)
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
Expected output includes:
-- Metal framework found
-- Using Metal
Verify Metal GPU Usage
./build/bin/llama-cli \
-m path/to/model.gguf \
-p "Hello" \
-n 10 \
--n-gpu-layers 1 # Start with 1 layer to confirm Metal path activates
Look for:
ggml_metal_init: GPU name: Apple M3 Pro
ggml_metal_init: recommendedMaxWorkingSetSize = 18.00 GB
llm_load_tensors: offloaded 1/32 layers to GPU
Then raise --n-gpu-layers 99 to offload the full model. An M2 Pro with 16GB unified memory can offload a full Llama 3.1 8B Q4_K_M (≈4.9 GB) with headroom to spare.
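To estimate headroom before committing to a full offload, compare the model file's size against the working-set figure Metal reports. A sketch — `fits_in_budget` is a hypothetical helper, and `stat -f%z` is assumed to be the macOS flag for file size in bytes:

```shell
# yes/no: does the model file fit under a byte budget?
fits_in_budget() {  # args: model_bytes budget_bytes
  if [ "$1" -lt "$2" ]; then echo yes; else echo no; fi
}
# Usage with the 18 GB recommendedMaxWorkingSetSize from the log above:
#   fits_in_budget "$(stat -f%z model.gguf)" $((18 * 1024 * 1024 * 1024))
```

Leave some slack below the budget: the KV cache and compute buffers consume memory on top of the weights.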
Step 5: Run a Model End-to-End
Download a small model to verify your build:
# Llama 3.2 1B — 800 MB, fast to download, good smoke test
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
Run inference:
./build/bin/llama-cli \
-m Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-p "Explain what llama.cpp does in two sentences." \
-n 80 \
--n-gpu-layers 99 \
--threads 8 # CPU threads for layers not on GPU; match physical core count
Benchmark your build:
./build/bin/llama-bench \
-m Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99
Expected throughput ballpark (prompt processing / generation):
| Hardware | Prompt t/s | Generation t/s |
|---|---|---|
| RTX 4090 (CUDA, Q4_K_M 8B) | ~3800 | ~120 |
| M3 Pro 18GB (Metal, Q4_K_M 8B) | ~900 | ~65 |
| Ryzen 9 7950X (CPU-only, Q4_K_M 8B) | ~180 | ~12 |
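As a quick sanity check on these numbers, the generation-speed gap between backends is just a ratio. A throwaway helper (`speedup` is not a llama.cpp tool):

```shell
# Integer generation-speed ratio between two backends, from the table above
speedup() { echo $(( $1 / $2 )); }
speedup 120 12   # prints 10 — the RTX 4090 generates ~10x faster than the 7950X here
```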
What You Learned
- `cmake -B build` separates source and build directories — always use this over in-source builds
- `-DGGML_CUDA=ON` links cuBLAS; without it the binary silently uses CPU even on a GPU machine
- `--n-gpu-layers 99` is how you actually push layers to GPU at runtime — the build flag enables the capability, the CLI flag uses it
- Clean the `build/` directory when switching backends — CMake caches flags, and a partial CUDA cache causes subtle failures
Tested on llama.cpp b5150, Ubuntu 24.04 + CUDA 12.4 + RTX 4080, macOS 14.4 + M3 Pro, Windows 11 + CUDA 12.4 + RTX 4070 Ti
FAQ
Q: Do I need to rebuild if I update llama.cpp with git pull?
A: Yes. Run cmake --build build --config Release -j$(nproc) again from the repo root. CMake only recompiles changed files, so incremental builds are fast — usually under 90 seconds.
Q: What's the difference between -DGGML_CUDA=ON and -DLLAMA_CUBLAS=ON?
A: LLAMA_CUBLAS was the old flag, retired when the build options moved under the ggml namespace in mid-2024. It no longer works on b4000+. Always use -DGGML_CUDA=ON on current builds.
Q: How much VRAM do I need to offload a full 7B model?
A: A Q4_K_M quantized 7B model needs roughly 4.5–5 GB of VRAM. An 8 GB card (RTX 3070, RTX 4060) fits it with room for context. Use --n-gpu-layers 20 to partially offload if you're short on VRAM.
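For partial offload, a back-of-envelope per-layer figure is enough to pick an `--n-gpu-layers` value. A sketch that assumes layers are roughly equal in size — `mib_per_layer` is a hypothetical helper:

```shell
# MiB per layer, assuming an even split of weights across transformer layers
mib_per_layer() {  # args: model_size_mib n_layers
  echo $(( $1 / $2 ))
}
mib_per_layer 4700 32   # prints 146 — so --n-gpu-layers 20 costs roughly 2.9 GB of VRAM
```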
Q: Can I build CUDA and Metal into the same binary?
A: No. They are mutually exclusive backends. Build separate binaries and use whichever matches the machine you're running on.
Q: Does the CPU build use AVX2/AVX-512 automatically?
A: Yes — llama.cpp's CMake detects your host CPU features at configure time and enables AVX2 or AVX-512 VNNI where available. You can override with -DGGML_AVX2=OFF if cross-compiling for older hardware.