Problem: You Don't Know Your Actual LLM Performance
You're running local LLMs on your NVIDIA RTX GPU but have no idea if you're getting good token speeds. Online benchmarks use different models and settings, so the numbers don't match your real-world usage.
You'll learn:
- How to accurately measure tokens/second with llama.cpp and Ollama
- Why your GPU might be underperforming
- Expected speeds for RTX 4060-4090 at different quantization levels
Time: 20 min | Level: Intermediate
Why This Matters
Token generation speed determines how responsive your AI workflow feels. At a slow 10 tokens/sec, a 300-token response takes 30 seconds; at 100 tokens/sec, the same response arrives in about three seconds. Knowing your baseline helps you optimize model choice, quantization, and context length.
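That arithmetic is worth keeping handy when comparing models. A tiny helper (the function name is mine) makes the latency tradeoff concrete:

```python
def response_latency(tokens: int, tokens_per_sec: float) -> float:
    """Seconds to generate `tokens` at a given generation speed."""
    return tokens / tokens_per_sec

# A 300-token response at 10 tok/s vs. 100 tok/s
print(response_latency(300, 10))   # 30.0 seconds
print(response_latency(300, 100))  # 3.0 seconds
```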
Common symptoms of poor performance:
- Model loads but generates text slower than expected
- VRAM usage far below the model's size (weights not offloaded to the GPU)
- CPU pegged at 100% instead of GPU doing the work
Prerequisites
Hardware:
- NVIDIA RTX 3060 or newer (12GB+ VRAM recommended)
- Latest NVIDIA drivers (545+ for RTX 40 series)
Software:
# Check CUDA availability
nvidia-smi
Expected: GPU listed with CUDA version 12.0+
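If you want to capture GPU details programmatically (useful for the baseline file in Step 6), `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader` emits one CSV line per GPU. This sketch parses such a line; the sample values are illustrative, not from a real run:

```python
def parse_gpu_csv(line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader`."""
    name, driver, mem = [field.strip() for field in line.split(",")]
    return {"name": name, "driver": driver, "vram": mem}

sample = "NVIDIA GeForce RTX 4070, 545.29.06, 12282 MiB"  # illustrative output line
print(parse_gpu_csv(sample))
```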
Solution
Step 1: Install llama.cpp with CUDA
llama.cpp gives you direct access to performance metrics without abstractions.
# Clone and build with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1
# Note: recent llama.cpp versions replaced the Makefile with CMake:
# cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
# Verify CUDA build
./llama-cli --version
Expected: Should show "CUDA support: YES"
If it fails:
- "nvcc not found": Install the CUDA toolkit: sudo apt install nvidia-cuda-toolkit
- "CUDA support: NO": Check that the CUDA_PATH environment variable points to your CUDA install
Step 2: Download a Test Model
Use Llama 3.1 8B as the baseline - it's well-optimized and widely used.
# Download Q4_K_M quantization (good speed/quality balance)
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# Verify file size
ls -lh Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
Expected: ~4.9GB file
Step 3: Run Benchmark with Controlled Parameters
# Benchmark with realistic settings
./llama-cli \
-m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-p "Explain quantum computing in simple terms." \
-n 512 \
-c 2048 \
-ngl 99 \
--verbose-prompt
# -n 512: Generate 512 tokens
# -c 2048: Context window size
# -ngl 99: Offload all layers to GPU
# --verbose-prompt: Print the tokenized prompt (the perf timing summary is printed by default)
Watch for these metrics in output:
llama_perf_context_print: load time = 2847.85 ms
llama_perf_context_print: prompt eval time = 45.13 ms / 12 tokens (3.76 ms per token)
llama_perf_context_print: eval time = 6234.18 ms / 512 runs (12.18 ms per token)
llama_perf_context_print: total time = 6312.34 ms / 524 tokens
Key metric: eval time shows generation speed. In this example: 512 tokens / 6.234 seconds = 82 tokens/sec
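If you benchmark repeatedly, you can extract that number from the log automatically. This sketch parses the `llama_perf_context_print` lines shown above; note the prompt line ends in "tokens" rather than "runs", so the regex skips it:

```python
import re

def tokens_per_sec(log: str) -> float:
    """Generation speed from llama.cpp's perf summary ('eval time' line)."""
    m = re.search(r"eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*runs", log)
    ms, runs = float(m.group(1)), int(m.group(2))
    return runs / (ms / 1000.0)

sample = """llama_perf_context_print: prompt eval time = 45.13 ms / 12 tokens (3.76 ms per token)
llama_perf_context_print: eval time = 6234.18 ms / 512 runs (12.18 ms per token)"""
print(f"{tokens_per_sec(sample):.1f} tokens/sec")  # 82.1 tokens/sec
```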
Step 4: Test Multiple Quantization Levels
Different quantizations trade quality for speed. Test what works for your use case.
# Download additional quantizations
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf # Higher quality
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q3_K_M.gguf # Faster
# Create benchmark script
cat > benchmark.sh << 'EOF'
#!/bin/bash
for model in *.gguf; do
  echo "Testing: $model"
  # grep -v drops the "prompt eval time" line so only generation speed is shown
  ./llama-cli -m "$model" -p "Write a short poem about AI." -n 256 -c 2048 -ngl 99 2>&1 | grep "eval time" | grep -v "prompt"
  echo "---"
done
EOF
chmod +x benchmark.sh
./benchmark.sh
Expected output pattern:
Testing: Meta-Llama-3.1-8B-Instruct-Q3_K_M.gguf
eval time = 2847.45 ms / 256 runs (11.12 ms per token) # ~90 tokens/sec
---
Testing: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
eval time = 3119.23 ms / 256 runs (12.18 ms per token) # ~82 tokens/sec
---
Testing: Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
eval time = 4234.67 ms / 256 runs (16.54 ms per token) # ~60 tokens/sec
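The file sizes scale with bits per weight, which lets you predict whether a quantization will fit in VRAM before downloading it. The bits-per-weight figures below are approximations I'm assuming for these K-quant formats; exact GGUF sizes vary slightly by architecture:

```python
# Approximate bits per weight per quantization (assumed values; varies by model).
BITS_PER_WEIGHT = {"Q3_K_M": 3.9, "Q4_K_M": 4.85, "Q8_0": 8.5}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Rough size of the weights alone, excluding KV cache and overhead."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

# Llama 3.1 8B actually has ~8.03B parameters
print(f"{model_size_gb(8.03, 'Q4_K_M'):.1f} GB")  # 4.9 GB, matching the file in Step 2
```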
Step 5: Benchmark with Ollama (Real-World Usage)
Ollama wraps llama.cpp but adds API overhead. Measure this separately.
# Install Ollama if not present
curl -fsSL https://ollama.com/install.sh | sh
# Pull model
ollama pull llama3.1:8b
# Create Python benchmark script
cat > ollama_bench.py << 'EOF'
import ollama
import time
def benchmark(model, prompt, tokens=512):
    start = time.time()
    token_count = 0
    # Each streamed chunk is roughly one token, so chunk count approximates tokens
    for chunk in ollama.generate(model=model, prompt=prompt, stream=True):
        token_count += 1
        if token_count >= tokens:
            break
    duration = time.time() - start
    tps = token_count / duration
    print(f"Model: {model}")
    print(f"Tokens: {token_count}")
    print(f"Time: {duration:.2f}s")
    print(f"Speed: {tps:.2f} tokens/sec\n")

benchmark("llama3.1:8b", "Explain how neural networks learn.", 512)
EOF
python3 ollama_bench.py
Expected: 5-10% slower than llama.cpp due to API overhead. RTX 4070 should see ~75-80 tokens/sec.
If it fails:
- "connection refused": Start the Ollama service: ollama serve
- Speed < 30 tokens/sec: Check that the GPU is being used: run nvidia-smi during generation
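Counting streamed chunks only approximates token count. Ollama also attaches its own server-side timings to the final response (`eval_count`, and `eval_duration` in nanoseconds), which exclude client overhead. A sketch of computing speed from those fields; the sample values here are illustrative:

```python
def ollama_tps(response: dict) -> float:
    """Tokens/sec from the metadata on a completed Ollama generate response."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Illustrative final-response fields; in practice take them from a
# non-streaming ollama.generate(...) result or the last streamed chunk.
sample = {"eval_count": 512, "eval_duration": 6_400_000_000}  # 6.4 s in ns
print(f"{ollama_tps(sample):.1f} tokens/sec")  # 80.0 tokens/sec
```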
Step 6: Document Your Baseline
Create a reference file for your hardware.
cat > my_llm_baseline.txt << EOF
Hardware: $(nvidia-smi --query-gpu=name --format=csv,noheader)
VRAM: $(nvidia-smi --query-gpu=memory.total --format=csv,noheader)
Driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
Date: $(date +%Y-%m-%d)
Llama 3.1 8B Results:
- Q3_K_M: 90 tokens/sec (3.2GB VRAM)
- Q4_K_M: 82 tokens/sec (4.9GB VRAM)
- Q8_0: 60 tokens/sec (8.1GB VRAM)
Ollama API: 75 tokens/sec (Q4_K_M equivalent)
EOF
cat my_llm_baseline.txt
Verification
Run a final test with your most common use case:
# Long context test (simulates RAG workload)
./llama-cli \
-m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-p "$(cat large_document.txt) Summarize this." \
-n 512 \
-c 8192 \
-ngl 99
You should see: Token speed stays within 10% of your baseline even with large context.
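That 10% rule of thumb can be codified for use in a regression script (function name and threshold handling are mine):

```python
def within_baseline(baseline_tps: float, measured_tps: float, tolerance: float = 0.10) -> bool:
    """True if measured speed has not dropped more than `tolerance` below baseline."""
    return measured_tps >= baseline_tps * (1 - tolerance)

print(within_baseline(82, 76))  # True: ~7% drop, within tolerance
print(within_baseline(82, 60))  # False: ~27% drop, investigate
```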
Expected Performance by GPU
Based on real-world testing (Llama 3.1 8B Q4_K_M):
RTX 4090:
- Tokens/sec: 140-160
- VRAM usage: 4.9GB (24GB total)
- Power draw: ~350W under load
RTX 4080:
- Tokens/sec: 110-125
- VRAM usage: 4.9GB (16GB total)
- Power draw: ~280W
RTX 4070 Ti:
- Tokens/sec: 90-105
- VRAM usage: 4.9GB (12GB total)
- Power draw: ~250W
RTX 4070:
- Tokens/sec: 75-85
- VRAM usage: 4.9GB (12GB total)
- Power draw: ~200W
RTX 4060 Ti (16GB):
- Tokens/sec: 60-70
- VRAM usage: 4.9GB (16GB total)
- Power draw: ~160W
If your numbers are 20%+ lower:
- Check thermal throttling: GPU temp should be < 83°C
- Verify all layers on GPU: Look for "offloaded X/X layers to GPU" in the load log
- Update drivers: Performance improved significantly in 545+ series
- Check for CPU bottleneck: CPU usage should be < 30% during generation
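The table and the 20% rule combine into a quick self-check. The ranges below are copied from the figures above (Llama 3.1 8B Q4_K_M); the helper itself is my sketch:

```python
# Expected tokens/sec ranges from the table above (Llama 3.1 8B Q4_K_M).
EXPECTED_TPS = {
    "RTX 4090": (140, 160),
    "RTX 4080": (110, 125),
    "RTX 4070 Ti": (90, 105),
    "RTX 4070": (75, 85),
    "RTX 4060 Ti": (60, 70),
}

def underperforming(gpu: str, measured_tps: float) -> bool:
    """True if measured speed is 20%+ below the low end of the expected range."""
    low, _high = EXPECTED_TPS[gpu]
    return measured_tps < low * 0.8

print(underperforming("RTX 4070", 50))  # True: check throttling and layer offload
print(underperforming("RTX 4070", 78))  # False: within expectations
```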
What You Learned
- llama.cpp provides accurate token/sec metrics for any GGUF model
- Q4_K_M quantization offers the best speed/quality tradeoff for most uses
- Ollama adds 5-10% overhead but matches llama.cpp for actual generation
- Your baseline helps evaluate if model changes will improve workflow speed
Limitations:
- These benchmarks use synthetic prompts - real workloads vary
- Batch processing (multiple prompts) has different characteristics
- Context window size significantly impacts speed above 8K tokens
Next steps:
- Test your specific models and quantizations
- Benchmark with your typical prompt lengths
- Profile different context window sizes for your use case
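For that last step, a sketch that generates llama-cli invocations across context sizes; the model path, prompt, and sweep values are placeholders to adapt, and the flags mirror the ones used in Step 3:

```python
import shlex

def bench_command(model: str, ctx: int, n_tokens: int = 256) -> list[str]:
    """Argument vector for one llama-cli run at a given context size."""
    return ["./llama-cli", "-m", model,
            "-p", "Write a short poem about AI.",
            "-n", str(n_tokens), "-c", str(ctx), "-ngl", "99"]

# Print shell-ready commands for a sweep of context sizes
for ctx in (2048, 4096, 8192, 16384):
    print(shlex.join(bench_command("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf", ctx)))
```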
Tested on RTX 4070 (12GB), CUDA 12.3, llama.cpp commit 4c5c8f1, Ubuntu 22.04