Problem: Your Code Shouldn't Leave Your Machine
GitHub Copilot ships snippets of your code to Microsoft's servers with every completion request. That's a dealbreaker for proprietary codebases, compliance requirements, or privacy concerns.
You'll learn:
- Run CodeLlama 34B locally with GPU acceleration
- Integrate with VS Code for real-time suggestions
- Optimize inference latency to under 2 seconds
- Handle context windows for large files
Time: 30 min | Level: Intermediate
Why This Matters
Cloud-based copilots process your code on remote servers, creating risks:
- Legal exposure for licensed or proprietary code
- Latency from network round-trips (500ms+ typical)
- Offline limitations when internet drops
- Cost at scale ($10-20/user/month)
Common symptoms of needing this:
- Legal team blocks Copilot adoption
- Working on air-gapped systems
- Team wants AI coding without subscriptions
- Need faster responses than cloud round-trips allow
Prerequisites
Hardware requirements:
- GPU: NVIDIA with 16GB+ VRAM (RTX 4080/A4000 minimum)
- RAM: 32GB system memory
- Storage: 20GB free space
- OS: Linux or macOS (Windows WSL2 works)
Software: NVIDIA driver 535 or newer.
Check GPU compatibility:
nvidia-smi
# Should show your GPU with driver 535+
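If you're scripting the setup, you can gate it on available VRAM by parsing nvidia-smi's CSV output. A minimal sketch; the 16GB threshold comes from the hardware requirements above, and the helper names are illustrative:

```python
import subprocess

def parse_vram_gb(csv_value: str) -> float:
    """Parse a value like '24576 MiB' from nvidia-smi's CSV output into GB."""
    amount, unit = csv_value.strip().split()
    assert unit == "MiB", f"unexpected unit: {unit}"
    return float(amount) / 1024

def gpu_has_enough_vram(min_gb: float = 16.0) -> bool:
    """Query total VRAM of the first GPU and compare against the minimum."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()[0]
    return parse_vram_gb(out) >= min_gb
```

Calling gpu_has_enough_vram() requires nvidia-smi on PATH; parse_vram_gb works anywhere.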
Solution
Step 1: Install Ollama
Ollama manages LLM models and serves them via API.
# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh
# Verify installation
ollama --version
# Should show: ollama version 0.1.23 or newer
Why Ollama: It handles quantization, GPU offloading, and model caching automatically. The main alternative, llama.cpp, requires manual configuration.
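Because Ollama serves models over a local HTTP API (port 11434 by default), you can query it from any language. A small sketch listing installed models via the /api/tags endpoint, assuming the server is running:

```python
import json
import urllib.request

def model_names(tags_json: dict) -> list[str]:
    """Extract model names from Ollama's /api/tags response."""
    return [m["name"] for m in tags_json.get("models", [])]

def list_local_models(base: str = "http://localhost:11434") -> list[str]:
    """Fetch and parse the installed-model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base}/api/tags") as resp:
        return model_names(json.load(resp))
```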
If it fails:
- "CUDA not found": Install NVIDIA drivers first (sudo apt install nvidia-driver-535)
- macOS Metal error: Update to macOS 13.3+ for Metal API support
Step 2: Download CodeLlama 34B
# Pull the model (19GB download)
ollama pull codellama:34b-code
# This takes 10-15 minutes on gigabit internet
Model choice: 34B balances quality and speed. Use codellama:13b-code for 8GB GPUs or codellama:70b-code if you have 40GB+ VRAM.
Quantization note: Ollama uses Q4_K_M quantization by default. This reduces 34B from 68GB to 19GB with minimal accuracy loss.
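The size reduction above is simple arithmetic: FP16 stores 2 bytes (16 bits) per weight, while Q4_K_M averages roughly 4.5 bits per weight (an approximation; the exact figure varies by tensor). A quick back-of-envelope check:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate model size: parameters x bits per weight, converted to GB."""
    # params_billions * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billions * bits_per_weight / 8

fp16_size = model_size_gb(34, 16)   # full-precision CodeLlama 34B
q4km_size = model_size_gb(34, 4.5)  # Q4_K_M at ~4.5 bits/weight
```

This reproduces the 68GB and ~19GB figures quoted above.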
Step 3: Test Inference Speed
# Start server (runs in background)
ollama serve
# In a new terminal, test a completion
time ollama run codellama:34b-code "def fibonacci(n):"
Expected output:
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
Performance targets:
- First token: <500ms (cold start)
- Subsequent tokens: 30-50 tokens/sec
- RTX 4090: ~45 tok/s
- RTX 4080: ~35 tok/s
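Instead of eyeballing time output, you can compute tokens/sec precisely: Ollama's /api/generate response includes eval_count (generated tokens) and eval_duration (nanoseconds). A sketch; the live call assumes the server from this step is running:

```python
import json
import urllib.request

def tokens_per_sec(resp: dict) -> float:
    """Tokens/sec from Ollama's eval_count (tokens) and eval_duration (ns)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

def benchmark(prompt: str, model: str = "codellama:34b-code") -> float:
    """Run one non-streaming completion and return its generation speed."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload, headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return tokens_per_sec(json.load(r))
```

A result below 30 from benchmark("def fibonacci(n):") means you've missed the targets above.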
If slow (<20 tok/s):
- Check GPU utilization: nvidia-smi should show 90%+ GPU usage
- Verify CUDA offload: ollama ps should show "100% GPU"
- Reduce context: set num_ctx to 2048 via the API options or a Modelfile
Step 4: Install VS Code Extension
# Install Continue.dev extension
code --install-extension continue.continue
Configure Continue (~/.continue/config.json):
{
  "models": [
    {
      "title": "CodeLlama Local",
      "provider": "ollama",
      "model": "codellama:34b-code",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "CodeLlama Local",
    "provider": "ollama",
    "model": "codellama:34b-code"
  },
  "allowAnonymousTelemetry": false,
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}
Why Continue.dev: Open source, supports multiple providers, no telemetry. Alternatives: Tabby (requires separate server) or Codeium (cloud fallback).
Step 5: Configure Context Window
Create ~/.continue/prompts/system.txt:
You are an expert programmer. Provide concise, working code.
Current file language: {{{language}}}
User request: Complete the code logically.
Rules:
- No explanations unless asked
- Match existing code style
- Prefer standard library over dependencies
- Return only code, no markdown fences
Context optimization (also in ~/.continue/config.json):
{
  "contextLength": 4096,
  "completionOptions": {
    "maxTokens": 500,
    "temperature": 0.2,
    "topP": 0.95
  }
}
Why these settings:
- 4096 context: fits ~150 lines of code with imports
- temperature 0.2: deterministic completions (less creativity)
- maxTokens 500: prevents runaway generations
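To sanity-check whether a file fits the 4096-token window, a rough 4-characters-per-token heuristic is usually close enough for code. This is an approximation, not CodeLlama's real tokenizer, and the helper names are illustrative:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for code: ~4 characters per token."""
    return len(text) // 4

def fits_context(path: str, context_length: int = 4096, reserve: int = 500) -> bool:
    """Check a file against the window, reserving room for the completion
    itself (the maxTokens setting above)."""
    with open(path, encoding="utf-8") as f:
        return estimate_tokens(f.read()) <= context_length - reserve
```

Files that fail this check will have their oldest context silently truncated, which is often why completions ignore imports at the top of long files.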
Step 6: Test Real-World Completion
Open VS Code, create test.py:
import pandas as pd
def analyze_sales(df: pd.DataFrame):
    # Start typing here and press Tab
Expected behavior:
- Suggestions appear in gray text within 1-2 seconds
- Press Tab to accept, Esc to reject
- Multi-line suggestions for function bodies
Quality check:
# Should suggest something like:
total_sales = df['amount'].sum()
avg_sale = df['amount'].mean()
return {
    'total': total_sales,
    'average': avg_sale,
    'count': len(df)
}
Verification
Performance test:
# Measure completion latency
time curl http://localhost:11434/api/generate -d '{
"model": "codellama:34b-code",
"prompt": "def quicksort(arr):",
"stream": false
}'
You should see:
- Total time: <2 seconds
- Tokens/sec: 30+
- Memory usage: ~18GB VRAM (check nvidia-smi)
VS Code integration test:
- Open a Python/TypeScript file
- Type function calculateTax( and wait
- Should suggest parameter types and a function body
Optimization Tips
Speed Up Cold Starts
Set environment variables for the Ollama service (e.g., in a systemd override or your shell profile):
export OLLAMA_KEEP_ALIVE=24h
export OLLAMA_NUM_PARALLEL=2
Why: Keeps the model in VRAM for 24 hours, cutting first-request latency from ~3s to ~200ms.
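The keep-alive window can also be set per request via the API's keep_alive field, which is handy when only some clients should pin the model in VRAM. A sketch of the request body (the helper name is illustrative):

```python
import json

def generate_payload(prompt: str, model: str = "codellama:34b-code",
                     keep_alive: str = "24h") -> bytes:
    """Build an /api/generate body that keeps the model loaded after the call."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,  # how long Ollama keeps the model in memory
    }).encode()
```

POSTing this to http://localhost:11434/api/generate resets the unload timer on every request.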
Reduce Memory Usage
Use smaller quantization:
ollama pull codellama:34b-code-q4_0
# Saves 3GB VRAM with slight quality drop
Multi-Language Support
Pull additional models:
ollama pull codellama:34b-python # Python-specific
ollama pull codellama:34b-instruct # General instructions
Configure in Continue:
{
  "models": [
    {
      "title": "Python",
      "model": "codellama:34b-python",
      "languages": ["python"]
    },
    {
      "title": "General",
      "model": "codellama:34b-code",
      "languages": ["*"]
    }
  ]
}
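Continue routes each request by the languages field; conceptually it's a first-match lookup, which is why the wildcard entry belongs last. A sketch with hypothetical data mirroring the config above (the actual Continue implementation may differ in detail):

```python
def pick_model(language: str, models: list[dict]) -> str:
    """Return the title of the first model whose languages list matches,
    treating '*' as a wildcard."""
    for m in models:
        if language in m["languages"] or "*" in m["languages"]:
            return m["title"]
    raise LookupError(f"no model configured for {language}")

MODELS = [
    {"title": "Python", "languages": ["python"]},
    {"title": "General", "languages": ["*"]},
]
```

If the wildcard entry came first, Python files would never reach the Python-specific model.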
What You Learned
- Ollama simplifies local LLM deployment with automatic GPU management
- CodeLlama 34B provides quality comparable to cloud copilots
- Context window tuning balances memory and relevance
- Local inference achieves sub-2-second latency on consumer GPUs
Limitations:
- GPU required: CPU inference is 10-20x slower (unusable for autocomplete)
- No fine-tuning: Model doesn't learn your codebase (yet - RAG coming soon)
- Occasional hallucinations: Review suggestions, don't blindly accept
When NOT to use this:
- Your company already uses Copilot legally (network latency isn't critical)
- No NVIDIA GPU available (AMD/Intel support experimental)
- Team <5 people (setup overhead vs. $10/month/user)
Production Checklist
Security:
- Ollama bound to localhost only (default)
- No telemetry in Continue config (allowAnonymousTelemetry: false)
- Firewall blocks port 11434 from external access
Monitoring:
# Check model memory usage
ollama ps
# View request logs
journalctl -u ollama -f
Team deployment:
- Use Docker Compose with ollama/ollama:latest
- Mount models to a persistent volume
- Configure the NVIDIA runtime in docker-compose.yml:
services:
  ollama:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ollama-models:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama-models:
Troubleshooting
"Model not found":
ollama list # Check installed models
ollama pull codellama:34b-code # Re-download if missing
Completions timing out:
- Reduce contextLength to 2048 in the Continue config
- Check the GPU isn't running other workloads (nvidia-smi)
- Update NVIDIA drivers to 535+ (nvidia-smi shows the version)
VS Code suggestions not appearing:
- Open Continue sidebar (Cmd/Ctrl+Shift+P → "Continue: Open")
- Check connection status (should show green dot)
- Verify Ollama is running: curl http://localhost:11434/api/tags
Out of memory errors:
- Switch to the 13B model: ollama pull codellama:13b-code
- Close other GPU applications
- Restart Ollama: sudo systemctl restart ollama
Tested on Ubuntu 22.04, NVIDIA RTX 4090, Ollama 0.1.23, VS Code 1.86, CodeLlama 34B Q4_K_M