Build a Private Code Copilot in 30 Minutes with CodeLlama

Run your own AI coding assistant locally with CodeLlama 34B. Get autocomplete without sending code to external APIs.

Problem: Your Code Shouldn't Leave Your Machine

GitHub Copilot ships your code context to Microsoft's servers on virtually every keystroke. That's a dealbreaker for proprietary codebases, compliance requirements, or privacy-sensitive work.

You'll learn:

  • Run CodeLlama 34B locally with GPU acceleration
  • Integrate with VS Code for real-time suggestions
  • Tune inference for sub-2-second completion latency
  • Handle context windows for large files

Time: 30 min | Level: Intermediate


Why This Matters

Cloud-based copilots process your code on remote servers, creating risks:

  • Legal exposure for licensed or proprietary code
  • Latency from network round-trips (500ms+ typical)
  • Offline limitations when internet drops
  • Cost at scale ($10-20/user/month)

Common symptoms of needing this:

  • Legal team blocks Copilot adoption
  • Working on air-gapped systems
  • Team wants AI coding without subscriptions
  • Need lower latency than cloud round-trips allow

Prerequisites

Hardware requirements:

  • GPU: NVIDIA with 16GB+ VRAM (RTX 4080/A4000 minimum)
  • RAM: 32GB system memory
  • Storage: 20GB free space
  • OS: Linux or macOS (Windows WSL2 works)

Software:

  • NVIDIA driver 535 or newer (Ollama and the Continue.dev extension are installed in the steps below)

Check GPU compatibility:

nvidia-smi
# Should show your GPU with driver 535+

Solution

Step 1: Install Ollama

Ollama manages LLM models and serves them via API.

# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version
# Should show: ollama version 0.1.23 or newer

Why Ollama: It handles quantization, GPU offloading, and model caching automatically. The main alternative, llama.cpp, works but requires manual configuration.

If it fails:

  • "CUDA not found": Install NVIDIA drivers first (sudo apt install nvidia-driver-535)
  • macOS Metal error: Update to macOS 13.3+ for Metal API support

Step 2: Download CodeLlama 34B

# Pull the model (19GB download)
ollama pull codellama:34b-code

# This takes 10-15 minutes on gigabit internet

Model choice: 34B balances quality and speed. Use codellama:13b-code for 8GB GPUs or codellama:70b-code if you have 40GB+ VRAM.

Quantization note: Ollama uses Q4_K_M quantization by default. This reduces 34B from 68GB to 19GB with minimal accuracy loss.
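
The arithmetic behind those sizes is straightforward: on-disk size is roughly parameter count times bits per weight. A back-of-envelope sketch (the ~4.5 bits/weight average for Q4_K_M is an approximation, not an exact spec):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # On-disk size estimate: parameters x bits per weight, converted to GB
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# FP16 baseline: 16 bits per weight
print(round(model_size_gb(34, 16)))   # 68
# Q4_K_M averages roughly 4.5 bits per weight (mixed 4- and 6-bit blocks)
print(round(model_size_gb(34, 4.5)))  # 19
```

The same arithmetic predicts why the 13B model fits in 8GB of VRAM and why 70B needs a 40GB+ card.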


Step 3: Test Inference Speed

# Start server (runs in background)
ollama serve

# In a new terminal, test a completion
time ollama run codellama:34b-code "def fibonacci(n):"

Expected output:

def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

Performance targets:

  • First token: <500ms (with the model already loaded; cold starts take longer)
  • Subsequent tokens: 30-50 tokens/sec
  • RTX 4090: ~45 tok/s
  • RTX 4080: ~35 tok/s

If slow (<20 tok/s):

  • Check GPU utilization: nvidia-smi should show 90%+ GPU usage
  • Verify CUDA offload: ollama ps shows "100% GPU"
  • Reduce context: set num_ctx to 2048 (inside an ollama run session: /set parameter num_ctx 2048)
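
Rather than eyeballing terminal output, you can compute throughput from the eval_count and eval_duration fields in Ollama's /api/generate response; a minimal sketch (the sample numbers are illustrative):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    # Ollama reports eval_duration in nanoseconds
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 135 tokens generated over 3 seconds
print(tokens_per_second(135, 3_000_000_000))  # 45.0
```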

Step 4: Install VS Code Extension

# Install Continue.dev extension
code --install-extension continue.continue

Configure Continue (~/.continue/config.json):

{
  "models": [
    {
      "title": "CodeLlama Local",
      "provider": "ollama",
      "model": "codellama:34b-code",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "CodeLlama Local",
    "provider": "ollama",
    "model": "codellama:34b-code"
  },
  "allowAnonymousTelemetry": false,
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}
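
A malformed config.json is a common reason autocomplete silently fails. A quick sanity check, sketched here with the JSON inlined for illustration (swap in a read of the real file; this assumes the config is plain JSON without comments):

```python
import json

# In practice: cfg = json.loads(open(os.path.expanduser("~/.continue/config.json")).read())
cfg = json.loads("""
{
  "models": [{"title": "CodeLlama Local", "provider": "ollama",
              "model": "codellama:34b-code",
              "apiBase": "http://localhost:11434"}],
  "allowAnonymousTelemetry": false
}
""")
assert cfg["models"][0]["provider"] == "ollama"
assert cfg["allowAnonymousTelemetry"] is False
print("config OK")
```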

Why Continue.dev: Open source, supports multiple providers, no telemetry. Alternatives: Tabby (requires a separate server) or Codeium (falls back to the cloud).


Step 5: Configure Context Window

Create ~/.continue/prompts/system.txt:

You are an expert programmer. Provide concise, working code.
Current file language: {{{language}}}
User request: Complete the code logically.

Rules:
- No explanations unless asked
- Match existing code style
- Prefer standard library over dependencies
- Return only code, no markdown fences

Context optimization (add these fields to the model entry in ~/.continue/config.json):

{
  "contextLength": 4096,
  "completionOptions": {
    "maxTokens": 500,
    "temperature": 0.2,
    "topP": 0.95
  }
}

Why these settings:

  • 4096 context: Fits ~150 lines of code with imports
  • temperature 0.2: Deterministic completions (less creativity)
  • maxTokens 500: Prevents runaway generations
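
The ~150-line figure follows from the common heuristic of roughly 4 characters per token for source code; a quick budget check under that assumption (real tokenizers vary):

```python
def estimate_tokens(chars: int) -> int:
    # Rough heuristic: ~4 characters per token for source code
    return chars // 4

# 150 lines at ~80 characters each
context_tokens = estimate_tokens(150 * 80)
print(context_tokens)  # 3000
# Leaves headroom for a 500-token completion inside the 4096 window
assert context_tokens + 500 <= 4096
```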

Step 6: Test Real-World Completion

Open VS Code, create test.py:

import pandas as pd

def analyze_sales(df: pd.DataFrame):
    # Start typing here and press Tab

Expected behavior:

  • Suggestions appear in gray text within 1-2 seconds
  • Press Tab to accept, Esc to reject
  • Multi-line suggestions for function bodies

Quality check:

# Should suggest something like:
    total_sales = df['amount'].sum()
    avg_sale = df['amount'].mean()
    return {
        'total': total_sales,
        'average': avg_sale,
        'count': len(df)
    }

Verification

Performance test:

# Measure completion latency
time curl http://localhost:11434/api/generate -d '{
  "model": "codellama:34b-code",
  "prompt": "def quicksort(arr):",
  "stream": false
}'

You should see:

  • Total time: <2 seconds
  • Tokens/sec: 30+
  • Memory usage: ~18GB VRAM (check nvidia-smi)
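
The non-streaming response carries the timing fields needed to check these targets. A sketch that parses a sample payload (the numbers below are illustrative, not measured):

```python
import json

# Truncated example of a non-streaming /api/generate response
sample = """{
  "response": "    if len(arr) <= 1: return arr ...",
  "total_duration": 1800000000,
  "eval_count": 64,
  "eval_duration": 1600000000
}"""

r = json.loads(sample)
total_s = r["total_duration"] / 1e9                  # nanoseconds -> seconds
tok_s = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"total {total_s:.1f}s, {tok_s:.0f} tok/s")    # total 1.8s, 40 tok/s
assert total_s < 2 and tok_s >= 30
```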

VS Code integration test:

  1. Open a Python/TypeScript file
  2. Type function calculateTax( and wait
  3. Should suggest parameter types and function body

Optimization Tips

Speed Up Cold Starts

Set environment variables for the Ollama service (in your shell profile, or as Environment= lines in the systemd unit):

export OLLAMA_KEEP_ALIVE=24h
export OLLAMA_NUM_PARALLEL=2

Why: Keeps model in VRAM for 24 hours, reduces first-request latency from 3s to 200ms.
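
keep_alive can also be set per request, which is handy for a scripted warm-up: a generate request with an empty prompt loads the model without producing output. A sketch of such a payload (the 24h value mirrors the setting above):

```python
import json

# Warm-up payload: empty prompt loads the model and pins it in VRAM for 24h
payload = {
    "model": "codellama:34b-code",
    "prompt": "",
    "keep_alive": "24h",
    "stream": False,
}
# POST this to http://localhost:11434/api/generate
print(json.dumps(payload))
```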

Reduce Memory Usage

Use smaller quantization:

ollama pull codellama:34b-code-q4_0
# Saves 3GB VRAM with slight quality drop

Multi-Language Support

Pull additional models:

ollama pull codellama:34b-python  # Python-specific
ollama pull codellama:34b-instruct # General instructions

Configure in Continue:

{
  "models": [
    {
      "title": "Python",
      "model": "codellama:34b-python",
      "languages": ["python"]
    },
    {
      "title": "General",
      "model": "codellama:34b-code",
      "languages": ["*"]
    }
  ]
}

What You Learned

  • Ollama simplifies local LLM deployment with automatic GPU management
  • CodeLlama 34B provides quality comparable to cloud copilots
  • Context window tuning balances memory and relevance
  • Local inference achieves sub-2-second latency on consumer GPUs

Limitations:

  • GPU required: CPU inference is 10-20x slower (unusable for autocomplete)
  • No fine-tuning: Model doesn't learn your codebase (yet - RAG coming soon)
  • Occasional hallucinations: Review suggestions, don't blindly accept

When NOT to use this:

  • Your company has already approved Copilot and network latency isn't a concern
  • No NVIDIA GPU is available (AMD/Intel support is experimental)
  • Teams under 5 people, where setup overhead may outweigh the $10-20/user/month subscription

Production Checklist

Security:

  • Ollama bound to localhost only (default)
  • No telemetry in Continue config (allowAnonymousTelemetry: false)
  • Firewall blocks port 11434 from external access

Monitoring:

# Check model memory usage
ollama ps

# View request logs
journalctl -u ollama -f

Team deployment:

  • Use Docker Compose with ollama/ollama:latest
  • Mount models to persistent volume
  • Configure the NVIDIA runtime in docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ollama-models:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Troubleshooting

"Model not found":

ollama list  # Check installed models
ollama pull codellama:34b-code  # Re-download if missing

Completions timing out:

  • Reduce contextLength to 2048 in Continue config
  • Check GPU isn't running other workloads (nvidia-smi)
  • Update NVIDIA drivers to 535+ (nvidia-smi shows version)

VS Code suggestions not appearing:

  1. Open Continue sidebar (Cmd/Ctrl+Shift+P → "Continue: Open")
  2. Check connection status (should show green dot)
  3. Verify Ollama running: curl http://localhost:11434/api/tags

Out of memory errors:

  • Switch to 13B model: ollama pull codellama:13b-code
  • Close other GPU applications
  • Restart Ollama: sudo systemctl restart ollama

Tested on Ubuntu 22.04, NVIDIA RTX 4090, Ollama 0.1.23, VS Code 1.86, CodeLlama 34B Q4_K_M