Problem: Your Code Is Training Someone Else's AI
You want AI autocomplete and chat in your editor, but sending proprietary code to GitHub Copilot, Claude, or ChatGPT violates your company's data policy or makes you uncomfortable.
You'll learn:
- Run AI models locally on your machine
- Set up VS Code with private AI completion
- Configure chat and code generation without cloud APIs
- Weigh the performance tradeoffs between local and cloud models
Time: 20 min | Level: Intermediate
Why This Matters
Cloud-based AI coding tools send your code to external servers for processing. This creates privacy risk: your proprietary logic, API keys, and business rules leave your control and, depending on the provider's terms, may be retained or used for training. Many companies ban these tools entirely.
Common concerns:
- Code snippets sent to cloud providers for inference
- Potential data retention in training datasets
- Compliance violations (GDPR, HIPAA, SOC 2)
- No offline use: the tools stop working without an internet connection
Local AI models solve this by running inference on your hardware. Your code never leaves your machine.
Solution
Step 1: Install Ollama
Ollama runs open-source LLMs locally with a simple API.
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows (PowerShell as Admin)
winget install Ollama.Ollama
Expected: Installation completes, ollama command available.
Start the Ollama service:
ollama serve
If it fails:
- Port 11434 already in use: check whether Ollama is already running with ps aux | grep ollama
- macOS "Unidentified developer" warning: go to System Settings → Privacy & Security and click Allow
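For a scriptable version of that check, a short Python sketch (assuming Ollama's default port, 11434) can confirm something is listening:

```python
# Quick scriptable check that the Ollama server is listening.
# Port 11434 is Ollama's default; change it if you set OLLAMA_HOST.
import socket

def ollama_running(host: str = "localhost", port: int = 11434) -> bool:
    """Return True if something accepts TCP connections on the Ollama port."""
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except OSError:
        return False

print(ollama_running())  # True once `ollama serve` is up
```

This is handy in shell scripts or CI hooks that should skip AI features when the server is down.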
Step 2: Download a Coding Model
Pull a model optimized for code. DeepSeek Coder and CodeLlama work well for most tasks.
# Best for code completion (6.7B parameters)
ollama pull deepseek-coder:6.7b
# Alternative model (7B parameters)
ollama pull codellama:7b
# Best for chat/explanations (14B parameters; needs 16GB+ RAM)
ollama pull qwen2.5-coder:14b
Why these models: DeepSeek Coder 6.7B approaches Copilot-quality autocomplete on hardware with 8GB+ RAM. Qwen 2.5 Coder handles complex refactoring better but needs more memory.
Expected: Download progress shown, model ready when complete.
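Under the hood, Ollama exposes a REST API on port 11434, and editor integrations talk to its /api/generate endpoint. A minimal Python sketch of that request (the payload fields follow Ollama's API; the model name assumes the pull above):

```python
import json
import urllib.request

# Build the JSON request Ollama's /api/generate endpoint expects.
# "stream": False asks for one complete JSON response instead of chunks.
payload = {
    "model": "deepseek-coder:6.7b",
    "prompt": "def fibonacci(n: int) -> int:",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With `ollama serve` running, sending this returns JSON whose
# "response" field holds the completion:
#   body = json.load(urllib.request.urlopen(req))
#   print(body["response"])
print(req.get_full_url())
```

Nothing here leaves localhost, which is the whole point of the setup.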
Step 3: Install Continue Extension
Continue is an open-source VS Code extension that connects to local models.
- Open VS Code
- Install "Continue" extension from marketplace
- Click Continue icon in sidebar (⌘/Ctrl+L to open chat)
First run: Continue shows a config file. We'll set this up next.
Step 4: Configure Continue for Local Models
Open Continue settings (click gear icon in Continue sidebar), or edit ~/.continue/config.json:
{
"models": [
{
"title": "DeepSeek Coder",
"provider": "ollama",
"model": "deepseek-coder:6.7b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "DeepSeek Autocomplete",
"provider": "ollama",
"model": "deepseek-coder:6.7b",
"apiBase": "http://localhost:11434"
},
"embeddingsProvider": {
"provider": "ollama",
"model": "nomic-embed-text",
"apiBase": "http://localhost:11434"
}
}
Why this config:
- models: chat interface for asking questions
- tabAutocompleteModel: inline suggestions as you type
- embeddingsProvider: code context search (optional but helpful)
Pull the embedding model:
ollama pull nomic-embed-text
Expected: Continue now uses your local models. No API keys needed.
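A typo in config.json fails silently in some setups, so a quick sanity check helps. This is an illustrative sketch (the check_continue_config helper is hypothetical, not part of Continue) that flags the two most common mistakes:

```python
import json
from pathlib import Path

def check_continue_config(cfg: dict) -> list:
    """Return a list of problems found in a Continue config dict; empty means OK."""
    problems = []
    for model in cfg.get("models", []):
        # Every Ollama entry needs a "model" naming the pulled model tag.
        if model.get("provider") == "ollama" and "model" not in model:
            problems.append(f"entry missing 'model': {model.get('title')}")
    if "tabAutocompleteModel" not in cfg:
        problems.append("no tabAutocompleteModel: inline suggestions disabled")
    return problems

# Usage against the real file:
#   cfg = json.loads((Path.home() / ".continue" / "config.json").read_text())
#   print(check_continue_config(cfg))
```

An empty list means the two basics are covered; anything else points at the entry to fix.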
Step 5: Test It Out
Autocomplete test: Type a function signature and pause. Continue suggests completions within 1-2 seconds.
def calculate_fibonacci(n: int) -> int:
# Pause here - suggestion appears
Chat test:
Open Continue chat (⌘/Ctrl+L), ask: "Explain this function" while viewing code.
If completions are slow:
- >3 seconds delay: model too large for your RAM. Use codellama:7b instead
- No suggestions: check that Ollama is running: curl http://localhost:11434/api/tags
- Errors in Continue: open the Output panel and select "Continue" to see the logs
Verification
Create a test file:
// Type this signature and pause — an inline suggestion should appear
function sortUsers(users: User[])
You should see: an inline completion suggestion within about 2 seconds, with no traffic to any external AI endpoint (verified below with tcpdump).
Check network isolation:
# Monitor network while using Continue
sudo tcpdump -i any host api.openai.com or host api.anthropic.com
# Should show: No packets captured
Performance Comparison
Local (DeepSeek Coder 6.7B):
- Latency: 1-2 seconds per completion
- Quality: ~85% of Copilot on simple tasks
- Cost: Free, uses ~6GB RAM
- Works offline: Yes
Cloud (GitHub Copilot):
- Latency: 200-500ms per completion
- Quality: Baseline reference
- Cost: $10/month
- Works offline: No
When local models struggle:
- Complex refactoring across files (use cloud for this)
- Explaining unfamiliar frameworks (cloud has more training data)
- Long-form documentation generation
Use a hybrid approach: local models for daily work, a cloud API via a separate tool for the complex tasks above.
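As an illustrative sketch of that split (the heuristic, threshold, and function name are hypothetical, not a Continue or Ollama feature), a wrapper script could decide which backend a prompt deserves:

```python
# Hypothetical helper for a hybrid setup: keep everyday prompts on the
# local Ollama endpoint (localhost:11434) and flag heavyweight tasks
# for a cloud tool. The threshold and names are illustrative only.

def pick_backend(prompt: str, files_involved: int = 1) -> str:
    """Return "local" for everyday completions, "cloud" for heavy tasks."""
    if files_involved > 1 or len(prompt) > 4000:
        return "cloud"  # multi-file refactors or very long context
    return "local"      # daily autocomplete and chat stay on-device

print(pick_backend("def fib(n):"))                 # local
print(pick_backend("refactor", files_involved=3))  # cloud
```

The exact cutoff matters less than the principle: default to local, escalate deliberately.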
Advanced: Multiple Models
Add specialized models for different tasks:
{
"models": [
{
"title": "Chat (Qwen)",
"provider": "ollama",
"model": "qwen2.5-coder:14b"
},
{
"title": "Quick Autocomplete (CodeLlama)",
"provider": "ollama",
"model": "codellama:7b"
}
]
}
Switch between them in the Continue model dropdown. Use lighter models for autocomplete and heavier ones for chat.
What You Learned
- Ollama runs open-source LLMs locally with zero configuration
- Continue provides Copilot-like features without cloud dependencies
- 6.7B-parameter models approach cloud quality for most coding tasks
- Privacy and offline work come at the cost of 1-2 second latency
Limitation: local models need 8GB+ RAM; machines with 4GB won't run them smoothly.
When NOT to use local:
- You have fast internet and no privacy concerns
- You need cutting-edge model quality (GPT-4 class)
- You're working on a low-RAM machine (<8GB)
Troubleshooting
Ollama won't start:
# Check if service is running
ollama list
# Restart service
# macOS/Linux
killall ollama && ollama serve
# Windows: quit and relaunch the Ollama tray app, or if installed as a service:
Restart-Service Ollama
Out of memory errors:
# Use smaller model
ollama pull codellama:7b
# Or increase swap (Linux)
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Continue can't connect:
# Verify Ollama API is accessible
curl http://localhost:11434/api/tags
# Should return JSON with available models
Hardware Requirements
Minimum:
- 8GB RAM
- 10GB free disk space
- CPU: Any modern processor (M1/M2, Intel i5+, AMD Ryzen 5+)
Recommended:
- 16GB RAM (for 14B models)
- SSD (faster model loading)
- GPU optional (Ollama auto-detects and uses if available)
Apple Silicon (M1/M2/M3): best performance; Ollama is optimized for Metal acceleration.
Tested on Ollama 0.1.26, Continue 0.9.x, VS Code 1.86+, macOS 14 & Ubuntu 24.04