Problem: Your Code Is Training Someone Else's AI
You want AI autocomplete and chat in your editor, but sending proprietary code to GitHub Copilot, Claude, or ChatGPT violates your company's data policy or makes you uncomfortable.
You'll learn:
- Run AI models locally on your machine
- Set up VS Code with private AI completion
- Configure chat and code generation without cloud APIs
- Weigh the performance tradeoffs between local and cloud models
Time: 20 min | Level: Intermediate
Why This Matters
Cloud-based AI coding tools send your code to external servers for processing. This creates privacy risk: your proprietary logic, API keys, and business rules leave your control and, depending on the provider's terms, may be retained or used for training. Many companies ban these tools entirely.
Common concerns:
- Code snippets sent to cloud providers for inference
- Potential data retention in training datasets
- Compliance violations (GDPR, HIPAA, SOC 2)
- No offline use: the tools stop working without an internet connection
Local AI models solve this by running inference on your hardware. Your code never leaves your machine.
Solution
Step 1: Install Ollama
Ollama runs open-source LLMs locally with a simple API.
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows (PowerShell as Admin)
winget install Ollama.Ollama
Expected: Installation completes, ollama command available.
Start the Ollama service:
ollama serve
If it fails:
- Port 11434 already in use: check whether Ollama is already running with ps aux | grep ollama
- macOS "Unidentified developer" warning: go to System Settings → Privacy & Security and click Allow
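For a scriptable version of that check, a short Python sketch (assuming Ollama's default port, 11434) can confirm something is listening:

```python
# Quick scriptable check that the Ollama server is listening.
# Port 11434 is Ollama's default; change it if you set OLLAMA_HOST.
import socket

def ollama_running(host: str = "localhost", port: int = 11434) -> bool:
    """Return True if something accepts TCP connections on the Ollama port."""
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except OSError:
        return False

print(ollama_running())  # True once `ollama serve` is up
```

This is handy in shell scripts or CI hooks that should skip AI features when the server is down.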
Step 2: Download a Coding Model
Pull a model optimized for code. DeepSeek Coder and CodeLlama work well for most tasks.
# Best for code completion (6.7B parameters)
ollama pull deepseek-coder:6.7b
# Alternative model (7B parameters)
ollama pull codellama:7b
# Best for chat/explanations (14B parameters; needs 16GB+ RAM)
ollama pull qwen2.5-coder:14b
Why these models: DeepSeek Coder 6.7B approaches Copilot-quality autocomplete on hardware with 8GB+ RAM. Qwen 2.5 Coder handles complex refactoring better but needs more memory.
Expected: Download progress shown, model ready when complete.
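Under the hood, Ollama exposes a REST API on port 11434, and editor integrations talk to its /api/generate endpoint. A minimal Python sketch of that request (the payload fields follow Ollama's API; the model name assumes the pull above):

```python
import json
import urllib.request

# Build the JSON request Ollama's /api/generate endpoint expects.
# "stream": False asks for one complete JSON response instead of chunks.
payload = {
    "model": "deepseek-coder:6.7b",
    "prompt": "def fibonacci(n: int) -> int:",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With `ollama serve` running, sending this returns JSON whose
# "response" field holds the completion:
#   body = json.load(urllib.request.urlopen(req))
#   print(body["response"])
print(req.get_full_url())
```

Nothing here leaves localhost, which is the whole point of the setup.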
Step 3: Install Continue Extension
Continue is an open-source VS Code extension that connects to local models.
- Open VS Code
- Install "Continue" extension from marketplace
- Click Continue icon in sidebar (⌘/Ctrl+L to open chat)
First run: Continue shows a config file. We'll set this up next.
Step 4: Configure Continue for Local Models
Open Continue settings (click gear icon in Continue sidebar), or edit ~/.continue/config.json:
{
"models": [
{
"title": "DeepSeek Coder",
"provider": "ollama",
"model": "deepseek-coder:6.7b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "DeepSeek Autocomplete",
"provider": "ollama",
"model": "deepseek-coder:6.7b",
"apiBase": "http://localhost:11434"
},
"embeddingsProvider": {
"provider": "ollama",
"model": "nomic-embed-text",
"apiBase": "http://localhost:11434"
}
}
Why this config:
- models: chat interface for asking questions
- tabAutocompleteModel: inline suggestions as you type
- embeddingsProvider: code context search (optional but helpful)
Pull the embedding model:
ollama pull nomic-embed-text
Expected: Continue now uses your local models. No API keys needed.
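A typo in config.json fails silently in some setups, so a quick sanity check helps. This is an illustrative sketch (the check_continue_config helper is hypothetical, not part of Continue) that flags the two most common mistakes:

```python
import json
from pathlib import Path

def check_continue_config(cfg: dict) -> list:
    """Return a list of problems found in a Continue config dict; empty means OK."""
    problems = []
    for model in cfg.get("models", []):
        # Every Ollama entry needs a "model" naming the pulled model tag.
        if model.get("provider") == "ollama" and "model" not in model:
            problems.append(f"entry missing 'model': {model.get('title')}")
    if "tabAutocompleteModel" not in cfg:
        problems.append("no tabAutocompleteModel: inline suggestions disabled")
    return problems

# Usage against the real file:
#   cfg = json.loads((Path.home() / ".continue" / "config.json").read_text())
#   print(check_continue_config(cfg))
```

An empty list means the two basics are covered; anything else points at the entry to fix.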
Step 5: Test It Out
Autocomplete test: Type a function signature and pause. Continue suggests completions within 1-2 seconds.
def calculate_fibonacci(n: int) -> int:
# Pause here - suggestion appears
Chat test:
Open Continue chat (⌘/Ctrl+L), ask: "Explain this function" while viewing code.
If completions are slow:
- >3 seconds delay: model too large for your RAM. Use codellama:7b instead
- No suggestions: check that Ollama is running: curl http://localhost:11434/api/tags
- Errors in Continue: open the Output panel and select "Continue" to see the logs
Verification
Create a test file:
// Type this signature and pause — an inline suggestion should appear
function sortUsers(users: User[])
You should see: an inline completion suggestion within about 2 seconds, with no traffic to any external AI endpoint (verified below with tcpdump).
Check network isolation:
# Monitor network while using Continue
sudo tcpdump -i any host api.openai.com or host api.anthropic.com
# Should show: No packets captured
Performance Comparison
Local (DeepSeek Coder 6.7B):
- Latency: 1-2 seconds per completion
- Quality: ~85% of Copilot on simple tasks
- Cost: Free, uses ~6GB RAM
- Works offline: Yes
Cloud (GitHub Copilot):
- Latency: 200-500ms per completion
- Quality: Baseline reference
- Cost: $10/month
- Works offline: No
When local models struggle:
- Complex refactoring across files (use cloud for this)
- Explaining unfamiliar frameworks (cloud has more training data)
- Long-form documentation generation
Use a hybrid approach: local models for daily work, a cloud API via a separate tool for the complex tasks above.
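As an illustrative sketch of that split (the heuristic, threshold, and function name are hypothetical, not a Continue or Ollama feature), a wrapper script could decide which backend a prompt deserves:

```python
# Hypothetical helper for a hybrid setup: keep everyday prompts on the
# local Ollama endpoint (localhost:11434) and flag heavyweight tasks
# for a cloud tool. The threshold and names are illustrative only.

def pick_backend(prompt: str, files_involved: int = 1) -> str:
    """Return "local" for everyday completions, "cloud" for heavy tasks."""
    if files_involved > 1 or len(prompt) > 4000:
        return "cloud"  # multi-file refactors or very long context
    return "local"      # daily autocomplete and chat stay on-device

print(pick_backend("def fib(n):"))                 # local
print(pick_backend("refactor", files_involved=3))  # cloud
```

The exact cutoff matters less than the principle: default to local, escalate deliberately.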
Advanced: Multiple Models
Add specialized models for different tasks:
{
"models": [
{
"title": "Chat (Qwen)",
"provider": "ollama",
"model": "qwen2.5-coder:14b"
},
{
"title": "Quick Autocomplete (CodeLlama)",
"provider": "ollama",
"model": "codellama:7b"
}
]
}
Switch between them in the Continue model dropdown. Use lighter models for autocomplete and heavier ones for chat.
What You Learned
- Ollama runs open-source LLMs locally with zero configuration
- Continue provides Copilot-like features without cloud dependencies
- 6.7B-parameter models approach cloud quality for most coding tasks
- Privacy and offline work come at the cost of 1-2 second latency
Limitation: local models need 8GB+ RAM; machines with 4GB won't run them smoothly.
When NOT to use local:
- You have fast internet and no privacy concerns
- You need cutting-edge model quality (GPT-4 class)
- You're working on a low-RAM machine (<8GB)
Troubleshooting
Ollama won't start:
# Check if service is running
ollama list
# Restart service
# macOS/Linux
killall ollama && ollama serve
# Windows: quit and relaunch the Ollama tray app, or if installed as a service:
Restart-Service Ollama
Out of memory errors:
# Use smaller model
ollama pull codellama:7b
# Or increase swap (Linux)
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Continue can't connect:
# Verify Ollama API is accessible
curl http://localhost:11434/api/tags
# Should return JSON with available models
Hardware Requirements
Minimum:
- 8GB RAM
- 10GB free disk space
- CPU: Any modern processor (M1/M2, Intel i5+, AMD Ryzen 5+)
Recommended:
- 16GB RAM (for 14B models)
- SSD (faster model loading)
- GPU optional (Ollama auto-detects and uses if available)
Apple Silicon (M1/M2/M3): best performance; Ollama is optimized for Metal acceleration.
Tested on Ollama 0.1.26, Continue 0.9.x, VS Code 1.86+, macOS 14 & Ubuntu 24.04