Problem: You Need AI Coding Help Without Internet Access
Your company blocks external AI services, you work in classified environments, or you simply don't want your code sent to third-party servers. But you still want AI assistance for autocomplete, refactoring, and debugging.
You'll learn:
- How to run production-ready LLMs completely offline
- Setting up Continue.dev with local models
- Performance tuning for 16GB+ RAM systems
- Validating zero network traffic
Time: 45 min | Level: Intermediate
Why This Matters
Cloud AI tools like GitHub Copilot and ChatGPT send your code to external servers. For sensitive projects—defense, healthcare, finance, or proprietary systems—this violates security policies.
Common blockers:
- Corporate firewall blocks AI APIs
- Compliance requires air-gapped development
- Unreliable internet in remote locations
- Privacy concerns about code exposure
What you get:
- 100% local inference (verified with network monitoring)
- Code never leaves your machine
- Works offline indefinitely
- Free and open source
Prerequisites
Hardware requirements:
- Minimum: 16GB RAM, 4-core CPU, 10GB disk space
- Recommended: 32GB RAM, 8-core CPU, 50GB disk (for larger models)
- GPU: Optional but 3x faster (NVIDIA with 8GB+ VRAM)
Software:
- VS Code or Cursor
- Docker (optional, for containerized setup)
- Linux/macOS/Windows with WSL2
Download ahead (you'll need these offline):
- Ollama installer: https://ollama.com/download
- Continue extension: `.vsix` file from GitHub releases
- Model files: we'll download these before disconnecting
Solution
Step 1: Install Ollama (The Model Runtime)
Ollama runs LLMs locally like Docker runs containers.
macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows (WSL2):
curl -fsSL https://ollama.com/install.sh | sh
# Or download .exe from ollama.com/download
Verify installation:
ollama --version
# Should show: ollama version 0.1.x
Why Ollama? It handles model loading, memory management, and provides a standard API that works with all AI coding tools.
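Because that API is plain HTTP on localhost, any tool - or a few lines of your own code - can talk to it. Here is a minimal sketch using only the standard library, assuming the default port and the `/api/generate` endpoint from Ollama's API docs:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request for Ollama's REST API."""
    return {"model": model, "prompt": prompt, "stream": False}

def extract_completion(response_body: str) -> str:
    """Pull the generated text out of Ollama's JSON response."""
    return json.loads(response_body)["response"]

def generate(model: str, prompt: str) -> str:
    """POST the request to the local Ollama server and return the completion."""
    data = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_completion(resp.read().decode("utf-8"))

# Usage (with `ollama serve` running):
#   generate("codestral:latest", "Write a function to reverse a string")
```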
Step 2: Download Models While Online
Pull models before going offline. We'll use Codestral (22B parameters, Mistral's dedicated code model).
# Primary coding model (~12GB download)
ollama pull codestral:latest
# Fallback lightweight model (~4GB)
ollama pull deepseek-coder:6.7b-base
# Verify downloads
ollama list
Expected output:
NAME                       SIZE      MODIFIED
codestral:latest           12 GB     2 minutes ago
deepseek-coder:6.7b-base   4.1 GB    5 minutes ago
Model selection guide:
- codestral:latest (22B) - Best code completion; needs ~16GB RAM
- deepseek-coder:6.7b-base (6.7B) - Fast; good fit for 16GB systems
- qwen2.5-coder:7b - Alternative, strong at Python/JS
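If you're unsure which model fits your machine, the guide above can be encoded as a small helper. This is a hypothetical convenience function - the thresholds are the rules of thumb from this list, not hard limits:

```python
def pick_model(ram_gb: int, has_gpu: bool = False) -> str:
    """Suggest an Ollama model tag for the available memory.

    Rule of thumb: a Q4-quantized model needs roughly half its parameter
    count in GB, plus headroom for the OS and editor.
    """
    if ram_gb >= 32 or (has_gpu and ram_gb >= 16):
        return "codestral:latest"          # 22B, best completions
    if ram_gb >= 16:
        return "deepseek-coder:6.7b-base"  # fits comfortably in 16GB
    return "deepseek-coder:1.3b"           # last resort for small machines
```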
If download fails:
- Timeout errors: retry the `ollama pull`; partial downloads resume where they left off
- Out of space: remove unused Docker images (`docker system prune`)
- Slow network: download overnight; models are 4-15GB
Step 3: Test Local Inference
# Start interactive chat
ollama run codestral
Try this prompt:
Write a Python function to validate email addresses
You should see: Generated code in ~2-5 seconds. Press Ctrl+D to exit.
Performance check:
# Monitor resource usage
top # Linux/Mac
# Or: Task Manager on Windows
# Look for 'ollama' process using 4-8GB RAM
Step 4: Install Continue.dev Extension
Continue integrates Ollama into VS Code for autocomplete and chat.
Online method (easiest):
- Open VS Code
- Extensions → Search "Continue"
- Install → Reload
Offline method (air-gapped):
# Download .vsix from GitHub (while online)
wget https://github.com/continuedev/continue/releases/latest/download/continue.vsix
# Install offline
code --install-extension continue.vsix
Step 5: Configure Continue for Ollama
Press Cmd+Shift+P (Mac) or Ctrl+Shift+P (Windows) → "Continue: Open config.json"
Replace with this config:
{
"models": [
{
"title": "Codestral Local",
"provider": "ollama",
"model": "codestral:latest",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Codestral Autocomplete",
"provider": "ollama",
"model": "codestral:latest",
"apiBase": "http://localhost:11434"
},
"allowAnonymousTelemetry": false,
"disableIndexing": false
}
Why these settings:
- `apiBase: http://localhost:11434` - Ollama's default local endpoint (no internet)
- `allowAnonymousTelemetry: false` - zero data collection
- Same model for chat and autocomplete (simplicity)
Save and reload VS Code.
Step 6: Test Autocomplete
Create a new file: test.py
Type this (wait 1-2 seconds after typing):
def calculate_fibonacci(n):
    #
Expected: Autocomplete suggests function implementation. Press Tab to accept.
If no suggestions:
- Check Ollama is running: `ps aux | grep ollama`
- Test the API: `curl http://localhost:11434/api/tags`
- Restart VS Code
- Check the Continue output panel for errors
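The first of these checks can be scripted. A small sketch using only the standard library; 11434 is Ollama's default port:

```python
import socket

def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With `ollama serve` running, this reports True:
#   port_open("127.0.0.1", 11434)
```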
Step 7: Verify Air-Gap (Critical Security Step)
Network monitoring (Linux/Mac):
# Watch all traffic except loopback
sudo tcpdump -i any -n | grep -v '127.0.0.1'
# In another Terminal, use Continue
# You should see ZERO external traffic, only localhost (127.0.0.1)
Windows:
- Open Resource Monitor → Network tab
- Use Continue to generate code
- Verify no `ollama.exe` or `Code.exe` network activity except localhost
What you're checking:
- ✅ Only `127.0.0.1:11434` traffic (local Ollama)
- ❌ No connections to openai.com, anthropic.com, github.com
- ❌ No DNS lookups to external domains
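If you capture remote addresses yourself (e.g., by parsing `netstat` output), classifying them is a one-liner with the standard library. The external IP below is just an illustrative value:

```python
import ipaddress

def is_local(addr: str) -> bool:
    """True if the address is loopback - traffic that never leaves the machine."""
    return ipaddress.ip_address(addr).is_loopback

def audit(remote_addrs: list[str]) -> list[str]:
    """Return the remote addresses that are NOT local - these need explaining."""
    return [addr for addr in remote_addrs if not is_local(addr)]
```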
Now disconnect internet - everything should still work.
Step 8: Optimize Performance
If autocomplete is slow (>3 seconds):
# Keep the model in memory between requests instead of reloading it
export OLLAMA_KEEP_ALIVE=30m
# Allow chat and autocomplete requests to run in parallel
export OLLAMA_NUM_PARALLEL=2
# Restart Ollama so the settings take effect
pkill ollama
ollama serve &
GPU acceleration (NVIDIA only):
# Verify the model is running on the GPU
ollama ps
# The PROCESSOR column should show "100% GPU"
RAM tuning:
// Continue config.json
{
"tabAutocompleteModel": {
"model": "codestral:latest",
"completionOptions": {
"maxTokens": 256, // Reduce if running out of memory
"temperature": 0.2 // Lower = more deterministic
}
}
}
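For context on what these options do downstream: Continue passes them through to Ollama, where `maxTokens` corresponds to Ollama's `num_predict` option. A sketch of the equivalent raw request body - the mapping is my reading of the two projects' docs, so verify it against your versions:

```python
import json

def completion_payload(model: str, prompt: str,
                       max_tokens: int = 256, temperature: float = 0.2) -> str:
    """Serialize a generate request with Continue-style tuning options.

    Continue's maxTokens corresponds to Ollama's num_predict.
    """
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_predict": max_tokens,   # cap output length (memory/latency)
            "temperature": temperature,  # lower = more deterministic
        },
    })
```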
Advanced: Containerized Air-Gap Setup
For maximum isolation:
# Dockerfile
FROM ollama/ollama:latest
# Copy pre-downloaded models
COPY models/ /root/.ollama/models/
EXPOSE 11434
CMD ["serve"]
# Build image (while online)
docker build -t air-gapped-ai .
# Export for offline transfer
docker save air-gapped-ai > ai-image.tar
# Load on air-gapped machine
docker load < ai-image.tar
docker run -d -p 11434:11434 air-gapped-ai
Use case: Deploy identical AI setup across multiple secure workstations.
Verification
Test checklist:
Autocomplete works:
def reverse_string(s):
    # Should suggest: return s[::-1]
Chat works:
- Open Continue sidebar (Cmd+L)
- Ask: "Explain this code"
- Get a response in <5 seconds
Zero network:
- Disconnect WiFi/ethernet
- Everything still functions
- Check with `netstat -an | grep ESTABLISHED` (no external connections)
Performance:
- Autocomplete appears within 2 seconds
- Chat responses stream in real-time
- RAM usage stable (not constantly growing)
What You Learned
Security:
- Validated zero data leaves your machine
- Code never touches external servers
- Complies with air-gap requirements
Performance:
- 7B-22B local models handle most everyday completion tasks well
- GPU acceleration makes it near-instant
- Works on older hardware (16GB+ RAM)
Limitations:
- Not as smart as GPT-4 or Claude (but private)
- Initial model download requires internet
- Slower than cloud APIs on CPU-only systems
When NOT to use this:
- You need cutting-edge reasoning (use Claude/GPT with data agreements)
- Hardware constraints (<16GB RAM)
- Team collaboration requires shared model (consider on-prem serving)
Troubleshooting
Autocomplete not triggering:
# Check Continue logs
Cmd+Shift+P → "Developer: Show Logs" → Select "Continue"
# Common fix: Restart language server
Cmd+Shift+P → "Reload Window"
Ollama crashes/OOM:
# Use smaller model
ollama pull deepseek-coder:1.3b # Only 1.3GB
# Or increase swap
sudo fallocate -l 16G /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Slow first inference:
- Normal - models load into RAM on first request
- Subsequent requests are fast
- Keep Ollama running (`ollama serve` in the background)
Resources
Essential reading:
- Ollama docs: https://github.com/ollama/ollama/blob/main/docs/README.md
- Continue config reference: https://continue.dev/docs/reference/config
- Model leaderboard: https://evalplus.github.io/leaderboard.html
Security validation:
- NIST air-gap guidelines: https://csrc.nist.gov/glossary/term/air_gap
- Verify hashes of downloaded models: `sha256sum ~/.ollama/models/blobs/*`
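Hash checking can be scripted for repeatable audits. A sketch using only the standard library; the expected hashes must come from a manifest you already trust - how you obtain that manifest is up to your security process:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 (model blobs are multi-GB; don't slurp)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_blobs(blob_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return names of blobs whose hash does not match the trusted manifest."""
    return [name for name, want in expected.items()
            if sha256_file(blob_dir / name) != want]
```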
Tested on: Ubuntu 24.04, macOS 14.3, Windows 11 WSL2 | Ollama 0.1.26 | Continue 0.8.x | February 2026
Security notice: This setup prevents code exfiltration but doesn't protect against model extraction attacks. For classified environments, validate models come from trusted sources and store them on encrypted volumes.