Problem: You Need AI Coding Help Without Internet Access
Your company blocks external AI services, you work in classified environments, or you simply don't want your code sent to third-party servers. But you still want AI assistance for autocomplete, refactoring, and debugging.
You'll learn:
- How to run production-ready LLMs completely offline
- Setting up Continue.dev with local models
- Performance tuning for 16GB+ RAM systems
- Validating zero network traffic
Time: 45 min | Level: Intermediate
Why This Matters
Cloud AI tools like GitHub Copilot and ChatGPT send your code to external servers. For sensitive projects—defense, healthcare, finance, or proprietary systems—this violates security policies.
Common blockers:
- Corporate firewall blocks AI APIs
- Compliance requires air-gapped development
- Unreliable internet in remote locations
- Privacy concerns about code exposure
What you get:
- 100% local inference (verified with network monitoring)
- Code never leaves your machine
- Works offline indefinitely
- Free and open source
Prerequisites
Hardware requirements:
- Minimum: 16GB RAM, 4-core CPU, 10GB disk space
- Recommended: 32GB RAM, 8-core CPU, 50GB disk (for larger models)
- GPU: Optional but 3x faster (NVIDIA with 8GB+ VRAM)
Software:
- VS Code or Cursor
- Docker (optional, for containerized setup)
- Linux/macOS/Windows with WSL2
Download ahead (you'll need these offline):
- Ollama installer: https://ollama.com/download
- Continue extension: `.vsix` file from GitHub releases
- Model files: we'll download these before disconnecting
Solution
Step 1: Install Ollama (The Model Runtime)
Ollama runs LLMs locally like Docker runs containers.
macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows (WSL2):
curl -fsSL https://ollama.com/install.sh | sh
# Or download .exe from ollama.com/download
Verify installation:
ollama --version
# Should show: ollama version 0.1.x
Why Ollama? It handles model loading, memory management, and provides a standard API that works with all AI coding tools.
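Because that API is plain HTTP on localhost, any tool - or a few lines of your own code - can talk to it. Here is a minimal sketch using only the standard library, assuming the default port and the `/api/generate` endpoint from Ollama's API docs:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request for Ollama's REST API."""
    return {"model": model, "prompt": prompt, "stream": False}

def extract_completion(response_body: str) -> str:
    """Pull the generated text out of Ollama's JSON response."""
    return json.loads(response_body)["response"]

def generate(model: str, prompt: str) -> str:
    """POST the request to the local Ollama server and return the completion."""
    data = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_completion(resp.read().decode("utf-8"))

# Usage (with `ollama serve` running):
#   generate("codestral:latest", "Write a function to reverse a string")
```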
Step 2: Download Models While Online
Pull models before going offline. We'll use Codestral (22B parameters, Mistral's dedicated code model).
# Primary coding model (~12GB download)
ollama pull codestral:latest
# Fallback lightweight model (~4GB)
ollama pull deepseek-coder:6.7b-base
# Verify downloads
ollama list
Expected output:
NAME                       SIZE      MODIFIED
codestral:latest           12 GB     2 minutes ago
deepseek-coder:6.7b-base   4.1 GB    5 minutes ago
Model selection guide:
- codestral:latest (22B) - Best code completion; needs ~16GB RAM
- deepseek-coder:6.7b-base (6.7B) - Fast; good fit for 16GB systems
- qwen2.5-coder:7b - Alternative, strong at Python/JS
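If you're unsure which model fits your machine, the guide above can be encoded as a small helper. This is a hypothetical convenience function - the thresholds are the rules of thumb from this list, not hard limits:

```python
def pick_model(ram_gb: int, has_gpu: bool = False) -> str:
    """Suggest an Ollama model tag for the available memory.

    Rule of thumb: a Q4-quantized model needs roughly half its parameter
    count in GB, plus headroom for the OS and editor.
    """
    if ram_gb >= 32 or (has_gpu and ram_gb >= 16):
        return "codestral:latest"          # 22B, best completions
    if ram_gb >= 16:
        return "deepseek-coder:6.7b-base"  # fits comfortably in 16GB
    return "deepseek-coder:1.3b"           # last resort for small machines
```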
If download fails:
- Timeout errors: retry the `ollama pull`; partial downloads resume where they left off
- Out of space: remove unused Docker images (`docker system prune`)
- Slow network: download overnight; models are 4-15GB
Step 3: Test Local Inference
# Start interactive chat
ollama run codestral
Try this prompt:
Write a Python function to validate email addresses
You should see: Generated code in ~2-5 seconds. Press Ctrl+D to exit.
Performance check:
# Monitor resource usage
top # Linux/Mac
# Or: Task Manager on Windows
# Look for 'ollama' process using 4-8GB RAM
Step 4: Install Continue.dev Extension
Continue integrates Ollama into VS Code for autocomplete and chat.
Online method (easiest):
- Open VS Code
- Extensions → Search "Continue"
- Install → Reload
Offline method (air-gapped):
# Download .vsix from GitHub (while online)
wget https://github.com/continuedev/continue/releases/latest/download/continue.vsix
# Install offline
code --install-extension continue.vsix
Step 5: Configure Continue for Ollama
Press Cmd+Shift+P (Mac) or Ctrl+Shift+P (Windows) → "Continue: Open config.json"
Replace with this config:
{
"models": [
{
"title": "Codestral Local",
"provider": "ollama",
"model": "codestral:latest",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Codestral Autocomplete",
"provider": "ollama",
"model": "codestral:latest",
"apiBase": "http://localhost:11434"
},
"allowAnonymousTelemetry": false,
"disableIndexing": false
}
Why these settings:
- `apiBase: http://localhost:11434` - Ollama's default local endpoint (no internet)
- `allowAnonymousTelemetry: false` - zero data collection
- Same model for chat and autocomplete (simplicity)
Save and reload VS Code.
Step 6: Test Autocomplete
Create a new file: test.py
Type this (wait 1-2 seconds after typing):
def calculate_fibonacci(n):
    #
Expected: Autocomplete suggests function implementation. Press Tab to accept.
If no suggestions:
- Check Ollama is running: `ps aux | grep ollama`
- Test the API: `curl http://localhost:11434/api/tags`
- Restart VS Code
- Check the Continue output panel for errors
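The first of these checks can be scripted. A small sketch using only the standard library; 11434 is Ollama's default port:

```python
import socket

def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With `ollama serve` running, this reports True:
#   port_open("127.0.0.1", 11434)
```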
Step 7: Verify Air-Gap (Critical Security Step)
Network monitoring (Linux/Mac):
# Watch all traffic except loopback
sudo tcpdump -i any -n | grep -v '127.0.0.1'
# In another Terminal, use Continue
# You should see ZERO external traffic, only localhost (127.0.0.1)
Windows:
- Open Resource Monitor → Network tab
- Use Continue to generate code
- Verify no `ollama.exe` or `Code.exe` network activity except localhost
What you're checking:
- ✅ Only `127.0.0.1:11434` traffic (local Ollama)
- ❌ No connections to openai.com, anthropic.com, github.com
- ❌ No DNS lookups to external domains
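If you capture remote addresses yourself (e.g., by parsing `netstat` output), classifying them is a one-liner with the standard library. The external IP below is just an illustrative value:

```python
import ipaddress

def is_local(addr: str) -> bool:
    """True if the address is loopback - traffic that never leaves the machine."""
    return ipaddress.ip_address(addr).is_loopback

def audit(remote_addrs: list[str]) -> list[str]:
    """Return the remote addresses that are NOT local - these need explaining."""
    return [addr for addr in remote_addrs if not is_local(addr)]
```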
Now disconnect internet - everything should still work.
Step 8: Optimize Performance
If autocomplete is slow (>3 seconds):
# Keep the model in memory between requests instead of reloading it
export OLLAMA_KEEP_ALIVE=30m
# Allow chat and autocomplete requests to run in parallel
export OLLAMA_NUM_PARALLEL=2
# Restart Ollama so the settings take effect
pkill ollama
ollama serve &
GPU acceleration (NVIDIA only):
# Verify the model is running on the GPU
ollama ps
# The PROCESSOR column should show "100% GPU"
RAM tuning:
// Continue config.json
{
"tabAutocompleteModel": {
"model": "codestral:latest",
"completionOptions": {
"maxTokens": 256, // Reduce if running out of memory
"temperature": 0.2 // Lower = more deterministic
}
}
}
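For context on what these options do downstream: Continue passes them through to Ollama, where `maxTokens` corresponds to Ollama's `num_predict` option. A sketch of the equivalent raw request body - the mapping is my reading of the two projects' docs, so verify it against your versions:

```python
import json

def completion_payload(model: str, prompt: str,
                       max_tokens: int = 256, temperature: float = 0.2) -> str:
    """Serialize a generate request with Continue-style tuning options.

    Continue's maxTokens corresponds to Ollama's num_predict.
    """
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_predict": max_tokens,   # cap output length (memory/latency)
            "temperature": temperature,  # lower = more deterministic
        },
    })
```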
Advanced: Containerized Air-Gap Setup
For maximum isolation:
# Dockerfile
FROM ollama/ollama:latest
# Copy pre-downloaded models
COPY models/ /root/.ollama/models/
EXPOSE 11434
CMD ["serve"]
# Build image (while online)
docker build -t air-gapped-ai .
# Export for offline transfer
docker save air-gapped-ai > ai-image.tar
# Load on air-gapped machine
docker load < ai-image.tar
docker run -d -p 11434:11434 air-gapped-ai
Use case: Deploy identical AI setup across multiple secure workstations.
Verification
Test checklist:
Autocomplete works:
def reverse_string(s):
    # Should suggest: return s[::-1]
Chat works:
- Open Continue sidebar (Cmd+L)
- Ask: "Explain this code"
- Get a response in <5 seconds
Zero network:
- Disconnect WiFi/ethernet
- Everything still functions
- Check with `netstat -an | grep ESTABLISHED` (no external connections)
Performance:
- Autocomplete appears within 2 seconds
- Chat responses stream in real-time
- RAM usage stable (not constantly growing)
What You Learned
Security:
- Validated zero data leaves your machine
- Code never touches external servers
- Complies with air-gap requirements
Performance:
- 7B-22B local models handle most everyday completion tasks well
- GPU acceleration makes it near-instant
- Works on older hardware (16GB+ RAM)
Limitations:
- Not as smart as GPT-4 or Claude (but private)
- Initial model download requires internet
- Slower than cloud APIs on CPU-only systems
When NOT to use this:
- You need cutting-edge reasoning (use Claude/GPT with data agreements)
- Hardware constraints (<16GB RAM)
- Team collaboration requires shared model (consider on-prem serving)
Troubleshooting
Autocomplete not triggering:
# Check Continue logs
Cmd+Shift+P → "Developer: Show Logs" → Select "Continue"
# Common fix: Restart language server
Cmd+Shift+P → "Reload Window"
Ollama crashes/OOM:
# Use smaller model
ollama pull deepseek-coder:1.3b # Only 1.3GB
# Or increase swap
sudo fallocate -l 16G /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Slow first inference:
- Normal - models load into RAM on first request
- Subsequent requests are fast
- Keep Ollama running (`ollama serve` in the background)
Resources
Essential reading:
- Ollama docs: https://github.com/ollama/ollama/blob/main/docs/README.md
- Continue config reference: https://continue.dev/docs/reference/config
- Model leaderboard: https://evalplus.github.io/leaderboard.html
Security validation:
- NIST air-gap guidelines: https://csrc.nist.gov/glossary/term/air_gap
- Verify hashes of downloaded models: `sha256sum ~/.ollama/models/blobs/*`
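Hash checking can be scripted for repeatable audits. A sketch using only the standard library; the expected hashes must come from a manifest you already trust - how you obtain that manifest is up to your security process:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 (model blobs are multi-GB; don't slurp)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_blobs(blob_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return names of blobs whose hash does not match the trusted manifest."""
    return [name for name, want in expected.items()
            if sha256_file(blob_dir / name) != want]
```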
Tested on: Ubuntu 24.04, macOS 14.3, Windows 11 WSL2 | Ollama 0.1.26 | Continue 0.8.x | February 2026
Security notice: This setup prevents code exfiltration but doesn't protect against model extraction attacks. For classified environments, validate models come from trusted sources and store them on encrypted volumes.