Running a Local AI Coding Assistant with Ollama and Continue.dev: No API Key Required

Set up CodeLlama or DeepSeek-Coder via Ollama as a fully local coding assistant in VS Code or Cursor — covering model selection, context window configuration, and code completion benchmarks.

GitHub Copilot costs $10–$19/month depending on tier and sends your code to Microsoft's servers. CodeLlama on Ollama costs $0 and runs entirely on your machine — here's the setup that actually works.

You’ve probably already pulled a model with ollama run and felt the smug satisfaction of a local AI chat. Then you opened your IDE and realized the gap between a terminal chatbot and a real coding assistant is wider than the Grand Canyon. Tabnine’s free tier is anemic, and you’re not about to pipe your proprietary code through an API. The solution is to wire your local Ollama powerhouse directly into your editor. This guide gets you there, skipping the fluff and fixing the exact errors you’ll hit.

Picking Your Digital Intern: CodeLlama, DeepSeek-Coder, or Qwen2.5-Coder?

Your first mistake is pulling the generic llama3.1 for coding. It’s a brilliant generalist, but you need a specialist. The model you choose dictates your entire experience—speed, quality, and whether you’ll run out of VRAM before your next coffee.

  • CodeLlama (Meta): The established workhorse. The 34B parameter version is the sweet spot, scoring 53.7% on HumanEval vs. GPT-4's 67%. That’s more than enough for boilerplate generation, debugging, and docstring writing. It’s your reliable daily driver.
  • DeepSeek-Coder (DeepSeek-AI): The new contender that punches above its weight. Its 33B model often outperforms CodeLlama 34B on reasoning benchmarks and has stellar multi-language support. It’s hungry for context, though—the 64K window variants are memory hogs.
  • Qwen2.5-Coder (Alibaba): The efficiency expert. The 7B model is shockingly capable for its size, making it perfect if you’re on a laptop or have less than 8GB of VRAM. It’s the best “get something decent running now” option.

Here’s the cold, hard data to inform your choice. Remember, 70% of self-hosted LLM users cite data privacy as their primary reason (a16z AI survey 2025), so none of these phone home.

| Model & Variant | Size | Key Strength | VRAM (4-bit) | Speed (RTX 4090) | Best For |
|---|---|---|---|---|---|
| CodeLlama 34B | 34B | Reliable, balanced performance | ~20 GB | ~45 tok/s | Daily coding, boilerplate, debugging |
| DeepSeek-Coder 33B | 33B | Strong reasoning, huge context | ~22 GB | ~40 tok/s | Complex refactors, long code files |
| Qwen2.5-Coder 7B | 7B | Extremely efficient, fast | ~5 GB | ~120 tok/s | Laptops, quick completions, low-resource |
| phi-3-mini 3.8B | 3.8B | Tiny but mighty (69% on MMLU) | ~3 GB | ~180 tok/s | Embedded/low-power systems, simple scripts |

The Verdict: Start with codellama:34b-instruct-q4_K_M. It’s the best balance of intelligence and resource use. If you’re on a MacBook or have less VRAM, go for qwen2.5-coder:7b-instruct-q4_K_M.

Installing Ollama and Pulling the Correct Model Variant

You have Ollama installed. Great. Now let’s pull the right model variant, not the default. This is where most people waste hours of download time and disk space.

Open your terminal. Do not pull the base tag. The base codellama pulls the 7B model, which is often not what you want. Be specific.


# WRONG: this pulls the default 7B model, not the 34B you want
ollama pull codellama

# RIGHT: This pulls the quantized 34B instruct model you actually want
ollama pull codellama:34b-instruct-q4_K_M

The suffix q4_K_M is crucial. It’s a 4-bit quantization that cuts the model size (and VRAM requirement) by over half with minimal quality loss. For Mistral 7B, the difference is stark: fp16 needs 14GB VRAM, but the 4-bit quant needs only 5GB. This is the difference between running and crashing.
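That fp16-vs-4-bit gap is just arithmetic. Here's a back-of-the-envelope sketch (weights only — the KV cache and runtime overhead add more on top, which is why the 4-bit figure above is 5 GB rather than 3.5 GB):

```python
def approx_weight_gb(params_billion: float, bits: int) -> float:
    """Weight-only memory estimate: parameters * (bits / 8) bytes, in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

# Mistral 7B, fp16 vs. 4-bit quantization
print(approx_weight_gb(7, 16))  # 14.0 GB of weights at fp16
print(approx_weight_gb(7, 4))   # 3.5 GB of weights at 4-bit
```

The same arithmetic explains the table above: CodeLlama 34B at 4-bit is ~17 GB of weights, landing at ~20 GB once the runtime is loaded.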

🚨 Real Error & Fix:

Error: model 'llama3' not found

Fix: The model library has evolved. You likely need the version suffix. Run ollama pull llama3.1:8b-instruct-q4_K_M (note the .1 and the specific variant).

After the pull, verify it’s there and run a quick smoke test.

ollama list
ollama run codellama:34b-instruct-q4_K_M "Write a Python function to reverse a linked list."

If it responds coherently, your model engine is running. Now we need to connect it to the cockpit.

Configuring Continue.dev to Point at Your Local Ollama Server

Continue.dev is the open-source VS Code extension that acts as a universal bridge between your editor and any LLM. It’s the control panel GitHub Copilot never gave you. Install it from the VS Code marketplace.

The magic happens in Continue’s configuration. Open the command palette (Ctrl+Shift+P), type Continue: Open Config, and create or edit the ~/.continue/config.json file.

Here’s the exact configuration that works. The apiBase field points at your Ollama server; http://localhost:11434 is the default for a local install, so change it only if Ollama runs elsewhere.

{
  "models": [
    {
      "title": "Local CodeLlama",
      "provider": "ollama",
      "model": "codellama:34b-instruct-q4_K_M",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Local CodeLlama",
    "provider": "ollama",
    "model": "codellama:34b-instruct-q4_K_M",
    "apiBase": "http://localhost:11434"
  }
}

Key Points:

  1. provider: Must be "ollama".
  2. apiBase: Must point to localhost:11434 (Ollama’s default REST API port).
  3. tabAutocompleteModel: This is critical. This setting tells Continue to use your local model for inline code completions (the Copilot-like suggestions), not just the chat panel.

🚨 Real Error & Fix:

Connection refused on port 11434

Fix: The Ollama server isn’t running. Either run ollama serve in a terminal first, or if you’re on Linux, check systemctl status ollama and start it with sudo systemctl start ollama.

Restart VS Code. You should now see "Local CodeLlama" selected in the Continue sidebar. Try asking it a question in the chat. Then, start typing in a code file—you should get local, private autocomplete suggestions.

Context Window Tuning: The Speed vs. Intelligence Knob

Your model has a context window (e.g., 4096, 8192, 32k tokens). This is the short-term memory you pay for with every generation. Sending the entire 10,000-line package-lock.json file to a 4K context model will fail. Sending a 100-line function to a 64K model will be slow to start.

Continue.dev lets you control this. In your config.json, you can add a contextLength parameter to the model definition. Ollama API first-token latency is ~300ms local vs ~800ms for GPT-4o API, but that advantage evaporates if you’re waiting for a huge context to load.

{
  "models": [
    {
      "title": "Local DeepSeek-Coder",
      "provider": "ollama",
      "model": "deepseek-coder:33b-instruct-q4_K_M",
      "apiBase": "http://localhost:11434",
      "contextLength": 16384
    }
  ]
}

The Tradeoff: Larger context = the model can see more of your project (better answers) = slower prompt processing and higher memory use. For most single-file tasks, 8K is plenty. For cross-file refactoring, bump it to 16K or 32K if your model and hardware support it.
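To decide whether a file fits your window, the common rough heuristic of ~4 characters per token is good enough (a sketch — real tokenizers vary by language and code style; estimate_tokens and fits_context are hypothetical helpers, not part of Continue or Ollama):

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English and code."""
    return max(1, len(text) // 4)

def fits_context(text: str, context_length: int, reserve: int = 1024) -> bool:
    """True if the text fits, leaving `reserve` tokens for the model's reply."""
    return estimate_tokens(text) + reserve <= context_length

snippet = "def add(a, b):\n    return a + b\n" * 200   # ~6,400 characters
print(estimate_tokens(snippet))        # ~1,600 tokens
print(fits_context(snippet, 8192))     # comfortably inside an 8K window
```

By this estimate, a 10,000-line lockfile at ~40 characters per line is ~100K tokens — far beyond a 4K window, exactly the failure mode described above.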

Pro Tip: If you experience a slow first response (~30s) after idle time, it’s because Ollama unloads the model. Set the environment variable OLLAMA_KEEP_ALIVE=24h to keep it warm in memory.

Benchmark: How Does Local CodeLlama Actually Stack Up Against Copilot?

Let’s move beyond hype. On the HumanEval benchmark (solving Python coding problems), CodeLlama 34B scores 53.7% vs. GPT-4's 67%. In practice, this means:

  • Copilot (GPT-4 based) is better at novel, complex algorithm design.
  • Local CodeLlama is excellent at pattern-matching, boilerplate generation, debugging known error types, and writing documentation. It’s 95% as good for 80% of daily tasks.

The real win isn’t just the monthly subscription you save. It’s the latency and control. Llama 3.1 8B runs at ~45 tokens/sec on an M3 Pro and ~120 tokens/sec on an RTX 4090 — for long outputs, roughly twice the effective throughput of the GPT-4 API. The tokens stream instantly from your GPU, with no network jitter.
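To put those throughput numbers in wall-clock terms, here's the arithmetic for a long completion (illustrative only, using the figures quoted above):

```python
def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to stream `tokens` at a given generation speed."""
    return tokens / tokens_per_sec

long_reply = 900  # e.g. a generated module with tests and docstrings
print(seconds_for(long_reply, 45))    # 20.0 s at the M3 Pro figure
print(seconds_for(long_reply, 120))   # 7.5 s at the RTX 4090 figure
```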

Language-Specific Tuning: Python, TypeScript, and Rust Configs

Generic models work, but you can tune them to be experts in your stack. Use an Ollama Modelfile to create a custom model variant. This is an advanced but powerful step.

Create a file named Codellama-34B-Python.Modelfile:

FROM codellama:34b-instruct-q4_K_M
# Set a system prompt to specialize the model
SYSTEM """You are an expert Python senior engineer. You write clean, PEP-8 compliant, production-ready code with type hints and comprehensive docstrings. You prefer the standard library and explain your choices briefly."""
# Lower temperature for more deterministic, factual code
PARAMETER temperature 0.2
PARAMETER num_ctx 8192

Create and use your custom model:

ollama create my-python-coder -f ./Codellama-34B-Python.Modelfile
ollama run my-python-coder

Then, point your config.json model field to "my-python-coder". Do the same for TypeScript (emphasizing ES6+, async/await, and avoiding any types) or Rust (focusing on memory safety and idiomatic patterns). This focuses the model’s energy, yielding more precise completions.

When Local Beats Cloud (And When to Swallow the Pill and Pay)

This setup is not a silver bullet. Know its limits.

Stick with Local Ollama when:

  • Privacy is non-negotiable: Your code never leaves your machine.
  • You’re offline: Planes, trains, or sketchy coffee shop Wi-Fi.
  • Cost is a factor: Running Llama 3.1 8B locally costs $0 vs. up to ~$0.06/1K output tokens on the GPT-4 API. For heavy users, this adds up fast.
  • You need instant, long completions: No API rate limits, no network lag.

Reach for the API key when:

  • You need state-of-the-art reasoning: For a truly novel, complex problem, GPT-4 or Claude Opus still holds an edge.
  • You lack the hardware: Trying to run a 70B model without sufficient VRAM is a recipe for frustration. (VRAM OOM with 70B model? The fix is to use the quantized variant: ollama run llama3.1:70b-instruct-q4_K_M for ~40GB VRAM).
  • You need multimodal: Local vision models are still playing catch-up.

Next Steps: From Assistant to Full Workflow

You now have a private, capable coding assistant. Ollama hit 5M downloads for a reason—it’s the simplest on-ramp to local AI. But this is just the start.

  1. Experiment with Models: The library now supports 150+ models. Try deepseek-coder:33b for its reasoning or phi-3-mini for a shockingly good 3.8B model.
  2. Integrate with Your Tools: Use the Ollama REST API (curl http://localhost:11434/api/generate) with scripts, or hook it into LangChain or LlamaIndex for more complex document-based Q&A about your codebase.
  3. Try Other Frontends: Continue.dev is fantastic for VS Code. For a browser-based chat interface, try Open WebUI. For a desktop app, check out Jan.ai.
  4. Push the Hardware: If you have the GPU, experiment with larger, unquantized models for the highest quality. The difference between 4-bit and 8-bit can be noticeable for intricate tasks.
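For step 2, the /api/generate endpoint takes a simple JSON body. A minimal sketch, assuming a local Ollama server on the default port (generate_payload is a hypothetical helper, and the model tag must match one you've pulled):

```python
import json

def generate_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    """JSON body for POST http://localhost:11434/api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

body = generate_payload("codellama:34b-instruct-q4_K_M",
                        "Write a Python function to reverse a linked list.")

# To actually send it (requires a running `ollama serve`):
#   import urllib.request
#   req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
#                                headers={"Content-Type": "application/json"})
#   print(json.loads(urllib.request.urlopen(req).read())["response"])
print(json.loads(body)["model"])
```

Setting stream to false returns one JSON object instead of a token-by-token stream, which is easier to handle in simple scripts.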

Your development loop is now fundamentally different. Your completions are free, private, and limited only by your hardware. Your GPU is finally doing the work it was built for, right where the code lives.