Run Your Own AI Coding Assistant on a $300 Server

Build a self-hosted Devin alternative using open-source LLMs on used enterprise hardware. Complete setup in 2 hours.

Problem: AI Coding Tools Can Cost $500+/Month

Devin, Cursor Pro, and GitHub Copilot Enterprise charge premium prices while sending your code to external servers. You need AI assistance without the monthly bill or privacy concerns.

You'll learn:

  • How to run DeepSeek-Coder-V2 or Qwen2.5-Coder locally
  • Setting up Ollama + Continue.dev for VS Code integration
  • Hardware requirements for smooth inference (8GB VRAM minimum, 12GB recommended)

Time: 2 hours | Level: Advanced


Why This Works in 2026

Open-source coding models now approach GPT-4 quality on many coding tasks. DeepSeek-Coder-V2-Lite (16B parameters) and Qwen2.5-Coder-32B run efficiently on consumer hardware with 4-bit quantization.

Common use cases:

  • Code completion and refactoring (works offline)
  • Debug assistance without sending logs externally
  • Learning projects where cloud costs don't make sense

What you need:

  • Used server with NVIDIA GPU (8GB+ VRAM)
  • 32GB+ system RAM
  • 500GB SSD storage
  • Basic Linux knowledge

Hardware: The $300 Sweet Spot

Option A: Dell PowerEdge R730 ($200-300)

  • 2x Xeon E5-2680 v4 (28 cores total)
  • 64GB DDR4 ECC RAM
  • Add: NVIDIA RTX 3060 12GB ($150 used)
  • Total: ~$350

Option B: HP Z440 Workstation ($150-200)

  • Xeon E5-1650 v4 (6 cores)
  • 32GB DDR4 RAM
  • Add: RTX 3060 12GB or AMD RX 6600 XT 8GB
  • Total: ~$300

Why these work:

  • PCIe 3.0 x16 for GPU (no bottleneck)
  • ECC RAM prevents memory corruption during long inference
  • Dual PSU options for power headroom

Power consumption: 150-250W under load (~$13-22/month running 24/7 at $0.12/kWh)
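The power figure above is straight wattage × hours × rate arithmetic. A quick sketch you can rerun with your own numbers (the 200W draw and $0.12/kWh rate are assumptions to adjust):

```shell
# Back-of-envelope monthly power cost, assuming the server runs 24/7.
WATTS=200              # average draw -- adjust for your hardware
RATE_CENTS=12          # $0.12/kWh expressed in cents
HOURS_PER_MONTH=730
# kWh/month = W/1000 * hours; cost = kWh * rate
KWH=$(awk -v w=$WATTS -v h=$HOURS_PER_MONTH 'BEGIN { printf "%.0f", w/1000*h }')
COST=$(awk -v k=$KWH -v r=$RATE_CENTS 'BEGIN { printf "%.2f", k*r/100 }')
echo "~${KWH} kWh/month, ~\$${COST}/month"
```

At 200W this works out to roughly 146 kWh and $17.50 per month; the 150-250W load range brackets it at about $13-22.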


Solution

Step 1: Install Ubuntu Server 24.04 LTS

# Download and create bootable USB
wget https://releases.ubuntu.com/24.04/ubuntu-24.04-live-server-amd64.iso

# Use Rufus (Windows) or dd (Linux) to flash USB
sudo dd if=ubuntu-24.04-live-server-amd64.iso of=/dev/sdX bs=4M status=progress
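Before running the dd line above, double-check which device is the USB stick; writing the image to the wrong disk is unrecoverable. A quick listing (guarded in case lsblk is unavailable):

```shell
# List block devices with size, model, and transport (usb vs sata/nvme)
# so the right /dev/sdX can be identified before dd runs.
if command -v lsblk >/dev/null 2>&1; then
  DEVICES=$(lsblk -d -o NAME,SIZE,MODEL,TRAN)
else
  DEVICES="lsblk not available on this system"
fi
echo "$DEVICES"
```

The USB stick will show `usb` in the TRAN column; use that device name in place of `/dev/sdX`.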

During installation:

  • Enable OpenSSH server
  • Skip Docker install (we'll use official method)
  • Use entire disk for LVM

Expected: Boots to login prompt, SSH works from your main machine


Step 2: Install NVIDIA Drivers + CUDA

# Add NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# Install driver and CUDA toolkit
sudo apt install -y nvidia-driver-550 cuda-toolkit-12-4

# Reboot to load driver
sudo reboot

Verify installation:

nvidia-smi

You should see: GPU name, driver version 550+, CUDA 12.4

If it fails:

  • "NVIDIA-SMI has failed": Check if GPU is in PCIe slot, reseat card
  • No output: Run sudo ubuntu-drivers autoinstall then reboot

Step 3: Install Ollama for Model Management

# Install Ollama (manages model downloads and inference)
curl -fsSL https://ollama.com/install.sh | sh

# Verify service is running
sudo systemctl status ollama

Why Ollama: Handles quantization, model caching, and API endpoints automatically

Download a coding model:

# DeepSeek-Coder-V2 16B (fits in 12GB VRAM with 4-bit quantization)
ollama pull deepseek-coder-v2:16b-lite-instruct-q4_K_M

# Or for 8GB VRAM: Qwen2.5-Coder 7B
ollama pull qwen2.5-coder:7b-instruct-q4_K_M

Expected: Downloads ~9GB for 16B model, ~4GB for 7B model

Performance check:

# Test inference speed
ollama run deepseek-coder-v2:16b-lite-instruct-q4_K_M "Write a Python function to reverse a string"

You should see: Response in 2-5 seconds on RTX 3060, coherent code output
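The same generation that Continue.dev will trigger later can be exercised directly against Ollama's REST API (`/api/generate`). This sketch writes the request body to a file; the curl line is commented out because it needs the Ollama service running:

```shell
# Build the JSON body for Ollama's /api/generate endpoint.
# "stream": false returns one complete JSON response instead of chunks.
cat > /tmp/ollama_req.json <<'EOF'
{
  "model": "deepseek-coder-v2:16b-lite-instruct-q4_K_M",
  "prompt": "Write a Python function to reverse a string",
  "stream": false
}
EOF
# With the server running:
# curl -s http://localhost:11434/api/generate -d @/tmp/ollama_req.json
```

This is useful for scripting against the server without any editor integration.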


Step 4: Install Continue.dev for VS Code Integration

On your main development machine (not the server):

# Install Continue extension in VS Code
code --install-extension continue.continue

Configure Continue (~/.continue/config.json):

{
  "models": [
    {
      "title": "DeepSeek-Coder-V2 (Home Lab)",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b-lite-instruct-q4_K_M",
      "apiBase": "http://YOUR_SERVER_IP:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder",
    "provider": "ollama", 
    "model": "qwen2.5-coder:7b-instruct-q4_K_M",
    "apiBase": "http://YOUR_SERVER_IP:11434"
  }
}

Replace YOUR_SERVER_IP with your server's local IP (find with ip addr on server)

Why two models: Large model for chat/refactoring, small fast model for autocomplete


Step 5: Configure Firewall for Remote Access

# On the server: Allow Ollama API port
sudo ufw allow 11434/tcp
sudo ufw enable

# Verify Ollama is accessible
curl http://YOUR_SERVER_IP:11434/api/version

You should see: JSON response with Ollama version

Security note: This exposes Ollama on your local network. For internet access, use Tailscale or Cloudflare Tunnel (see Next steps)


Step 6: Optional - Add Web UI with Open WebUI

# Install Docker first
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker

# Run Open WebUI (ChatGPT-like interface)
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Note: inside the container, localhost is the container itself, not the server. The host-gateway mapping lets Open WebUI reach the Ollama service running on the host.

Access at: http://YOUR_SERVER_IP:3000

Expected: Web interface for chatting with your models, no API keys needed


Verification

Test 1: Code Completion in VS Code

  1. Open any .py or .js file
  2. Start typing a function
  3. Press Tab when suggestion appears

You should see: Autocomplete suggestions from your local model in <1 second

Test 2: Chat Refactoring

  1. Select a code block in VS Code
  2. Press Cmd+I (Mac) or Ctrl+I (Windows)
  3. Type: "Add error handling to this function"

You should see: The function rewritten with appropriate error handling (e.g., try/except in Python)

Test 3: Monitor GPU Usage

# On server
watch -n 1 nvidia-smi

During inference you should see:

  • GPU utilization 80-100%
  • Memory usage 8-12GB (for 16B model)
  • Temperature <80°C

If temperature exceeds 80°C: Check GPU fans, improve case airflow
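For logging rather than watching a full screen, nvidia-smi's query mode emits the same three numbers as compact CSV. A one-shot snapshot (guarded so it degrades gracefully on machines without the driver):

```shell
# Single CSV sample of GPU load, memory, and temperature --
# suitable for piping into a log file on a cron schedule.
if command -v nvidia-smi >/dev/null 2>&1; then
  SNAPSHOT=$(nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu --format=csv,noheader)
else
  SNAPSHOT="nvidia-smi not found (driver not installed?)"
fi
echo "$SNAPSHOT"
```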


What You Learned

  • Open-source models (DeepSeek, Qwen) rival commercial options for code tasks
  • 12GB VRAM is the sweet spot for quantized 16B models
  • Used enterprise hardware provides best price/performance

Limitations:

  • Smaller models (7B) make more mistakes than GPT-4
  • No web browsing or real-time data (yet)
  • Requires local network or VPN for remote access

When NOT to use this:

  • You need 100% uptime (cloud is more reliable)
  • Team collaboration (shared infrastructure is complex)
  • Electricity is expensive where you live (at high rates, power costs alone can rival a cheap cloud subscription)

Cost Breakdown

Initial investment:

  • Used server: $200-300
  • GPU (RTX 3060 12GB): $150-200
  • Storage (1TB NVMe): $60
  • Total: $410-560

Monthly costs:

  • Electricity (200W avg, 24/7): ~$18/month
  • No subscription fees

Break-even vs Cursor Pro ($20/month): 20-28 months on hardware alone; electricity costs stretch this further

Break-even vs Devin ($500/month): 1-2 months
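The break-even figures are just hardware cost divided by the subscription price. A sketch using the high end of the build cost (note that ~$18/month in electricity eats most of the margin against cheaper subscriptions):

```shell
# Months to recover the hardware cost against two subscription tiers.
# Hardware-only: electricity is deliberately excluded here.
HARDWARE=560                      # high end of the build cost above
for SUB in 20 500; do             # Cursor-Pro-style vs Devin-style pricing
  MONTHS=$(awk -v h=$HARDWARE -v s=$SUB 'BEGIN { printf "%.0f", h/s }')
  echo "\$${SUB}/month subscription: ~${MONTHS} months to break even (hardware only)"
done
```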


Troubleshooting

"Out of Memory" Error

Symptom: Ollama crashes during inference

Solution:

# Use smaller quantization
ollama pull deepseek-coder-v2:16b-lite-instruct-q3_K_M  # 3-bit instead of 4-bit

# Or switch to 7B model
ollama pull qwen2.5-coder:7b-instruct-q4_K_M

Slow Inference (>10s per response)

Check CPU bottleneck:

htop

If CPU is pinned at 100% while the GPU sits idle, Ollama has fallen back to CPU inference; check that the NVIDIA driver loaded correctly

Confirm Ollama is using the GPU:

# Ollama's startup logs report whether it found a CUDA device
journalctl -u ollama --no-pager | grep -iE "cuda|gpu"

VS Code Can't Connect to Server

Test network connectivity:

# From your dev machine
curl http://SERVER_IP:11434/api/tags

If timeout:

  • Check firewall: sudo ufw status
  • Verify Ollama is listening on all interfaces: sudo ss -tulpn | grep 11434 (by default it binds only to 127.0.0.1)
  • Set Environment="OLLAMA_HOST=0.0.0.0:11434" in the Ollama systemd service, then restart the service
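If Ollama only answers on localhost, the clean fix is a systemd drop-in rather than editing the unit file in place (a drop-in survives package upgrades). Run sudo systemctl edit ollama and paste:

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then run sudo systemctl restart ollama and repeat the curl check from your dev machine.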

Alternative Models to Try

Model | VRAM | Use Case | Strengths
DeepSeek-Coder-V2 16B | 12GB | General coding | Best code quality
Qwen2.5-Coder 32B | 24GB | Advanced tasks | Reasoning, math
CodeLlama 34B | 24GB | Legacy support | Python, C++ expert
StarCoder2 15B | 12GB | Fast autocomplete | Low latency

Quantization cheat sheet:

  • q4_K_M = 4-bit, best quality/speed balance
  • q3_K_M = 3-bit, fits in smaller VRAM
  • q8_0 = 8-bit, near full quality but 2x VRAM
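Approximate model size follows directly from the cheat sheet: parameters × bits-per-weight / 8. A sketch for the 16B model (the bits-per-weight values are rough approximations for these quantization formats, and actual files carry some overhead):

```shell
# Rough on-disk/in-VRAM size estimate per quantization level for a
# 16B-parameter model: params (billions) * bits-per-weight / 8 = GB.
PARAMS_B=16
for Q in "q3_K_M 3.9" "q4_K_M 4.5" "q8_0 8.5"; do
  set -- $Q   # split "name bpw" into $1 and $2
  GB=$(awk -v p=$PARAMS_B -v b=$2 'BEGIN { printf "%.1f", p*b/8 }')
  echo "$1: ~${GB} GB"
done
```

This matches the ~9GB download observed for the q4_K_M 16B model in Step 3, and shows why q8_0 needs roughly twice the VRAM.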

Tested on Ubuntu 24.04 LTS, NVIDIA RTX 3060 12GB, Ollama 0.3.6, Continue.dev 0.9.x

Power tip: Join r/homelab and r/LocalLLaMA for deals on used hardware and model recommendations.