# Local LLM

Run AI models locally with Ollama, LM Studio, and open-source LLMs — privacy-first AI development.
Running LLMs locally gives you privacy, zero API costs, and offline capability. The tooling has matured to the point where a one-command install gets you a capable, ChatGPT-style assistant on consumer hardware.
## Tool Comparison
| Tool | Best for | UI | API | GPU support |
|---|---|---|---|---|
| Ollama | Developers, scripting, Docker | CLI | ✅ OpenAI-compatible | NVIDIA, AMD, Apple |
| LM Studio | Beginners, GUI, model browsing | ✅ Desktop | ✅ Local server | NVIDIA, Apple Silicon |
| GPT4All | Offline-first, no telemetry | ✅ Desktop | ✅ | CPU + NVIDIA |
| Jan AI | Privacy-focused desktop chat | ✅ Desktop | ✅ | NVIDIA, Apple |
| llama.cpp | Maximum control, custom builds | CLI | ✅ Server mode | All backends |
| AnythingLLM | Teams, document chat, RAG | ✅ Web UI | ✅ | Via Ollama |
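Every tool in the table exposes a local HTTP API. As a concrete example, here is a minimal sketch of calling Ollama's native `/api/generate` endpoint using only the Python standard library. It assumes Ollama is running on its default port 11434 with a model such as `llama3.1` already pulled; the function names are illustrative, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_request(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    # stream=False returns one JSON object instead of newline-delimited chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST a prompt to a locally running Ollama and return the completion."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama instance with the model pulled):
#   print(generate("llama3.1", "Explain quantization in one sentence."))
```

Ollama also serves an OpenAI-compatible endpoint under `/v1`, so existing OpenAI client code can usually be pointed at `http://localhost:11434/v1` unchanged.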
## Hardware Requirements
| Model size | RAM needed | GPU option | Quality |
|---|---|---|---|
| 3B–4B (Q4) | 4GB RAM | CPU OK | Good for simple tasks |
| 7B–8B (Q4) | 8GB RAM | 6GB VRAM | General purpose |
| 13B (Q4) | 16GB RAM | 8GB VRAM | Strong reasoning |
| 70B (Q4) | 48GB RAM | 24GB VRAM (partial offload) | Near GPT-4 level |
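The table follows a simple rule of thumb: model memory is roughly parameter count times bits per weight, plus a few GB of runtime overhead for the KV cache and buffers. A quick estimator, using approximate effective bits-per-weight figures for common GGUF quantizations (these are ballpark values, not exact format constants):

```python
# Approximate effective bits per weight for common GGUF quantizations.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def estimate_memory_gb(params_billions: float, quant: str = "Q4_K_M",
                       overhead_gb: float = 1.5) -> float:
    """Rough RAM/VRAM estimate: weights (params x bits / 8) plus overhead."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb + overhead_gb, 1)

# 8B at Q4_K_M lands around 6-7 GB, matching the 8 GB RAM row above;
# 70B at Q4_K_M lands in the mid-40s, matching the 48 GB row.
for size in (8, 13, 70):
    print(f"{size}B Q4_K_M: ~{estimate_memory_gb(size)} GB")
```

Longer context windows grow the KV cache, so treat the overhead term as a floor rather than a ceiling.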
## Quick Start — Three Options
```bash
# Option 1: Ollama (developers, API-first)
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1   # pulls the 8B model by default

# Option 2: LM Studio (beginners, GUI)
# Download from lmstudio.ai → search models → one-click download

# Option 3: llama.cpp (maximum control)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # NVIDIA GPU; omit the flag for CPU-only
cmake --build build --config Release
./build/bin/llama-server -m models/llama-3.1-8b-q4_k_m.gguf --port 8080
```
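All three options end with a local server speaking the OpenAI chat-completions format. A minimal sketch of querying it from Python, assuming the `llama-server` from Option 3 on port 8080 (for Ollama, swap the base URL to `http://localhost:11434/v1`):

```python
import json
import urllib.request

# llama-server from Option 3; for Ollama use http://localhost:11434/v1
BASE_URL = "http://localhost:8080/v1"

def chat_request(messages: list[dict], temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat completion request body."""
    # The served model is fixed at startup, so the model value is informational.
    return {"model": "local", "messages": messages, "temperature": temperature}

def chat(messages: list[dict]) -> str:
    """POST a chat request to the local server and return the reply text."""
    body = json.dumps(chat_request(messages)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires a running server):
#   print(chat([{"role": "user", "content": "Hello!"}]))
```

Because the wire format matches OpenAI's, official OpenAI client libraries also work against these servers by overriding the base URL.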
## Model Selection Guide
| Model | Strengths | Quantization | Tool |
|---|---|---|---|
| Llama 3.1 8B | General purpose, fast | Q4_K_M | Ollama, LM Studio |
| DeepSeek R1 7B | Reasoning, math | Q4_K_M | Ollama |
| Qwen 2.5-Coder 7B | Code generation | Q4_K_M | Ollama, LM Studio |
| Mistral 7B | Fast, instruction following | Q4_K_M | Ollama |
| nomic-embed-text | Embeddings only | — | Ollama |
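The last row is an embedding model rather than a chat model: it turns text into vectors you compare numerically. A sketch of fetching embeddings from Ollama's `/api/embeddings` endpoint and scoring similarity, assuming Ollama is running with `nomic-embed-text` pulled:

```python
import json
import math
import urllib.request

EMBED_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch an embedding vector from a locally running Ollama."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        EMBED_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Example (requires Ollama with nomic-embed-text pulled):
#   print(cosine(embed("local llm"), embed("running models offline")))
```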
## Learning Path
- Install your tool — Ollama for API access, LM Studio for GUI
- Pick the right model — match task to model strengths
- Understand quantization — Q4_K_M is the default sweet spot
- Configure GPU offloading — maximize VRAM usage
- Connect to your app — OpenAI-compatible API, LangChain, or direct HTTP
- RAG with local LLMs — combine with Ollama embeddings + pgvector
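The final step above, retrieval-augmented generation, reduces to a small core: embed your document chunks, rank them against the query, and paste the best matches into the prompt. A self-contained sketch of that core, with an in-memory list standing in for pgvector (in real use the vectors would come from an embedding model such as nomic-embed-text):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]],
          k: int = 3) -> list[str]:
    """Rank (text, vector) chunks by similarity to the query, best first."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a grounded prompt from the retrieved chunks."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Swapping the in-memory `sorted` call for a pgvector nearest-neighbor query changes the storage layer without changing the shape of the pipeline.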
## Articles
- vLLM vs TGI: LLM Serving Framework Comparison 2026
- Split Large Models Across GPUs: LM Studio Multi-GPU Setup 2026
- Setup LM Studio Preset System Prompts: Custom Chat Templates 2026
- Run SGLang: Fast LLM Inference with Structured Generation 2026
- Run MLX Models in LM Studio: Apple Silicon Guide 2026
- Run llama.cpp Server: OpenAI-Compatible API from GGUF Models 2026
- LM Studio vs Ollama: Developer Experience Comparison 2026
- LM Studio GGUF vs GPTQ: Which Quantization Format? 2026
- Deploy vLLM: Production LLM API with OpenAI Compatibility 2026
- Deploy ML Workloads on Modal Serverless GPU Compute 2026
- Configure LM Studio GPU Layers: Optimize VRAM Usage 2026
- Compile llama.cpp: CPU, CUDA, and Metal Backends 2026
- Build Apps with LM Studio REST API and Local LLMs 2026
- Setup LM Studio API Server: OpenAI-Compatible Local Endpoint 2026
- Run Qwen2.5 Quantized GGUF on 8GB VRAM: Local Setup 2026
- Run Qwen 2.5 72B Locally: Ollama and LM Studio Setup 2026
- Run and Fine-Tune LLMs on Mac with MLX-LM 2026
- Manage LM Studio Models: Download, Organize, Switch 2026
- Deploy Qwen2.5-VL Locally: Vision Language Model Setup 2026
- DeepSeek V3.2 vs Llama 4: Which to Self-Host in 2026?
- Running a Local AI Coding Assistant with Ollama and Continue.dev: No API Key Required
- Reliable Structured Output from Local LLMs: JSON Extraction Without Hallucination
- Multi-GPU Ollama Setup for Large Model Inference: 70B Models on Consumer Hardware
- Fine-Tuning Local LLMs with Ollama: Domain Adaptation Without Cloud Costs
- Deploying Ollama as a Production API Server: Docker, Load Balancing, and Monitoring
- Building a Private RAG System with Ollama: Chat with Your Documents Locally
- AI Model Supply Chain Security: Verifying Weights Before You Run Them
- Run Llama 5 70B Locally on MacBook Pro M5 in 15 Minutes
- Run AI Free: Open-Source Models That Beat Paid Subscriptions
- Protect Fine-Tuned Model Weights from Theft in 2026