# Ollama
Browse articles on Ollama — tutorials, guides, and in-depth comparisons.
Ollama is the fastest way to run large language models locally — one command to pull a model, one command to run it. No Python environment, no API keys, no cloud dependency.
## What You Can Do with Ollama
- Run 100+ open-source LLMs — Llama 3.3, Mistral, DeepSeek R1, Qwen 2.5, Gemma, and more
- OpenAI-compatible REST API — drop-in replacement for `api.openai.com` in any app
- GPU acceleration — NVIDIA CUDA, AMD ROCm, and Apple Metal (M1/M2/M3) out of the box
- Modelfiles — customize system prompts, temperature, and context length per model
- Multimodal — vision models like LLaVA and BakLLaVA for image + text tasks
## Quick Start
```shell
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3.3 (ships as a 70B model; ~35GB of RAM/VRAM for the Q4 build)
ollama pull llama3.3
ollama run llama3.3

# Query the OpenAI-compatible API (port 11434)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3","messages":[{"role":"user","content":"Hello"}]}'
```
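The same API can be called from Python with nothing beyond the standard library. A minimal sketch, assuming a local Ollama server on the default port 11434 (the `build_chat_request` and `chat` helper names are ours, not part of any SDK):

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload for Ollama's /v1 endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.loads(resp.read())
    # Response shape matches OpenAI's chat completions format
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(chat("llama3.3", "Hello"))
    except (urllib.error.URLError, OSError):
        print("No Ollama server on port 11434 -- start one with `ollama serve`")
```

Because the request and response shapes match OpenAI's, the official `openai` Python SDK also works by pointing its `base_url` at `http://localhost:11434/v1`.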
## Learning Path
- Install Ollama and run your first model — setup on Mac, Linux, Windows
- Choose the right quantization — Q4_K_M for quality, Q3_K_S for low VRAM
- Create a Modelfile — custom system prompts, parameters, persistent config
- Connect to your app — Python `requests`, LangChain, LlamaIndex, or direct REST
- Scale up — GPU layer offloading, concurrent requests, load balancing
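The Modelfile step above is the one that makes a configuration stick across sessions. A minimal sketch (the "reviewer" persona and parameter values are illustrative, not from any official example):

```
# Modelfile -- hypothetical example: a terse code-review assistant
FROM llama3.3
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM "You are a concise code reviewer. Point out bugs first, style second."
```

Build and run it with `ollama create reviewer -f Modelfile` followed by `ollama run reviewer`; the system prompt and parameters then apply every time without being re-sent per request.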
## Model Selection Guide
| Model | Size | Best for | VRAM needed |
|---|---|---|---|
| Llama 3.1 8B | 4.7GB | General use, fast | 6GB |
| Llama 3.3 70B Q4 | 35GB | High quality | 16GB VRAM + system RAM (partial offload) |
| DeepSeek R1 7B | 4.7GB | Reasoning tasks | 6GB |
| Qwen 2.5-Coder 7B | 4.7GB | Code generation | 6GB |
| nomic-embed-text | 274MB | Embeddings / RAG | CPU OK |
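The table can double as a pre-flight check before pulling anything. A sketch that picks a model for a given VRAM budget, with sizes taken from the table above (the Ollama tags are approximate — confirm exact tags with `ollama list` or the library page — and `pick_model` is our helper, not an Ollama API):

```python
# Models from the table above: (ollama tag, download GB, min VRAM GB, best for).
# Tags and VRAM floors are approximations; adjust for your own quantizations.
MODELS = [
    ("llama3.3:70b",     35.0, 16.0, "high quality (GPU + system RAM offload)"),
    ("llama3.1:8b",       4.7,  6.0, "general use, fast"),
    ("deepseek-r1:7b",    4.7,  6.0, "reasoning tasks"),
    ("qwen2.5-coder:7b",  4.7,  6.0, "code generation"),
    ("nomic-embed-text",  0.27, 0.0, "embeddings / rag (cpu ok)"),
]

def pick_model(vram_gb: float, task: str = "") -> str:
    """Return the first (largest) model whose VRAM floor fits the budget,
    preferring one whose 'best for' description mentions the task."""
    fits = [m for m in MODELS if m[2] <= vram_gb]
    for tag, _, _, best_for in fits:
        if task and task.lower() in best_for:
            return tag
    # No task match: fall back to the largest model that fits at all
    return fits[0][0] if fits else "nomic-embed-text"

print(pick_model(8, "code"))   # qwen2.5-coder:7b
print(pick_model(24))          # llama3.3:70b
```

The ordering matters: `MODELS` is sorted largest-first, so the fallback always returns the highest-quality model the hardware can hold.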
## Articles (361–390 of 490 · Page 13 of 17)
- How to Optimize Image Resolution for Ollama Vision Models: Complete Guide
- How to Monitor Ollama Network Traffic: Complete Security Auditing Guide 2025
- How to Monitor Ollama in Production: Logging and Alerting Setup
- How to Integrate Ollama with Jupyter Notebooks: Complete Data Science Setup Guide
- How to Fix Tool Calling Errors in Ollama: Complete Debugging Guide
- How to Fix Ollama Tool Execution Timeouts: Complete Troubleshooting Guide
- How to Fix Ollama Out of Memory Errors: Complete System Tuning Guide
- How to Fix Ollama Network Binding Errors: Complete IP Configuration Guide
- How to Fix Ollama Memory Leaks: Complete System Maintenance Guide
- How to Fix "Image Format Not Supported" Error in Ollama Vision - Complete Troubleshooting Guide
- How to Containerize Ollama with Docker: Production-Ready Tutorial
- How to Connect Ollama to Open WebUI: User-Friendly Interface Setup
- How to Configure Ollama Firewall Rules: Security Best Practices for Local AI
- How to Clear Ollama Model Cache: Complete Storage Management Guide 2025
- How to Build AI Agents with Ollama Tool Calling: Complete Guide
- How to Backup and Restore Ollama Models: Complete Disaster Recovery Guide
- Fix Ollama Vision Model Loading Errors: Complete Troubleshooting Guide 2025
- Fix Ollama RAG Memory Issues When Processing Large Documents
- Fix Ollama Plugin Conflicts in IDEs: Complete Troubleshooting Guide 2025
- Continue.dev with Ollama: AI Code Completion Tutorial - Local AI Assistant Setup
- Chart and Diagram Analysis with Ollama LLaVA: Complete Guide
- Build OCR System with Ollama Vision Models: Complete Tutorial
- Advanced Ollama Tool Chaining: Build Multi-Step AI Workflows That Actually Work
- Troubleshooting Qwen3 Installation on macOS: Complete M1/M2 Compatibility Guide
- Troubleshooting Ollama API Rate Limiting: Complete Performance Optimization Guide
- Troubleshooting Modelfile Syntax Errors: 7 Common Mistakes That Break Your AI Models
- Step-by-Step: Setting Up Qwen3 Multilingual Support with Ollama
- Step-by-Step: Installing Ollama with NVIDIA GPU Support and CUDA
- Step-by-Step: Creating Knowledge Base with Ollama and ChromaDB
- Step-by-Step: Building Chatbots with Ollama API and React