
Ollama

Browse articles on Ollama — tutorials, guides, and in-depth comparisons.

490 articles

Ollama is the fastest way to run large language models locally — one command to pull a model, one command to run it. No Python environment, no API keys, no cloud dependency.

What You Can Do with Ollama

  • Run 100+ open-source LLMs — Llama 3.3, Mistral, DeepSeek R1, Qwen 2.5, Gemma, and more
  • OpenAI-compatible REST API — drop-in replacement for api.openai.com in any app
  • GPU acceleration — NVIDIA CUDA, AMD ROCm, and Apple Metal (M1/M2/M3) out of the box
  • Modelfiles — customize system prompts, temperature, and context length per model
  • Multimodal — vision models like LLaVA and BakLLaVA for image + text tasks
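
Vision models such as LLaVA accept images through Ollama's native API as base64 strings alongside the prompt. A minimal sketch using only the standard library (the model tag `llava`, the prompt, and the helper names are illustrative; assumes `ollama serve` is running on the default port):

```python
import base64
import json
import urllib.request

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Native-API request body: images are passed as base64-encoded strings."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode()],
        "stream": False,
    }

def describe_image(path: str, model: str = "llava") -> str:
    """Send an image file to a local vision model and return its answer."""
    with open(path, "rb") as f:
        body = build_vision_request(model, "What is in this picture?", f.read())
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Call it as `describe_image("photo.jpg")` once the model is pulled with `ollama pull llava`.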

Quick Start

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3.3 (70B only; ~35GB RAM for the default Q4 build)
ollama pull llama3.3
ollama run llama3.3

# OpenAI-compatible API (port 11434)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3","messages":[{"role":"user","content":"Hello"}]}'

Learning Path

  1. Install Ollama and run your first model — setup on Mac, Linux, Windows
  2. Choose the right quantization — Q4_K_M for quality, Q3_K_S for low VRAM
  3. Create a Modelfile — custom system prompts, parameters, persistent config
  4. Connect to your app — Python requests, LangChain, LlamaIndex, or direct REST
  5. Scale up — GPU layer offloading, concurrent requests, load balancing

Model Selection Guide

Model               Size    Best for            VRAM needed
Llama 3.1 8B        4.7GB   General use, fast   6GB
Llama 3.3 70B Q4    35GB    High quality        16GB + system RAM
DeepSeek R1 7B      4.7GB   Reasoning tasks     6GB
Qwen 2.5-Coder 7B   4.7GB   Code generation     6GB
nomic-embed-text    274MB   Embeddings / RAG    CPU OK
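
The table above boils down to a lookup you can automate: given a VRAM budget, keep the models whose floor fits. A toy sketch hard-coding the rows (names shortened, figures taken from the table, not measured; a real tool would query `ollama list`):

```python
# (model, minimum VRAM in GB) — per the table above; 0 means CPU is fine
MODELS = [
    ("Llama 70B Q4", 16),
    ("Llama 8B", 6),
    ("DeepSeek R1 7B", 6),
    ("Qwen 2.5-Coder 7B", 6),
    ("nomic-embed-text", 0),
]

def models_that_fit(vram_gb: float) -> list[str]:
    """Return the models from the table whose VRAM floor fits the budget."""
    return [name for name, need in MODELS if need <= vram_gb]
```

For example, `models_that_fit(8)` keeps the 7B/8B models and the embedder but drops the 70B, which needs 16GB of VRAM plus system RAM for offloaded layers.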
