Chat With PDFs Locally Using RAG in 20 Minutes

Build a privacy-first document chatbot with local LLMs, vector search, and PDF parsing—no API keys or cloud services required.

Problem: Your Company Docs Can't Leave Your Network

You need to chat with internal PDFs (technical specs, manuals, contracts) but can't send them to OpenAI or Anthropic for compliance reasons.

You'll learn:

  • How to extract and chunk PDF text properly
  • Local embedding generation with sentence-transformers
  • Vector search without Pinecone or cloud databases
  • Running Llama 3 locally for answers

Time: 20 min | Level: Intermediate


Why This Happens

Cloud RAG services require uploading your documents to third-party servers. For regulated industries (healthcare, finance, defense) or proprietary docs, that's a non-starter.

Common symptoms:

  • Can't use ChatGPT plugins for internal docs
  • Legal blocks on cloud AI services
  • Need offline operation for secure environments
  • Want to avoid per-query API costs

Solution

Step 1: Install Dependencies

# Create isolated environment
python3 -m venv rag_env
source rag_env/bin/activate  # Windows: rag_env\Scripts\activate

# Core libraries
pip install \
    pymupdf==1.24.0 \
    sentence-transformers==2.3.1 \
    chromadb==0.4.22 \
    ollama==0.1.7

Why these?

  • pymupdf: Fast PDF text extraction, handles complex layouts
  • sentence-transformers: Local embeddings (no API)
  • chromadb: Embedded vector DB, zero config
  • ollama: Local LLM runtime

Expected: Installation completes in 2-3 minutes. sentence-transformers downloads the ~90MB all-MiniLM-L6-v2 model on first use.


Step 2: Install Local LLM

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Llama 3.1 8B (5GB download)
ollama pull llama3.1:8b

# Verify it works
ollama run llama3.1:8b "Hello"

If it fails:

  • Windows: Download installer from ollama.com/download
  • GPU not detected: Add CUDA_VISIBLE_DEVICES=0 to environment
  • Out of memory: Use llama3.2:3b instead (Llama 3.1 has no 3B variant; 3.2 does)

Step 3: Create the RAG Pipeline

# rag_chat.py
import fitz  # pymupdf
from sentence_transformers import SentenceTransformer
import chromadb
import ollama

class LocalRAG:
    def __init__(self, pdf_path):
        # Load embedding model (runs on CPU, ~2 sec)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        
        # Initialize vector store
        self.client = chromadb.Client()
        self.collection = self.client.create_collection("docs")
        
        # Process PDF
        self._ingest_pdf(pdf_path)
    
    def _ingest_pdf(self, pdf_path):
        """Extract text and create searchable chunks"""
        doc = fitz.open(pdf_path)
        chunks = []
        
        for page_num, page in enumerate(doc):
            text = page.get_text()
            
            # Chunk by paragraph (better than fixed size)
            paragraphs = [p.strip() for p in text.split('\n\n') if len(p.strip()) > 50]
            
            for i, para in enumerate(paragraphs):
                chunks.append({
                    'text': para,
                    'metadata': {'page': page_num + 1, 'chunk': i}
                })
        
        # Generate embeddings
        texts = [c['text'] for c in chunks]
        embeddings = self.embedder.encode(texts, show_progress_bar=True)
        
        # Store in vector DB
        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=texts,
            ids=[f"chunk_{i}" for i in range(len(texts))],
            metadatas=[c['metadata'] for c in chunks]
        )
        
        print(f"✓ Indexed {len(chunks)} chunks from {len(doc)} pages")
    
    def ask(self, question):
        """Query the RAG system"""
        # Find relevant chunks
        query_embedding = self.embedder.encode([question])[0]
        
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=3  # Top 3 most relevant chunks
        )
        
        # Build context from retrieved chunks
        context = "\n\n".join(results['documents'][0])
        
        # Generate answer with local LLM
        prompt = f"""Based on this documentation:

{context}

Question: {question}

Answer concisely using only the information above. If the answer isn't in the docs, say so."""

        response = ollama.generate(
            model='llama3.1:8b',
            prompt=prompt,
            stream=False
        )
        
        return {
            'answer': response['response'],
            'sources': [f"Page {m['page']}" for m in results['metadatas'][0]]
        }

# Usage
if __name__ == "__main__":
    rag = LocalRAG("technical_manual.pdf")
    
    result = rag.ask("What are the safety precautions?")
    print(f"Answer: {result['answer']}")
    print(f"Sources: {', '.join(result['sources'])}")

Why this architecture:

  • Paragraph chunking preserves semantic meaning (better than 512-char splits)
  • all-MiniLM-L6-v2 model is fast (50ms per embedding) and accurate for retrieval
  • ChromaDB runs embedded with zero config; swap chromadb.Client() for chromadb.PersistentClient(path="./index") if you want the index saved to disk between runs
  • 3 chunks balance context size vs relevance
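Under the hood, the retrieval step is just nearest-neighbor search over embedding vectors. A toy sketch of that ranking in plain Python (illustrative only; ChromaDB's indexed implementation is far faster, and cosine similarity is the metric all-MiniLM-L6-v2 is typically used with):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

This is exactly what `collection.query(..., n_results=3)` is doing for you, minus the indexing.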

Step 4: Test It

python rag_chat.py

You should see:

✓ Indexed 247 chunks from 45 pages
Answer: The manual lists three main precautions: 1) Wear protective equipment...
Sources: Page 12, Page 13, Page 15

Performance expectations:

  • PDF ingestion: ~2 seconds per page
  • Query response: 3-5 seconds (embedding + LLM generation)
  • Memory usage: ~4GB (LLM model + embeddings)
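These figures vary widely by hardware, so measure your own. A generic stopwatch helper (not part of the pipeline; the commented example assumes a `rag` instance exists):

```python
import time

def stopwatch(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example against the pipeline:
# answer, secs = stopwatch(rag.ask, "What are the safety precautions?")
# print(f"query took {secs:.1f}s")
```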

Step 5: Add Simple CLI Interface

# chat.py
from rag_chat import LocalRAG
import sys

def main():
    if len(sys.argv) < 2:
        print("Usage: python chat.py <pdf_file>")
        sys.exit(1)
    
    rag = LocalRAG(sys.argv[1])
    
    print("\nChat with your PDF (type 'quit' to exit)\n")
    
    while True:
        question = input("You: ").strip()
        
        if question.lower() in ['quit', 'exit']:
            break
        
        if not question:
            continue
        
        result = rag.ask(question)
        print(f"\nAssistant: {result['answer']}")
        print(f"📄 Sources: {', '.join(result['sources'])}\n")

if __name__ == "__main__":
    main()
Run it:

python chat.py documentation.pdf

Interactive experience:

You: How do I reset the admin password?
Assistant: Navigate to Settings > Security > Reset Password...
📄 Sources: Page 34, Page 35

Verification

Accuracy test:

# Ask a question you know the answer to from the PDF
python chat.py manual.pdf
You: What is the warranty period?

You should see: Correct answer citing specific pages. If answer is wrong:

  • Too generic: Reduce n_results to 2 for more focused context
  • Hallucinating: Pass options={'temperature': 0} to ollama.generate() for near-deterministic responses (0.3 still samples)
  • Wrong pages: Check PDF text extraction with pdftotext manual.pdf to verify readability
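To guard against regressions when you tweak chunking or prompts, a tiny spot-check harness helps. This is a hypothetical helper, not part of the article's pipeline; the question/expected pairs must come from facts in your own PDF:

```python
def spot_check(ask, cases):
    """Run (question, expected_phrase) pairs through an ask() callable
    and return the list of (question, answer) failures."""
    failures = []
    for question, expected in cases:
        answer = ask(question)['answer']
        if expected.lower() not in answer.lower():
            failures.append((question, answer))
    return failures

# Example with the real pipeline (facts here are made up):
# failures = spot_check(rag.ask, [
#     ("What is the warranty period?", "12 months"),
# ])
# print("all passed" if not failures else failures)
```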

Production Improvements

Better Chunking Strategy

def _smart_chunk(self, text, max_tokens=512):
    """Chunk by sentences with overlap (requires a one-time nltk.download('punkt'))"""
    from nltk.tokenize import sent_tokenize
    
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0
    
    for sentence in sentences:
        sentence_length = len(sentence.split())
        
        # Guard on current_chunk so a single long sentence
        # never emits an empty chunk
        if current_chunk and current_length + sentence_length > max_tokens:
            chunks.append(' '.join(current_chunk))
            # Overlap: carry the last 2 sentences into the next chunk
            current_chunk = current_chunk[-2:]
            current_length = sum(len(s.split()) for s in current_chunk)
        
        current_chunk.append(sentence)
        current_length += sentence_length
    
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks
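If you would rather avoid the NLTK dependency (and its punkt download), a regex-based approximation of the same sentence-with-overlap strategy works for well-punctuated English prose:

```python
import re

def simple_chunk(text, max_words=512, overlap=2):
    """Sentence-based chunking with overlap, using a naive regex splitter.
    Fine for clean English text; prefer NLTK's sent_tokenize for messier input."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, length = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and length + words > max_words:
            chunks.append(' '.join(current))
            current = current[-overlap:]  # carry trailing context forward
            length = sum(len(s.split()) for s in current)
        current.append(sentence)
        length += words
    if current:
        chunks.append(' '.join(current))
    return chunks
```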
Hybrid Search (Vector + Keyword)

# Combine vector similarity with keyword matching (pip install rank_bm25)
from rank_bm25 import BM25Okapi

class HybridRAG(LocalRAG):
    def __init__(self, pdf_path):
        super().__init__(pdf_path)
        # Build BM25 index for keyword search
        # (assumes LocalRAG is extended to keep raw chunk texts in self.texts)
        self.bm25 = BM25Okapi([doc.split() for doc in self.texts])
    
    def ask(self, question):
        # Vector search (semantic) -- _vector_search is a sketched helper,
        # not defined in the LocalRAG class above
        vector_results = self._vector_search(question, n=5)
        
        # Keyword search (exact terms)
        keyword_scores = self.bm25.get_scores(question.split())
        keyword_results = sorted(enumerate(keyword_scores), key=lambda x: x[1], reverse=True)[:5]
        
        # Merge and re-rank -- _merge_results is also left as a sketch
        combined = self._merge_results(vector_results, keyword_results)
        # ... rest of answer generation
Why hybrid: Catches both conceptual matches ("troubleshooting" finds "problem solving") and exact terms ("error code E404").
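The `_merge_results` step left as a sketch above is commonly implemented with reciprocal rank fusion (RRF), which only needs each list's ranking, not comparable scores. A minimal version (the function name is illustrative; k=60 is the conventional constant from the original RRF paper):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank).
    Documents appearing high in several lists float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Feed it the chunk ids from the vector ranking and the BM25 ranking, then take the top few merged ids as context.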


What You Learned

  • Local RAG needs 3 components: PDF parser, embeddings, LLM
  • Paragraph-based chunking beats fixed-size for technical docs
  • ChromaDB is zero-config and fast enough for <100K documents
  • Llama 3.1 8B gives roughly GPT-3.5-class answers for grounded Q&A like this

Limitations:

  • 8B models struggle with complex reasoning (use 70B if you have VRAM)
  • English-only (multilingual needs different embedding model)
  • No OCR (scanned PDFs need pytesseract preprocessing)

When NOT to use this:

  • Need GPT-4 level reasoning → Use cloud API with encrypted uploads
  • Have >1M documents → Consider Qdrant or Weaviate
  • Need sub-second responses → Add GPU acceleration

Cost Comparison

Local RAG (this solution):

  • Setup: $0 (open source)
  • Per query: $0
  • 1000 queries/month: $0

Cloud RAG (OpenAI + Pinecone):

  • Setup: $0
  • Per query: ~$0.002 (embeddings) + $0.02 (GPT-4) = $0.022
  • 1000 queries/month: $22

Break-even: Immediate for any consistent usage. Local also keeps your data private.
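The break-even arithmetic generalizes to your own volume. A one-liner to plug in your numbers (the $0.022 per-query figure above is an estimate, and cloud pricing changes often):

```python
def monthly_cloud_cost(queries_per_month, cost_per_query=0.022):
    """Cloud RAG spend per month; the local pipeline's marginal cost is $0."""
    return queries_per_month * cost_per_query

print(f"${monthly_cloud_cost(1000):.2f}/month")  # $22.00/month, matching the figure above
```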


Troubleshooting

"CUDA out of memory" with Ollama:

# Force CPU inference (slower but stable) by offloading zero layers to GPU
response = ollama.generate(model='llama3.1:8b', prompt=prompt,
                           options={'num_gpu': 0})

ChromaDB "collection already exists":

# In __init__, use the idempotent variant instead of create_collection
self.collection = self.client.get_or_create_collection("docs")
# Or explicitly clear stale data first with self.client.delete_collection("docs")

PDF text extraction fails:

# Fallback for scanned PDFs (pip install pytesseract pillow,
# plus the tesseract binary itself)
from PIL import Image
import pytesseract

text = page.get_text()
if not text.strip():
    # Scanned pages return empty text rather than raising --
    # rasterize the page and OCR it instead
    pix = page.get_pixmap()
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    text = pytesseract.image_to_string(img)

Tested on Python 3.11, Ollama 0.1.23, Ubuntu 22.04 & macOS 14 (Apple Silicon)

Hardware requirements:

  • CPU: 4+ cores recommended
  • RAM: 8GB minimum (16GB comfortable)
  • Disk: 10GB for models + index
  • GPU: Optional (3x faster with CUDA)