Problem: Your Company Docs Can't Leave Your Network
You need to chat with internal PDFs (technical specs, manuals, contracts) but can't send them to OpenAI or Anthropic for compliance reasons.
You'll learn:
- How to extract and chunk PDF text properly
- Local embedding generation with sentence-transformers
- Vector search without Pinecone or cloud databases
- Running Llama 3 locally for answers
Time: 20 min | Level: Intermediate
Why This Happens
Cloud RAG services require uploading your documents to third-party servers. For regulated industries (healthcare, finance, defense) or proprietary docs, that's a non-starter.
Common symptoms:
- Can't use ChatGPT plugins for internal docs
- Legal blocks on cloud AI services
- Need offline operation for secure environments
- Want to avoid per-query API costs
Solution
Step 1: Install Dependencies
```shell
# Create isolated environment
python3 -m venv rag_env
source rag_env/bin/activate  # Windows: rag_env\Scripts\activate

# Core libraries (inside the venv, no --break-system-packages needed)
pip install \
    pymupdf==1.24.0 \
    sentence-transformers==2.3.1 \
    chromadb==0.4.22 \
    ollama==0.1.7
```
Why these?
- pymupdf: Fast PDF text extraction, handles complex layouts
- sentence-transformers: Local embeddings (no API)
- chromadb: Embedded vector DB, zero config
- ollama: Local LLM runtime
Expected: Installation completes in 2-3 minutes. Sentence-transformers downloads ~500MB model on first run.
Step 2: Install Local LLM
```shell
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Llama 3.1 8B (~5GB download)
ollama pull llama3.1:8b

# Verify it works
ollama run llama3.1:8b "Hello"
```
If it fails:
- Windows: Download the installer from ollama.com/download
- GPU not detected: Set `CUDA_VISIBLE_DEVICES=0` in the environment
- Out of memory: Use a smaller model such as `llama3.2:3b` (Llama 3.1 has no 3B variant)
Step 3: Create the RAG Pipeline
```python
# rag_chat.py
import fitz  # pymupdf
from sentence_transformers import SentenceTransformer
import chromadb
import ollama


class LocalRAG:
    def __init__(self, pdf_path):
        # Load embedding model (runs on CPU, ~2 sec)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

        # Initialize vector store
        self.client = chromadb.Client()
        self.collection = self.client.create_collection("docs")

        # Process PDF
        self._ingest_pdf(pdf_path)

    def _ingest_pdf(self, pdf_path):
        """Extract text and create searchable chunks"""
        doc = fitz.open(pdf_path)
        chunks = []
        for page_num, page in enumerate(doc):
            text = page.get_text()
            # Chunk by paragraph (better than fixed size)
            paragraphs = [p.strip() for p in text.split('\n\n') if len(p.strip()) > 50]
            for i, para in enumerate(paragraphs):
                chunks.append({
                    'text': para,
                    'metadata': {'page': page_num + 1, 'chunk': i}
                })

        # Generate embeddings
        texts = [c['text'] for c in chunks]
        embeddings = self.embedder.encode(texts, show_progress_bar=True)

        # Store in vector DB
        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=texts,
            ids=[f"chunk_{i}" for i in range(len(texts))],
            metadatas=[c['metadata'] for c in chunks]
        )
        print(f"✓ Indexed {len(chunks)} chunks from {len(doc)} pages")

    def ask(self, question):
        """Query the RAG system"""
        # Find relevant chunks
        query_embedding = self.embedder.encode([question])[0]
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=3  # Top 3 most relevant chunks
        )

        # Build context from retrieved chunks
        context = "\n\n".join(results['documents'][0])

        # Generate answer with local LLM
        prompt = f"""Based on this documentation:
{context}
Question: {question}
Answer concisely using only the information above. If the answer isn't in the docs, say so."""

        response = ollama.generate(
            model='llama3.1:8b',
            prompt=prompt,
            stream=False
        )
        return {
            'answer': response['response'],
            'sources': [f"Page {m['page']}" for m in results['metadatas'][0]]
        }


# Usage
if __name__ == "__main__":
    rag = LocalRAG("technical_manual.pdf")
    result = rag.ask("What are the safety precautions?")
    print(f"Answer: {result['answer']}")
    print(f"Sources: {', '.join(result['sources'])}")
```
Why this architecture:
- Paragraph chunking preserves semantic meaning (better than 512-char splits)
- all-MiniLM-L6-v2 model is fast (50ms per embedding) and accurate for retrieval
- ChromaDB persists to disk automatically
- 3 chunks balance context size vs relevance
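The paragraph-chunking rule from `_ingest_pdf` (split on blank lines, drop fragments of 50 characters or fewer) can be sanity-checked on its own. This standalone sketch mirrors that filter on a made-up page:

```python
def paragraph_chunks(page_text, min_chars=50):
    """Split page text on blank lines, keeping only substantial paragraphs."""
    return [p.strip() for p in page_text.split('\n\n') if len(p.strip()) > min_chars]

sample = (
    "WARNING: Always disconnect power before servicing the unit to avoid "
    "electrical shock.\n\n"
    "Fig. 3\n\n"  # short caption fragment, filtered out
    "Routine maintenance should be performed every six months by qualified "
    "personnel using the checklist in Appendix B."
)

chunks = paragraph_chunks(sample)
print(len(chunks))    # 2 — the "Fig. 3" fragment is dropped
print(chunks[0][:7])  # WARNING
```

The length filter is what keeps figure captions, page numbers, and header debris out of the index.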
Step 4: Test It
```shell
python rag_chat.py
```
You should see:
```
✓ Indexed 247 chunks from 45 pages
Answer: The manual lists three main precautions: 1) Wear protective equipment...
Sources: Page 12, Page 13, Page 15
```
Performance expectations:
- PDF ingestion: ~2 seconds per page
- Query response: 3-5 seconds (embedding + LLM generation)
- Memory usage: ~4GB (LLM model + embeddings)
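To put the memory numbers in perspective: all-MiniLM-L6-v2 produces 384-dimensional float32 vectors, so the vector index itself is tiny next to the LLM. A back-of-the-envelope check, using the chunk count from the test run above:

```python
# Rough index size for the example 45-page manual
chunks = 247          # from the sample run above
dims = 384            # all-MiniLM-L6-v2 embedding width
bytes_per_float = 4   # float32

index_bytes = chunks * dims * bytes_per_float
print(f"{index_bytes / 1024:.0f} KB")  # roughly 370 KB — negligible vs the ~5GB LLM
```

Almost all of the ~4GB footprint is the LLM weights; even a 1,000-page corpus would add only a few megabytes of embeddings.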
Step 5: Add Simple CLI Interface
```python
# chat.py
from rag_chat import LocalRAG
import sys


def main():
    if len(sys.argv) < 2:
        print("Usage: python chat.py <pdf_file>")
        sys.exit(1)

    rag = LocalRAG(sys.argv[1])
    print("\nChat with your PDF (type 'quit' to exit)\n")

    while True:
        question = input("You: ").strip()
        if question.lower() in ['quit', 'exit']:
            break
        if not question:
            continue
        result = rag.ask(question)
        print(f"\nAssistant: {result['answer']}")
        print(f"📄 Sources: {', '.join(result['sources'])}\n")


if __name__ == "__main__":
    main()
```
```shell
python chat.py documentation.pdf
```
Interactive experience:
```
You: How do I reset the admin password?
Assistant: Navigate to Settings > Security > Reset Password...
📄 Sources: Page 34, Page 35
```
Verification
Accuracy test:
```shell
# Ask a question you know the answer to from the PDF
python chat.py manual.pdf
```

```
You: What is the warranty period?
```
You should see a correct answer citing specific pages. If the answer is wrong:
- Too generic: Reduce `n_results` to 2 for more focused context
- Hallucinating: Pass `options={"temperature": 0.3}` to ollama.generate() for less random responses
- Wrong pages: Check PDF text extraction with `pdftotext manual.pdf` to verify readability
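For repeatable spot-checks, a small keyword-based harness helps. This is a sketch: `check_answer` is a hypothetical helper, and a real run would feed it `rag.ask(...)["answer"]` rather than the canned string used here:

```python
def check_answer(answer, required_keywords):
    """Pass if every expected keyword appears in the answer (case-insensitive)."""
    missing = [k for k in required_keywords if k.lower() not in answer.lower()]
    return (len(missing) == 0, missing)

# In a real run: answer = rag.ask("What is the warranty period?")["answer"]
answer = "The standard warranty period is 24 months from the date of purchase."
ok, missing = check_answer(answer, ["warranty", "24 months"])
print(ok, missing)  # True []
```

Keyword checks are crude but catch the worst failure mode (confident answers with no grounding in the document) without requiring a second LLM as judge.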
Production Improvements
Better Chunking Strategy
```python
def _smart_chunk(self, text, max_tokens=512):
    """Chunk by sentences with overlap"""
    from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt')

    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        sentence_length = len(sentence.split())
        if current_length + sentence_length > max_tokens:
            chunks.append(' '.join(current_chunk))
            # Overlap: keep last 2 sentences
            current_chunk = current_chunk[-2:] + [sentence]
            current_length = sum(len(s.split()) for s in current_chunk)
        else:
            current_chunk.append(sentence)
            current_length += sentence_length

    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
```
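The overlap behavior is easiest to see on toy input. This standalone version swaps NLTK's sent_tokenize for a naive regex split (an assumption for the demo, not a production tokenizer) but keeps the same sliding-window logic:

```python
import re

def smart_chunk(text, max_tokens=512, overlap=2):
    """Sentence-based chunking carrying the last `overlap` sentences forward."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())  # naive sentence split
    chunks, current, length = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if length + n > max_tokens and current:
            chunks.append(' '.join(current))
            current = current[-overlap:] + [sentence]  # carry context forward
            length = sum(len(s.split()) for s in current)
        else:
            current.append(sentence)
            length += n
    if current:
        chunks.append(' '.join(current))
    return chunks

# Ten 5-word sentences with a tiny budget forces visible overlap
text = ' '.join(f"This is sentence number {i}." for i in range(10))
out = smart_chunk(text, max_tokens=12)
print(out[0])  # This is sentence number 0. This is sentence number 1.
```

Note that because the overlap is carried forward, chunks can slightly exceed `max_tokens`; for retrieval that is usually acceptable, since the goal is that no sentence loses its neighbors at a chunk boundary.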
Add Hybrid Search
```python
# Combine vector similarity with keyword matching
from rank_bm25 import BM25Okapi

class HybridRAG(LocalRAG):
    def __init__(self, pdf_path):
        super().__init__(pdf_path)
        # Build BM25 index for keyword search
        # (assumes LocalRAG is refactored to keep self.texts after ingestion)
        self.bm25 = BM25Okapi([doc.split() for doc in self.texts])

    def ask(self, question):
        # Vector search (semantic) — _vector_search would be factored
        # out of LocalRAG.ask(), along with a _merge_results helper
        vector_results = self._vector_search(question, n=5)
        # Keyword search (exact terms)
        keyword_scores = self.bm25.get_scores(question.split())
        keyword_results = sorted(enumerate(keyword_scores), key=lambda x: x[1], reverse=True)[:5]
        # Merge and re-rank
        combined = self._merge_results(vector_results, keyword_results)
        # ... rest of answer generation
```
Why hybrid: Catches both conceptual matches ("troubleshooting" finds "problem solving") and exact terms ("error code E404").
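The `_merge_results` step left open above is commonly implemented as reciprocal rank fusion (RRF). This is one hedged sketch, not the only option; it needs only the two ranked ID lists, no score normalization:

```python
def rrf_merge(vector_ids, keyword_ids, k=60, top_n=3):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1/(k + rank)."""
    scores = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# "c2" ranks well in both lists, so it wins overall
merged = rrf_merge(["c2", "c7", "c1"], ["c5", "c2", "c9"])
print(merged[0])  # c2
```

RRF's appeal here is that BM25 scores and cosine similarities live on incompatible scales; fusing by rank sidesteps the problem entirely, and `k=60` is the conventional damping constant.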
What You Learned
- Local RAG needs 3 components: PDF parser, embeddings, LLM
- Paragraph-based chunking beats fixed-size for technical docs
- ChromaDB is zero-config and fast enough for <100K documents
- Llama 3.1 8B gives GPT-3.5 quality responses locally
Limitations:
- 8B models struggle with complex reasoning (use 70B if you have VRAM)
- English-only (multilingual needs different embedding model)
- No OCR (scanned PDFs need pytesseract preprocessing)
When NOT to use this:
- Need GPT-4 level reasoning → Use cloud API with encrypted uploads
- Have >1M documents → Consider Qdrant or Weaviate
- Need sub-second responses → Add GPU acceleration
Cost Comparison
Local RAG (this solution):
- Setup: $0 (open source)
- Per query: $0
- 1000 queries/month: $0
Cloud RAG (OpenAI + Pinecone):
- Setup: $0
- Per query: ~$0.002 (embeddings) + $0.02 (GPT-4) = $0.022
- 1000 queries/month: $22
Break-even: Immediate for any consistent usage. Local also keeps your data private.
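The comparison above is simple enough to script; the per-query figures are the estimates from the table, not measured prices:

```python
# Monthly cost at the estimated cloud rates above
embed_cost, llm_cost = 0.002, 0.02  # $ per query (estimates)
queries_per_month = 1000

cloud_monthly = queries_per_month * (embed_cost + llm_cost)
print(f"Cloud: ${cloud_monthly:.2f}/mo, local: $0.00/mo")  # Cloud: $22.00/mo
```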
Troubleshooting
"CUDA out of memory" with Ollama:
```shell
# Use CPU mode (slower but stable)
OLLAMA_NUM_GPU=0 ollama run llama3.1:8b
```
ChromaDB "collection already exists":
```python
# In __init__, reuse the collection instead of failing on re-runs
# (delete_collection itself errors if "docs" doesn't exist yet)
self.collection = self.client.get_or_create_collection("docs")
```
PDF text extraction fails:
```python
# Fallback for scanned PDFs (needs the tesseract binary,
# plus: pip install pytesseract pillow)
from PIL import Image
import pytesseract

text = page.get_text()
if not text.strip():
    # No text layer — get_text() returns empty rather than raising,
    # so render the page to an image and OCR it
    pix = page.get_pixmap()
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    text = pytesseract.image_to_string(img)
```
Tested on Python 3.11, Ollama 0.1.23, Ubuntu 22.04 & macOS 14 (Apple Silicon)
Hardware requirements:
- CPU: 4+ cores recommended
- RAM: 8GB minimum (16GB comfortable)
- Disk: 10GB for models + index
- GPU: Optional (3x faster with CUDA)