Problem: Your AI Hallucinates Because It Lacks Context
You're building an AI chatbot, but GPT-4 keeps making up answers about your company's documentation, products, or internal data, none of which it was ever trained on.
You'll learn:
- How RAG retrieves relevant context before generating answers
- Setting up Pinecone for vector storage and semantic search
- Integrating OpenAI embeddings with real-time retrieval
Time: 45 min | Level: Intermediate
Why This Happens
LLMs like GPT-4 only know what they were trained on (data up to their cutoff date). When asked about your specific documentation, they guess instead of admitting ignorance.
Common symptoms:
- AI confidently states wrong information about your products
- Responses ignore recent documentation updates
- No way to cite sources for answers
The fix: RAG retrieves relevant chunks from your knowledge base before the LLM generates a response, grounding answers in real data.
Solution
Step 1: Install Dependencies
pip install pinecone-client openai tiktoken python-dotenv
Expected: All packages install without errors
If it fails:
- Error: "externally-managed-environment": on Python 3.11+ with a distro-managed Python, create a virtual environment first (python -m venv .venv && source .venv/bin/activate) and rerun the install. Adding --break-system-packages also works, but it modifies system-wide packages.
Step 2: Set Up Pinecone Index
Create a free account at pinecone.io and get your API key.
# setup_pinecone.py
import os
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv

load_dotenv()
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create index optimized for OpenAI embeddings
index_name = "knowledge-base"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # OpenAI text-embedding-3-small dimension
        metric="cosine",  # Best for semantic similarity
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
print(f"Index '{index_name}' ready")
Why cosine metric: Measures angle between vectors, perfect for semantic similarity regardless of text length.
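Since cosine compares direction rather than magnitude, a short sentence and a long paragraph about the same topic can score near 1.0. A quick self-contained illustration (plain Python, no libraries, vectors made up for the demo):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

short = [1.0, 2.0, 3.0]
long = [2.0, 4.0, 6.0]  # Same direction, twice the magnitude
print(cosine_similarity(short, long))              # ≈ 1.0: same "meaning" despite different length
print(cosine_similarity(short, [3.0, -1.0, 0.5]))  # Much lower: different direction
```

Euclidean distance, by contrast, would separate those first two vectors just because one is longer.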
Run it:
python setup_pinecone.py
Expected: "Index 'knowledge-base' ready" (takes ~60 seconds first time)
Step 3: Chunk and Embed Your Documents
# ingest_docs.py
import os
from openai import OpenAI
from pinecone import Pinecone
import tiktoken
from dotenv import load_dotenv

load_dotenv()  # Load OPENAI_API_KEY and PINECONE_API_KEY from .env

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("knowledge-base")
def chunk_text(text, max_tokens=512):
    """Split text into chunks that fit embedding limits"""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk_tokens))
    return chunks
def embed_and_store(doc_id, text, metadata=None):
    """Chunk document, create embeddings, store in Pinecone"""
    metadata = metadata or {}  # Avoid a shared mutable default argument
    chunks = chunk_text(text)
    vectors = []
    for i, chunk in enumerate(chunks):
        # Generate embedding
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk
        )
        embedding = response.data[0].embedding
        # Prepare vector with metadata
        vectors.append({
            "id": f"{doc_id}_chunk_{i}",
            "values": embedding,
            "metadata": {
                "text": chunk,
                "doc_id": doc_id,
                "chunk_index": i,
                **metadata
            }
        })
    # Batch upsert for efficiency
    index.upsert(vectors=vectors)
    print(f"Stored {len(chunks)} chunks for {doc_id}")
# Example: Ingest your documentation
docs = {
    "product_guide": """
    Our API supports both REST and GraphQL endpoints.
    Authentication uses Bearer tokens with 24-hour expiry.
    Rate limits are 1000 requests per hour for free tier.
    """,
    "troubleshooting": """
    If you get 429 errors, you've hit rate limits.
    Wait 60 seconds or upgrade to Pro for 10,000 req/hour.
    For 401 errors, regenerate your API token in settings.
    """
}

for doc_id, content in docs.items():
    embed_and_store(
        doc_id=doc_id,
        text=content,
        metadata={"source": "docs", "category": doc_id}
    )
Why 512 tokens: Balances context quality with embedding speed. Smaller chunks = more precise retrieval.
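One refinement worth knowing: the chunker above splits on hard boundaries, so a sentence straddling two chunks is cut in half. Production pipelines usually overlap consecutive chunks so boundary sentences survive intact in at least one chunk. A minimal sketch (the function name and overlap size are illustrative; it works on any token list):

```python
def chunk_with_overlap(tokens, max_tokens=512, overlap=64):
    """Fixed-size chunks where each chunk repeats the last `overlap`
    tokens of the previous one, so boundary sentences aren't lost."""
    step = max_tokens - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + max_tokens])
        if i + max_tokens >= len(tokens):
            break
    return chunks

chunks = chunk_with_overlap(list(range(1000)), max_tokens=400, overlap=50)
print(len(chunks))    # 3
print(chunks[1][:3])  # [350, 351, 352] -- overlaps the tail of chunk 0
```

The cost is a few percent more tokens to embed and store, which is usually worth the retrieval quality.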
Run it:
python ingest_docs.py
Expected: "Stored X chunks for product_guide" for each document
Step 4: Build the RAG Query Function
# rag_query.py
import os
from openai import OpenAI
from pinecone import Pinecone
from dotenv import load_dotenv

load_dotenv()  # Load API keys from .env

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("knowledge-base")
def rag_query(question, top_k=3):
    """Query knowledge base and generate contextualized answer"""
    # Step 1: Embed the question
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    query_embedding = response.data[0].embedding

    # Step 2: Retrieve relevant chunks
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # Step 3: Build context from matches
    context_chunks = [
        match['metadata']['text']
        for match in results['matches']
    ]
    context = "\n\n".join(context_chunks)

    # Step 4: Generate answer with retrieved context
    prompt = f"""Answer the question based only on the following context.
If the context doesn't contain the answer, say "I don't have information about that."
Context:
{context}
Question: {question}
Answer:"""

    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Fast and cheap for RAG
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context. Always cite which part of the context you used."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3  # Lower temp = more factual, less creative
    )
    answer = completion.choices[0].message.content

    # Return answer with sources for transparency
    return {
        "answer": answer,
        "sources": [
            {
                "doc_id": match['metadata']['doc_id'],
                "chunk": match['metadata']['text'][:100] + "...",
                "score": match['score']
            }
            for match in results['matches']
        ]
    }
# Test it
if __name__ == "__main__":
    result = rag_query("What should I do if I get a 429 error?")
    print("Answer:", result['answer'])
    print("\nSources:")
    for source in result['sources']:
        print(f"- {source['doc_id']} (relevance: {source['score']:.2f})")
        print(f"  {source['chunk']}")
Why gpt-4o-mini: Roughly 60x cheaper than GPT-4 Turbo at the time of writing, and a good fit for RAG, where the relevant context is already provided.
Step 5: Test the System
python rag_query.py
Expected output:
Answer: If you get a 429 error, you've hit rate limits. Wait 60 seconds or upgrade to Pro for 10,000 requests per hour.
Sources:
- troubleshooting (relevance: 0.89)
If you get 429 errors, you've hit rate limits. Wait 60 seconds or upgrade to Pro for 10,000 req/h...
If it fails:
- Empty results: Your index needs time to initialize (wait 2 min, retry)
- Low relevance scores (<0.7): Question doesn't match your docs, add more content
- OpenAI error: Check API key has credits, models are available in your region
Verification
Test with questions your docs should and shouldn't answer:
# Should answer (in your docs)
result1 = rag_query("What's the rate limit for free tier?")
print(result1['answer']) # Should mention 1000 req/hour
# Shouldn't answer (not in docs)
result2 = rag_query("What's your company's stock price?")
print(result2['answer']) # Should say "I don't have information about that"
You should see: Accurate answers for documented topics, honest refusal for others.
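To make these spot checks repeatable, a tiny helper (the name is ours, not part of any library) can classify refusals by matching the phrase the prompt instructs the model to use:

```python
def looks_like_refusal(answer):
    """True if the model declined instead of answering.
    The phrase matches the instruction given in rag_query's prompt."""
    return "don't have information" in answer.lower()

# Usage against the checks above:
# assert not looks_like_refusal(result1['answer'])
# assert looks_like_refusal(result2['answer'])
print(looks_like_refusal("I don't have information about that."))  # True
```

Run it over a fixed list of in-scope and out-of-scope questions whenever you change chunking, prompts, or models.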
Production Improvements
Add Hybrid Search
Combine vector search with keyword matching for better precision:
# Requires a sparse-dense index (metric="dotproduct"; available in paid tiers)
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus)  # Your document texts

# Pinecone's query() has no alpha parameter; weight the two vectors
# yourself before querying (a convex combination)
def hybrid_scale(dense, sparse, alpha=0.5):
    """alpha=1.0 -> pure semantic, alpha=0.0 -> pure keyword"""
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse

dense, sparse = hybrid_scale(dense_embedding, bm25.encode_queries(question))
results = index.query(
    vector=dense,
    sparse_vector=sparse,
    top_k=5,
    include_metadata=True
)
Add Reranking
Use a cross-encoder to reorder results after retrieval:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# After retrieving top 10 chunks
scores = reranker.predict([(question, chunk) for chunk in chunks])
reranked_chunks = [chunks[i] for i in scores.argsort()[::-1][:3]]
Why rerank: Vector search is fast but imprecise. Reranking the top 10 with a more expensive model catches nuances.
Monitor Performance
import time
start = time.time()
result = rag_query(question)
latency = time.time() - start
# Log for analysis
print(f"Query latency: {latency:.2f}s")
print(f"Avg relevance: {sum(s['score'] for s in result['sources'])/len(result['sources']):.2f}")
Target metrics:
- Latency: <2 seconds end-to-end
- Relevance: >0.75 average score
- Coverage: >80% of questions get answered (not "I don't know")
What You Learned
- RAG solves hallucination by retrieving real data before generation
- Chunking text properly (512 tokens) balances context and precision
- Pinecone's cosine similarity finds semantically similar content, not just keywords
- Lower temperature (0.3) keeps LLM responses factual
Limitations:
- Quality depends on your source documents (garbage in, garbage out)
- Embedding costs scale with document count (~$0.02 per 1M tokens for text-embedding-3-small)
- Pinecone free tier limits to 1 index, 100K vectors
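A back-of-envelope estimate helps budget before embedding a large corpus. This sketch assumes text-embedding-3-small pricing (~$0.02 per 1M tokens at the time of writing; check OpenAI's current price list before relying on it):

```python
def ingest_cost_usd(total_tokens, price_per_million=0.02):
    """One-time embedding cost estimate.
    price_per_million assumes text-embedding-3-small pricing;
    verify against OpenAI's current pricing page."""
    return total_tokens / 1_000_000 * price_per_million

# 5,000 docs averaging 2,000 tokens each:
print(f"${ingest_cost_usd(5_000 * 2_000):.2f}")  # $0.20
```

Re-ingesting after chunking changes incurs the same cost again, so settle chunk size before embedding the full corpus.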
Tested on Python 3.11, OpenAI API v1.12.0, Pinecone v3.0.0, macOS & Ubuntu 24.04