Problem: Your AI Hallucinates Because It Lacks Context
You're building an AI chatbot, but GPT-4 keeps making up answers about your company's documentation, products, or internal data, none of which it was ever trained on.
You'll learn:
- How RAG retrieves relevant context before generating answers
- Setting up Pinecone for vector storage and semantic search
- Integrating OpenAI embeddings with real-time retrieval
Time: 45 min | Level: Intermediate
Why This Happens
LLMs like GPT-4 only know what they were trained on (data up to their cutoff date). When asked about your specific documentation, they guess instead of admitting ignorance.
Common symptoms:
- AI confidently states wrong information about your products
- Responses ignore recent documentation updates
- No way to cite sources for answers
The fix: RAG retrieves relevant chunks from your knowledge base before the LLM generates a response, grounding answers in real data.
Solution
Step 1: Install Dependencies
pip install pinecone-client openai tiktoken python-dotenv
Expected: All packages install without errors
If it fails:
- Error: "externally-managed-environment": on Python 3.11+ with a distro-managed Python, create a virtual environment first (python -m venv .venv && source .venv/bin/activate) and rerun the install. Adding --break-system-packages also works, but it modifies system-wide packages.
Step 2: Set Up Pinecone Index
Create a free account at pinecone.io and get your API key.
# setup_pinecone.py
import os
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv

load_dotenv()
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create index optimized for OpenAI embeddings
index_name = "knowledge-base"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # OpenAI text-embedding-3-small dimension
        metric="cosine",  # Best for semantic similarity
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
print(f"Index '{index_name}' ready")
Why cosine metric: Measures angle between vectors, perfect for semantic similarity regardless of text length.
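Since cosine compares direction rather than magnitude, a short sentence and a long paragraph about the same topic can score near 1.0. A quick self-contained illustration (plain Python, no libraries, vectors made up for the demo):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

short = [1.0, 2.0, 3.0]
long = [2.0, 4.0, 6.0]  # Same direction, twice the magnitude
print(cosine_similarity(short, long))              # ≈ 1.0: same "meaning" despite different length
print(cosine_similarity(short, [3.0, -1.0, 0.5]))  # Much lower: different direction
```

Euclidean distance, by contrast, would separate those first two vectors just because one is longer.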
Run it:
python setup_pinecone.py
Expected: "Index 'knowledge-base' ready" (takes ~60 seconds first time)
Step 3: Chunk and Embed Your Documents
# ingest_docs.py
import os
from openai import OpenAI
from pinecone import Pinecone
import tiktoken
from dotenv import load_dotenv

load_dotenv()  # Load OPENAI_API_KEY and PINECONE_API_KEY from .env

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("knowledge-base")
def chunk_text(text, max_tokens=512):
    """Split text into chunks that fit embedding limits"""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk_tokens))
    return chunks
def embed_and_store(doc_id, text, metadata=None):
    """Chunk document, create embeddings, store in Pinecone"""
    metadata = metadata or {}  # Avoid a shared mutable default argument
    chunks = chunk_text(text)
    vectors = []
    for i, chunk in enumerate(chunks):
        # Generate embedding
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk
        )
        embedding = response.data[0].embedding
        # Prepare vector with metadata
        vectors.append({
            "id": f"{doc_id}_chunk_{i}",
            "values": embedding,
            "metadata": {
                "text": chunk,
                "doc_id": doc_id,
                "chunk_index": i,
                **metadata
            }
        })
    # Batch upsert for efficiency
    index.upsert(vectors=vectors)
    print(f"Stored {len(chunks)} chunks for {doc_id}")
# Example: Ingest your documentation
docs = {
    "product_guide": """
    Our API supports both REST and GraphQL endpoints.
    Authentication uses Bearer tokens with 24-hour expiry.
    Rate limits are 1000 requests per hour for free tier.
    """,
    "troubleshooting": """
    If you get 429 errors, you've hit rate limits.
    Wait 60 seconds or upgrade to Pro for 10,000 req/hour.
    For 401 errors, regenerate your API token in settings.
    """
}

for doc_id, content in docs.items():
    embed_and_store(
        doc_id=doc_id,
        text=content,
        metadata={"source": "docs", "category": doc_id}
    )
Why 512 tokens: Balances context quality with embedding speed. Smaller chunks = more precise retrieval.
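One refinement worth knowing: the chunker above splits on hard boundaries, so a sentence straddling two chunks is cut in half. Production pipelines usually overlap consecutive chunks so boundary sentences survive intact in at least one chunk. A minimal sketch (the function name and overlap size are illustrative; it works on any token list):

```python
def chunk_with_overlap(tokens, max_tokens=512, overlap=64):
    """Fixed-size chunks where each chunk repeats the last `overlap`
    tokens of the previous one, so boundary sentences aren't lost."""
    step = max_tokens - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + max_tokens])
        if i + max_tokens >= len(tokens):
            break
    return chunks

chunks = chunk_with_overlap(list(range(1000)), max_tokens=400, overlap=50)
print(len(chunks))    # 3
print(chunks[1][:3])  # [350, 351, 352] -- overlaps the tail of chunk 0
```

The cost is a few percent more tokens to embed and store, which is usually worth the retrieval quality.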
Run it:
python ingest_docs.py
Expected: "Stored X chunks for product_guide" for each document
Step 4: Build the RAG Query Function
# rag_query.py
import os
from openai import OpenAI
from pinecone import Pinecone
from dotenv import load_dotenv

load_dotenv()  # Load API keys from .env

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("knowledge-base")
def rag_query(question, top_k=3):
    """Query knowledge base and generate contextualized answer"""
    # Step 1: Embed the question
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    query_embedding = response.data[0].embedding

    # Step 2: Retrieve relevant chunks
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # Step 3: Build context from matches
    context_chunks = [
        match['metadata']['text']
        for match in results['matches']
    ]
    context = "\n\n".join(context_chunks)

    # Step 4: Generate answer with retrieved context
    prompt = f"""Answer the question based only on the following context.
If the context doesn't contain the answer, say "I don't have information about that."
Context:
{context}
Question: {question}
Answer:"""

    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Fast and cheap for RAG
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context. Always cite which part of the context you used."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3  # Lower temp = more factual, less creative
    )
    answer = completion.choices[0].message.content

    # Return answer with sources for transparency
    return {
        "answer": answer,
        "sources": [
            {
                "doc_id": match['metadata']['doc_id'],
                "chunk": match['metadata']['text'][:100] + "...",
                "score": match['score']
            }
            for match in results['matches']
        ]
    }
# Test it
if __name__ == "__main__":
    result = rag_query("What should I do if I get a 429 error?")
    print("Answer:", result['answer'])
    print("\nSources:")
    for source in result['sources']:
        print(f"- {source['doc_id']} (relevance: {source['score']:.2f})")
        print(f"  {source['chunk']}")
Why gpt-4o-mini: Roughly 60x cheaper than GPT-4 Turbo at the time of writing, and a good fit for RAG, where the relevant context is already provided.
Step 5: Test the System
python rag_query.py
Expected output:
Answer: If you get a 429 error, you've hit rate limits. Wait 60 seconds or upgrade to Pro for 10,000 requests per hour.
Sources:
- troubleshooting (relevance: 0.89)
If you get 429 errors, you've hit rate limits. Wait 60 seconds or upgrade to Pro for 10,000 req/h...
If it fails:
- Empty results: Your index needs time to initialize (wait 2 min, retry)
- Low relevance scores (<0.7): Question doesn't match your docs, add more content
- OpenAI error: Check API key has credits, models are available in your region
Verification
Test with questions your docs should and shouldn't answer:
# Should answer (in your docs)
result1 = rag_query("What's the rate limit for free tier?")
print(result1['answer']) # Should mention 1000 req/hour
# Shouldn't answer (not in docs)
result2 = rag_query("What's your company's stock price?")
print(result2['answer']) # Should say "I don't have information about that"
You should see: Accurate answers for documented topics, honest refusal for others.
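To make these spot checks repeatable, a tiny helper (the name is ours, not part of any library) can classify refusals by matching the phrase the prompt instructs the model to use:

```python
def looks_like_refusal(answer):
    """True if the model declined instead of answering.
    The phrase matches the instruction given in rag_query's prompt."""
    return "don't have information" in answer.lower()

# Usage against the checks above:
# assert not looks_like_refusal(result1['answer'])
# assert looks_like_refusal(result2['answer'])
print(looks_like_refusal("I don't have information about that."))  # True
```

Run it over a fixed list of in-scope and out-of-scope questions whenever you change chunking, prompts, or models.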
Production Improvements
Add Hybrid Search
Combine vector search with keyword matching for better precision:
# Requires a sparse-dense index (metric="dotproduct"; available in paid tiers)
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus)  # Your document texts

# Pinecone's query() has no alpha parameter; weight the two vectors
# yourself before querying (a convex combination)
def hybrid_scale(dense, sparse, alpha=0.5):
    """alpha=1.0 -> pure semantic, alpha=0.0 -> pure keyword"""
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse

dense, sparse = hybrid_scale(dense_embedding, bm25.encode_queries(question))
results = index.query(
    vector=dense,
    sparse_vector=sparse,
    top_k=5,
    include_metadata=True
)
Add Reranking
Use a cross-encoder to reorder results after retrieval:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# After retrieving top 10 chunks
scores = reranker.predict([(question, chunk) for chunk in chunks])
reranked_chunks = [chunks[i] for i in scores.argsort()[::-1][:3]]
Why rerank: Vector search is fast but imprecise. Reranking the top 10 with a more expensive model catches nuances.
Monitor Performance
import time
start = time.time()
result = rag_query(question)
latency = time.time() - start
# Log for analysis
print(f"Query latency: {latency:.2f}s")
print(f"Avg relevance: {sum(s['score'] for s in result['sources'])/len(result['sources']):.2f}")
Target metrics:
- Latency: <2 seconds end-to-end
- Relevance: >0.75 average score
- Coverage: >80% of questions get answered (not "I don't know")
What You Learned
- RAG solves hallucination by retrieving real data before generation
- Chunking text properly (512 tokens) balances context and precision
- Pinecone's cosine similarity finds semantically similar content, not just keywords
- Lower temperature (0.3) keeps LLM responses factual
Limitations:
- Quality depends on your source documents (garbage in, garbage out)
- Embedding costs scale with document count (~$0.02 per 1M tokens for text-embedding-3-small)
- Pinecone free tier limits to 1 index, 100K vectors
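A back-of-envelope estimate helps budget before embedding a large corpus. This sketch assumes text-embedding-3-small pricing (~$0.02 per 1M tokens at the time of writing; check OpenAI's current price list before relying on it):

```python
def ingest_cost_usd(total_tokens, price_per_million=0.02):
    """One-time embedding cost estimate.
    price_per_million assumes text-embedding-3-small pricing;
    verify against OpenAI's current pricing page."""
    return total_tokens / 1_000_000 * price_per_million

# 5,000 docs averaging 2,000 tokens each:
print(f"${ingest_cost_usd(5_000 * 2_000):.2f}")  # $0.20
```

Re-ingesting after chunking changes incurs the same cost again, so settle chunk size before embedding the full corpus.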
Tested on Python 3.11, OpenAI API v1.12.0, Pinecone v3.0.0, macOS & Ubuntu 24.04