Build a RAG System with Python and Pinecone in 45 Minutes

Create a production-ready retrieval-augmented generation system using Python, OpenAI, and Pinecone for accurate, context-aware AI responses.

Problem: Your AI Hallucinates Because It Lacks Context

You're building an AI chatbot, but GPT-4 keeps making up answers about your company's documentation, products, and internal data, none of which it was ever trained on.

You'll learn:

  • How RAG retrieves relevant context before generating answers
  • Setting up Pinecone for vector storage and semantic search
  • Integrating OpenAI embeddings with real-time retrieval

Time: 45 min | Level: Intermediate


Why This Happens

LLMs like GPT-4 only know what they were trained on (data up to their cutoff date). When asked about your specific documentation, they guess instead of admitting ignorance.

Common symptoms:

  • AI confidently states wrong information about your products
  • Responses ignore recent documentation updates
  • No way to cite sources for answers

The fix: RAG retrieves relevant chunks from your knowledge base before the LLM generates a response, grounding answers in real data.
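The flow is easy to sketch end to end. The toy example below substitutes naive word-overlap scoring for the embedding search built later in this guide; `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are illustrative names, not part of any library:

```python
# Toy illustration of the RAG flow: retrieve relevant text first,
# then hand it to the generator as context. "Retrieval" here is a
# naive word-overlap score standing in for real embedding search.

KNOWLEDGE_BASE = [
    "Rate limits are 1000 requests per hour for free tier.",
    "Authentication uses Bearer tokens with 24-hour expiry.",
]

def retrieve(question, top_k=1):
    """Rank chunks by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = [
        (len(q_words & set(chunk.lower().split())), chunk)
        for chunk in KNOWLEDGE_BASE
    ]
    scored.sort(reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def build_prompt(question):
    """Ground the LLM by prepending retrieved chunks as context."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the rate limit for the free tier?"))
```

The rest of this guide replaces the overlap scorer with OpenAI embeddings and a Pinecone index, but the three stages (embed/score, retrieve, generate with context) stay the same.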


Solution

Step 1: Install Dependencies

pip install pinecone-client openai tiktoken python-dotenv

Expected: All packages install without errors

If it fails:

  • Error "externally-managed-environment": your system Python is PEP 668-managed (common on Debian/Ubuntu with Python 3.11+). The safest fix is a virtual environment (python3 -m venv .venv && source .venv/bin/activate), then rerun pip install; adding --break-system-packages also works, but it installs into the system Python.

Step 2: Set Up Pinecone Index

Create a free account at pinecone.io and get your API key.

# setup_pinecone.py
import os
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv

load_dotenv()

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create index optimized for OpenAI embeddings
index_name = "knowledge-base"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # OpenAI text-embedding-3-small dimension
        metric="cosine",  # Best for semantic similarity
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

print(f"Index '{index_name}' ready")

Why cosine metric: Measures angle between vectors, perfect for semantic similarity regardless of text length.
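To see concretely why cosine ignores magnitude, here is a quick pure-Python check with toy 3-dimensional vectors (real embeddings have 1536 dimensions, but the math is identical):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

# Same direction -> similarity 1.0, even though b is "longer";
# this is why a short query can still match a long document chunk.
print(cosine(a, b))  # 1.0
```

Euclidean distance would treat `a` and `b` as far apart; cosine correctly treats them as identical in meaning.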

Run it:

python setup_pinecone.py

Expected: "Index 'knowledge-base' ready" (takes ~60 seconds first time)


Step 3: Chunk and Embed Your Documents

# ingest_docs.py
import os
from openai import OpenAI
from pinecone import Pinecone
from dotenv import load_dotenv
import tiktoken

load_dotenv()  # read API keys from .env, as in setup_pinecone.py

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("knowledge-base")

def chunk_text(text, max_tokens=512):
    """Split text into chunks that fit embedding limits"""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk_tokens))
    
    return chunks

def embed_and_store(doc_id, text, metadata=None):
    """Chunk document, create embeddings, store in Pinecone"""
    metadata = metadata or {}  # avoid Python's shared mutable default argument
    chunks = chunk_text(text)
    
    vectors = []
    for i, chunk in enumerate(chunks):
        # Generate embedding
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk
        )
        embedding = response.data[0].embedding
        
        # Prepare vector with metadata
        vectors.append({
            "id": f"{doc_id}_chunk_{i}",
            "values": embedding,
            "metadata": {
                "text": chunk,
                "doc_id": doc_id,
                "chunk_index": i,
                **metadata
            }
        })
    
    # Batch upsert for efficiency
    index.upsert(vectors=vectors)
    print(f"Stored {len(chunks)} chunks for {doc_id}")

# Example: Ingest your documentation
docs = {
    "product_guide": """
    Our API supports both REST and GraphQL endpoints.
    Authentication uses Bearer tokens with 24-hour expiry.
    Rate limits are 1000 requests per hour for free tier.
    """,
    "troubleshooting": """
    If you get 429 errors, you've hit rate limits.
    Wait 60 seconds or upgrade to Pro for 10,000 req/hour.
    For 401 errors, regenerate your API token in settings.
    """
}

for doc_id, content in docs.items():
    embed_and_store(
        doc_id=doc_id,
        text=content,
        metadata={"source": "docs", "category": doc_id}
    )

Why 512 tokens: Balances context quality with embedding speed. Smaller chunks = more precise retrieval.
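One refinement worth knowing about: with hard 512-token cuts, a sentence that straddles a boundary is split across two chunks and may match neither at query time. Overlapping windows fix this. The sketch below splits on words rather than tokens for brevity; `chunk_with_overlap` is an illustrative helper, not a library function:

```python
def chunk_with_overlap(words, size=120, overlap=20):
    """Sliding-window chunking: consecutive chunks share `overlap`
    words, so text cut at a boundary appears whole in one of them."""
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, size=4, overlap=2):
    print(chunk)
# one two three four
# three four five six
# five six seven eight
# seven eight nine ten
```

The same windowing applies to the token list from tiktoken; a 10-20% overlap is a common starting point.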

Run it:

python ingest_docs.py

Expected: "Stored X chunks for product_guide" for each document


Step 4: Build the RAG Query Function

# rag_query.py
import os
from openai import OpenAI
from pinecone import Pinecone
from dotenv import load_dotenv

load_dotenv()  # read API keys from .env, as in the earlier scripts

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("knowledge-base")

def rag_query(question, top_k=3):
    """Query knowledge base and generate contextualized answer"""
    
    # Step 1: Embed the question
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    query_embedding = response.data[0].embedding
    
    # Step 2: Retrieve relevant chunks
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    
    # Step 3: Build context from matches
    context_chunks = [
        match['metadata']['text'] 
        for match in results['matches']
    ]
    context = "\n\n".join(context_chunks)
    
    # Step 4: Generate answer with retrieved context
    prompt = f"""Answer the question based only on the following context.
If the context doesn't contain the answer, say "I don't have information about that."

Context:
{context}

Question: {question}

Answer:"""
    
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Fast and cheap for RAG
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context. Always cite which part of the context you used."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3  # Lower temp = more factual, less creative
    )
    
    answer = completion.choices[0].message.content
    
    # Return answer with sources for transparency
    return {
        "answer": answer,
        "sources": [
            {
                "doc_id": match['metadata']['doc_id'],
                "chunk": match['metadata']['text'][:100] + "...",
                "score": match['score']
            }
            for match in results['matches']
        ]
    }

# Test it
if __name__ == "__main__":
    result = rag_query("What should I do if I get a 429 error?")
    
    print("Answer:", result['answer'])
    print("\nSources:")
    for source in result['sources']:
        print(f"- {source['doc_id']} (relevance: {source['score']:.2f})")
        print(f"  {source['chunk']}")

Why gpt-4o-mini: 60x cheaper than GPT-4 Turbo, perfect for RAG where context is already provided.


Step 5: Test the System

python rag_query.py

Expected output:

Answer: If you get a 429 error, you've hit rate limits. Wait 60 seconds or upgrade to Pro for 10,000 requests per hour.

Sources:
- troubleshooting (relevance: 0.89)
  If you get 429 errors, you've hit rate limits. Wait 60 seconds or upgrade to Pro for 10,000 req/h...

If it fails:

  • Empty results: Your index needs time to initialize (wait 2 min, retry)
  • Low relevance scores (<0.7): Question doesn't match your docs, add more content
  • OpenAI error: Check API key has credits, models are available in your region

Verification

Test with questions your docs should and shouldn't answer:

# Should answer (in your docs)
result1 = rag_query("What's the rate limit for free tier?")
print(result1['answer'])  # Should mention 1000 req/hour

# Shouldn't answer (not in docs)
result2 = rag_query("What's your company's stock price?")
print(result2['answer'])  # Should say "I don't have information about that"

You should see: Accurate answers for documented topics, honest refusal for others.


Production Improvements

Add Hybrid Search

Combine vector search with keyword matching for better precision:

# Requires a sparse-dense (hybrid) index, which must be created with the
# dotproduct metric (the cosine index from Step 2 won't accept sparse vectors)
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus)  # corpus = list of your document texts

# Pinecone's query API has no alpha parameter; weight semantic vs. keyword
# matching client-side by scaling the two vectors before querying
alpha = 0.5  # 1.0 = pure semantic, 0.0 = pure keyword
sparse = bm25.encode_queries(question)
results = index.query(
    vector=[v * alpha for v in dense_embedding],
    sparse_vector={
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    },
    top_k=5,
    include_metadata=True
)

Add Reranking

Use a cross-encoder to reorder results after retrieval:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# chunks = the texts of the top-10 matches from index.query above
scores = reranker.predict([(question, chunk) for chunk in chunks])
reranked_chunks = [chunks[i] for i in scores.argsort()[::-1][:3]]  # keep best 3

Why rerank: Vector search is fast but imprecise. Reranking the top 10 with a more expensive model catches nuances.

Monitor Performance

import time

start = time.time()
result = rag_query(question)
latency = time.time() - start

# Log for analysis
print(f"Query latency: {latency:.2f}s")
print(f"Avg relevance: {sum(s['score'] for s in result['sources'])/len(result['sources']):.2f}")

Target metrics:

  • Latency: <2 seconds end-to-end
  • Relevance: >0.75 average score
  • Coverage: >80% of questions get answered (not "I don't know")
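These targets can be checked from a simple log of query records. The record shape below is hypothetical (latency and answered flags you collect yourself, plus the per-match scores returned by rag_query), and `summarize` is an illustrative helper:

```python
def summarize(records):
    """Check a query log against the target metrics.
    Each record: {'latency': seconds, 'scores': [relevance...],
                  'answered': bool (False for "I don't know" replies)}"""
    n = len(records)
    avg_latency = sum(r["latency"] for r in records) / n
    avg_score = (sum(s for r in records for s in r["scores"])
                 / sum(len(r["scores"]) for r in records))
    coverage = sum(r["answered"] for r in records) / n
    return {
        "latency_ok": avg_latency < 2.0,    # <2s end-to-end
        "relevance_ok": avg_score > 0.75,   # >0.75 average score
        "coverage_ok": coverage > 0.80,     # >80% answered
    }

log = [
    {"latency": 1.2, "scores": [0.88, 0.81], "answered": True},
    {"latency": 1.6, "scores": [0.79, 0.72], "answered": True},
]
print(summarize(log))
```

Run this periodically over recent traffic; a falling relevance average usually means the knowledge base needs new content before the prompt needs tuning.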

What You Learned

  • RAG solves hallucination by retrieving real data before generation
  • Chunking text properly (512 tokens) balances context and precision
  • Pinecone's cosine similarity finds semantically similar content, not just keywords
  • Lower temperature (0.3) keeps LLM responses factual

Limitations:

  • Quality depends on your source documents (garbage in, garbage out)
  • Embedding costs scale with document count (text-embedding-3-small runs about $0.02 per 1M tokens at the time of writing)
  • Pinecone free tier limits to 1 index, 100K vectors

Tested on Python 3.11, OpenAI API v1.12.0, Pinecone v3.0.0, macOS & Ubuntu 24.04