Build RAG with Tables: Extract Data from PDFs and Excel (2026)

Build RAG with tables from PDFs and Excel using Python, LangChain, and Chroma: parse structured data with camelot, pdfplumber, and openpyxl, then embed table chunks for accurate retrieval.

Problem: RAG Fails Silently on Tables

RAG with tables from PDFs and Excel is one of the most common pain points in production retrieval pipelines. Standard text splitters shred table rows across chunks — your LLM gets fragmented columns, misaligned headers, and numerical noise instead of structured data.

If you've ever asked a RAG system "What was Q3 revenue?" and got a hallucinated number, a broken table was likely the root cause.

You'll learn:

  • Extract tables from PDFs using camelot and pdfplumber with the right parser per file type
  • Parse Excel sheets into structured text chunks with openpyxl
  • Embed table-aware chunks into Chroma with LangChain for accurate retrieval
  • Verify end-to-end with a real query against financial table data

Time: 25 min | Difficulty: Intermediate


Why Table Extraction Breaks Standard RAG

Standard RAG pipelines use PyPDFLoader or similar text extractors. These tools read PDFs left-to-right, top-to-bottom — which destroys table structure. A three-column financial table becomes a stream of numbers with no relationship between cells and headers.

Symptoms:

  • LLM returns wrong numbers from financial reports
  • "I don't know" answers on questions that reference table data
  • Partial rows returned — headers and values separated across chunks
  • Excel .xlsx files parsed as flat text with no column context

The fix is to treat tables as a separate document type, extract them structurally, then serialize each table into a labeled text block before chunking.
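The difference is easy to see in a minimal pure-Python sketch (the three-column table here is made up for illustration): flat extraction loses column boundaries, while a delimiter-preserving serialization keeps each value bound to its header.

```python
# Hypothetical three-column table as nested lists.
rows = [
    ["Quarter", "Revenue USD", "YoY Growth"],
    ["Q1 2025", "4200000", "12%"],
    ["Q2 2025", "4850000", "15%"],
]

# What a flat text extractor produces: one undifferentiated token stream.
flat = " ".join(cell for row in rows for cell in row)
print(flat)
# Quarter Revenue USD YoY Growth Q1 2025 4200000 12% Q2 2025 4850000 15%

# Structure-preserving serialization: each row keeps its column alignment.
structured = "\n".join(" | ".join(row) for row in rows)
print(structured)
```

In the flat stream, nothing ties "4200000" back to "Revenue USD"; in the serialized form, the relationship survives chunking.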

[Diagram] RAG table extraction pipeline: PDF/Excel → Table Parser → Serializer → Chunker → Chroma Vector DB → LLM. End-to-end flow: each table becomes a self-contained labeled text block before embedding.


Environment Setup

Python 3.12 + uv is the fastest path. These packages cover both PDF and Excel extraction.

# Create project with uv
uv init rag-tables && cd rag-tables
uv add langchain langchain-community langchain-openai \
       chromadb camelot-py[cv] pdfplumber openpyxl \
       pandas tiktoken python-dotenv

camelot-py[cv] pulls in OpenCV for lattice-mode table detection — required for bordered tables. pdfplumber handles stream-mode (borderless) tables more reliably.

Create .env:

OPENAI_API_KEY=sk-...

Solution

Step 1: Extract Tables from PDFs

PDFs have two table types. Lattice tables have visible grid lines; stream tables use whitespace alignment. Use the right parser for each.

# table_extractor.py
import camelot
import pdfplumber
import pandas as pd
from pathlib import Path


def extract_tables_camelot(pdf_path: str) -> list[pd.DataFrame]:
    """
    Lattice mode: bordered tables with visible grid lines.
    accuracy > 90 filters out partial/broken extractions.
    """
    tables = camelot.read_pdf(
        pdf_path,
        flavor="lattice",   # use "stream" for borderless tables
        pages="all",
    )
    return [t.df for t in tables if t.accuracy > 90]


def extract_tables_pdfplumber(pdf_path: str) -> list[pd.DataFrame]:
    """
    Stream mode: whitespace-aligned tables, scanned-adjacent PDFs.
    Falls back to pdfplumber when camelot accuracy is low.
    """
    dfs = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                df = pd.DataFrame(table[1:], columns=table[0])
                dfs.append(df)
    return dfs


def extract_pdf_tables(pdf_path: str) -> list[pd.DataFrame]:
    """Try camelot first; fall back to pdfplumber."""
    try:
        tables = extract_tables_camelot(pdf_path)
        if tables:
            return tables
    except Exception:
        pass
    return extract_tables_pdfplumber(pdf_path)

Expected output: A list of Pandas DataFrames, one per detected table.

If it fails:

  • ImportError: cv2 → uv add opencv-python-headless
  • PDFInfoNotInstalledError → brew install poppler (macOS) or apt-get install poppler-utils (Linux)
  • Empty list from camelot → switch to flavor="stream" for borderless tables

Step 2: Extract Tables from Excel

Excel tables need column headers preserved and sheet names included as context — LLMs need that metadata to answer questions like "from the Sales sheet, what was Jan revenue?"

# excel_extractor.py
import openpyxl
import pandas as pd


def extract_excel_tables(xlsx_path: str) -> list[dict]:
    """
    Returns list of {sheet_name, table_name, df} dicts.
    Uses openpyxl named tables first; falls back to full sheet data.
    """
    wb = openpyxl.load_workbook(xlsx_path, data_only=True)  # data_only=True reads computed values, not formulas
    results = []

    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]

        if ws.tables:
            # Named Excel tables — best structure preservation
            for table_name, table in ws.tables.items():
                data = ws[table.ref]
                rows = [[cell.value for cell in row] for row in data]
                df = pd.DataFrame(rows[1:], columns=rows[0])
                results.append({
                    "sheet": sheet_name,
                    "table": table_name,
                    "df": df,
                })
        else:
            # Full sheet fallback
            df = pd.read_excel(xlsx_path, sheet_name=sheet_name, engine="openpyxl")
            results.append({
                "sheet": sheet_name,
                "table": sheet_name,
                "df": df,
            })

    return results

If it fails:

  • openpyxl.utils.exceptions.InvalidFileException → File is .xls, not .xlsx. Read it with pandas.read_excel(..., engine="xlrd") instead

Step 3: Serialize Tables into Retrieval-Friendly Text

Raw DataFrames can't go into a vector store. You need to serialize each table into a self-contained text block that includes the source, sheet/page context, column headers, and row data. This gives the LLM full context in a single retrieved chunk.

# serializer.py
import pandas as pd


def serialize_table(
    df: pd.DataFrame,
    source: str,
    table_label: str,
    max_rows: int = 50,
) -> str:
    """
    Convert a DataFrame into a labeled text block for embedding.
    max_rows=50 keeps chunks under ~512 tokens for most financial tables.
    """
    df = df.fillna("").astype(str)

    # Truncate large tables — embed summary row count for LLM context
    truncated = len(df) > max_rows
    display_df = df.head(max_rows)

    header = f"Source: {source}\nTable: {table_label}\n"
    col_line = "Columns: " + " | ".join(display_df.columns.tolist()) + "\n"
    rows = "\n".join(
        " | ".join(row) for row in display_df.itertuples(index=False, name=None)
    )
    footer = f"\n[Truncated: showing {max_rows} of {len(df)} rows]" if truncated else ""

    return header + col_line + rows + footer

Example output for a revenue table:

Source: annual_report_2025.pdf
Table: Page 4 Table 1
Columns: Quarter | Revenue USD | YoY Growth
Q1 2025 | 4200000 | 12%
Q2 2025 | 4850000 | 15%
Q3 2025 | 5100000 | 18%
Q4 2025 | 6200000 | 21%

This block is one chunk — headers and values stay together.
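To sanity-check that a serialized block stays inside your chunk budget before embedding, a rough character-based estimate is enough (the ~4 characters per token heuristic is an approximation for English text; for exact counts use tiktoken, already in the dependency list). estimate_tokens is a hypothetical helper, not part of the pipeline above:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)


block = (
    "Source: annual_report_2025.pdf\n"
    "Table: Page 4 Table 1\n"
    "Columns: Quarter | Revenue USD | YoY Growth\n"
    "Q1 2025 | 4200000 | 12%\n"
    "Q2 2025 | 4850000 | 15%\n"
)
print(estimate_tokens(block))  # well under a 512-token budget
```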


Step 4: Embed and Index into Chroma

Now wrap everything into LangChain Document objects and push to Chroma.

# indexer.py
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from table_extractor import extract_pdf_tables
from excel_extractor import extract_excel_tables
from serializer import serialize_table
from pathlib import Path
import os
from dotenv import load_dotenv

load_dotenv()


def build_table_index(
    pdf_paths: list[str],
    xlsx_paths: list[str],
    persist_dir: str = "./chroma_tables",
) -> Chroma:
    docs = []

    # PDF tables
    for pdf_path in pdf_paths:
        tables = extract_pdf_tables(pdf_path)
        for i, df in enumerate(tables):
            text = serialize_table(df, source=Path(pdf_path).name, table_label=f"Table {i + 1}")
            docs.append(Document(
                page_content=text,
                metadata={"source": pdf_path, "type": "pdf_table", "table_index": i},
            ))

    # Excel tables
    for xlsx_path in xlsx_paths:
        tables = extract_excel_tables(xlsx_path)
        for t in tables:
            text = serialize_table(t["df"], source=Path(xlsx_path).name, table_label=t["table"])
            docs.append(Document(
                page_content=text,
                metadata={"source": xlsx_path, "type": "excel_table", "sheet": t["sheet"]},
            ))

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # $0.02/1M tokens — cheapest option for table text
    vectorstore = Chroma.from_documents(
        documents=docs,
        embedding=embeddings,
        persist_directory=persist_dir,
    )
    return vectorstore

Step 5: Query the Table Index

Wire up a retrieval chain with a table-aware system prompt.

# query.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings


def load_index(persist_dir: str = "./chroma_tables") -> Chroma:
    return Chroma(
        persist_directory=persist_dir,
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    )


TABLE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", (
        "You are a data analyst. Answer questions using ONLY the table data provided. "
        "Quote exact values from the table. If a value is missing, say so explicitly."
    )),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])


def ask(question: str, vectorstore: Chroma) -> str:
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4},  # 4 chunks covers most multi-table questions without exceeding context
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # temperature=0 — deterministic for numerical answers

    chain = (
        {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
         "question": RunnablePassthrough()}
        | TABLE_PROMPT
        | llm
    )
    return chain.invoke(question).content


if __name__ == "__main__":
    vs = load_index()
    print(ask("What was Q3 2025 revenue?", vs))

Verification

uv run python -c "
from indexer import build_table_index
vs = build_table_index(['annual_report_2025.pdf'], ['financials.xlsx'])
print(f'Indexed {vs._collection.count()} table chunks')
"

You should see: Indexed 12 table chunks (number varies by file)

uv run python query.py

You should see: Q3 2025 revenue was $5,100,000, representing 18% YoY growth.


Comparison: Table Extraction Libraries

Library | Best for | Accuracy | Speed
camelot lattice | Bordered PDF tables | ⭐⭐⭐⭐⭐ | Slow
camelot stream | Whitespace-aligned PDF tables | ⭐⭐⭐ | Medium
pdfplumber | Scanned-adjacent, complex PDFs | ⭐⭐⭐⭐ | Fast
openpyxl | Named Excel tables | ⭐⭐⭐⭐⭐ | Fast
pandas.read_excel | Flat sheet data | ⭐⭐⭐ | Fast

Use camelot lattice for annual reports with ruled grids. Switch to pdfplumber for government PDFs, research papers, and anything that was originally a Word document exported to PDF.


What You Learned

  • Tables must be extracted structurally — never via flat text parsers — or RAG will silently return wrong numbers
  • camelot lattice mode gives the highest accuracy on bordered tables; pdfplumber is the reliable fallback
  • Serializing tables with source + label + column headers as a single chunk keeps all relational context in one retrieval unit
  • temperature=0 on the LLM is non-negotiable for numerical table queries

Tested on Python 3.12, LangChain 0.3, Chroma 0.6, camelot-py 0.11, openpyxl 3.1, Ubuntu 24.04 & macOS Sequoia


FAQ

Q: Does camelot work on scanned PDFs? A: No. Scanned PDFs need OCR pre-processing — run pytesseract or AWS Textract first, then parse the output. Camelot requires machine-readable PDF text layers.

Q: What is the difference between camelot lattice and stream mode? A: Lattice detects table borders using image processing and is highly accurate on ruled grids. Stream infers columns from whitespace gaps — works on borderless tables but requires tuning edge_tol and row_tol for complex layouts.

Q: How many rows can I embed before hitting token limits? A: text-embedding-3-small supports 8,191 tokens. A 50-row table with 5 columns averages ~400 tokens — well within limits. For tables over 200 rows, split by logical section or date range rather than arbitrary row count.
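One way to implement that split, sketched with pandas (chunk_table, chunk_by_column, and the quarter column are illustrative assumptions, not part of the pipeline above):

```python
import pandas as pd


def chunk_table(df: pd.DataFrame, max_rows: int = 50) -> list[pd.DataFrame]:
    """Split a long table into row-bounded chunks; each keeps the full header."""
    return [df.iloc[i:i + max_rows] for i in range(0, len(df), max_rows)]


def chunk_by_column(df: pd.DataFrame, column: str) -> list[pd.DataFrame]:
    """Split by a logical section instead, e.g. one chunk per quarter."""
    return [group for _, group in df.groupby(column, sort=False)]


df = pd.DataFrame({"quarter": ["Q1"] * 120 + ["Q2"] * 130, "revenue": range(250)})
print(len(chunk_table(df)))                 # 5 chunks of at most 50 rows
print(len(chunk_by_column(df, "quarter")))  # 2 chunks: Q1, Q2
```

Each resulting chunk would still go through serialize_table so headers travel with every slice.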

Q: Can this work with multimodal models instead of text extraction? A: Yes — GPT-4o and Claude 3.5 can read table images directly. This is useful for complex merged-cell tables where parsers fail. The tradeoff is cost (~10× higher per page) and latency. Use vision as a fallback, not primary path.