Chat with Your Database: Building a Safe, Accurate Text-to-SQL AI Interface

Build a production-safe natural language database interface — LLM generates SQL, validation layer blocks dangerous queries, results are summarised for non-technical users, with latency benchmarks.

Your CEO asks 'how many enterprise customers churned last quarter?' Your analyst takes 2 hours to write the query. An LLM generates it in 1.8 seconds — and a validation layer makes sure it only runs SELECT.

That's the promise of a text-to-SQL interface: instant, natural language access to your data, without handing a SQL console to a stochastic parrot that might DROP TABLE production. The reality is a messy intersection of prompt engineering, SQL validation, and Python tooling that either works beautifully or silently returns plausible, incorrect answers. With Python consistently ranking among the most-used languages (Stack Overflow 2025 survey), it's the logical battlefield for this problem. Let's build a system that's safe, accurate, and doesn't make your DBA quit.

The Four-Layer Text-to-SQL Pipeline: More Than Just a Prompt

A naive implementation is one function: prompt -> LLM -> execute. This is how you get fired. A robust system needs four distinct, testable layers, each with a Python-specific job.

  1. Schema Context Injection: Gather and format relevant database schema details.
  2. Query Generation: The LLM call itself, crafting the prompt.
  3. Safety & Validation: The gatekeeper that blocks malicious or malformed SQL.
  4. Execution & Summarization: Running the safe query and translating rows into an answer.

We'll use FastAPI (42% of new Python API projects, per the JetBrains Dev Ecosystem 2025 survey) for the web layer, SQLAlchemy for a dialect-agnostic data layer, and Pydantic v2 for ruthless data validation. Let's start with the foundation.


from sqlalchemy import MetaData, create_engine
from sqlalchemy.engine import Engine
from pydantic import BaseModel, Field
import logging

logger = logging.getLogger(__name__)

class DatabaseConfig(BaseModel):
    """Pydantic model for safe DB connection config."""
    connection_string: str = Field(..., min_length=1)
    pool_size: int = Field(default=5, ge=1)
    echo: bool = False

    model_config = {
        "frozen": True  # Immutable config helps prevent accidental mutation
    }

class SchemaContextFetcher:
    """Responsible for Layer 1: Extracting and formatting schema context."""
    def __init__(self, engine: Engine):
        self.engine = engine
        self.metadata = MetaData()

    def get_relevant_context(self, user_question: str) -> str:
        """
        Fetches schema info. A naive version gets everything.
        A production system would use a vector store or keyword matching.
        """
        # Reflect all tables (expensive, but simple for our example).
        # In reality, you'd filter based on `user_question`.
        self.metadata.reflect(bind=self.engine)
        context_parts = ["Database Schema:"]
        for table_name, table in self.metadata.tables.items():
            columns = [f"{col.name} ({str(col.type)})" for col in table.columns]
            context_parts.append(f"Table '{table_name}': {', '.join(columns)}")
        return "\n".join(context_parts)

# Example instantiation and a common error
config = DatabaseConfig(connection_string="postgresql://user:pass@localhost/db")
try:
    engine = create_engine(
        config.connection_string,
        pool_size=config.pool_size,
        echo=config.echo,
    )
    fetcher = SchemaContextFetcher(engine)
except ModuleNotFoundError as e:
    # Typical failure: ModuleNotFoundError: No module named 'psycopg2'
    # FIX: Check venv activation and install the driver in the correct env.
    # Using `uv`: `uv add psycopg2-binary` in your activated environment.
    raise

How Much Schema is Too Much? The Context Window Tax

Throwing your entire 500-table schema into every prompt is a great way to burn money and get worse results. LLM context windows are finite, and irrelevant tables introduce noise. The goal is relevant context.

A simple but effective strategy: use the information_schema to find tables with column names that lexically match keywords in the user's question. For "enterprise customers churned," you'd look for tables with columns like customer, enterprise, churn_date, status. You inject only those table definitions.

More advanced setups use a separate, smaller embedding model (think all-MiniLM-L6-v2) to vectorize table/column descriptions and perform a similarity search against the user's question. This is where Python's async/await pattern shines for concurrent I/O.
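To make the similarity idea concrete, here's a minimal, dependency-free sketch of ranking tables against the question. The `embed` function is a toy bag-of-words stand-in, and the table descriptions are invented for the demo; in production you'd swap in a real model such as `SentenceTransformer("all-MiniLM-L6-v2").encode` and a proper vector store.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_tables(question: str, table_descriptions: dict[str, str], k: int = 3) -> list[str]:
    # Rank tables by similarity between the question and each description.
    q_vec = embed(question)
    ranked = sorted(table_descriptions,
                    key=lambda t: cosine(q_vec, embed(table_descriptions[t])),
                    reverse=True)
    return ranked[:k]

descriptions = {  # invented demo schema descriptions
    "customers": "enterprise customers with churn date segment status",
    "orders": "orders placed per customer with amounts and dates",
    "invoices": "billing invoices and payment terms",
}
print(top_k_tables("which enterprise customers churned last quarter", descriptions, k=2))
# -> ['customers', 'orders']
```

The real win is the same ranking logic over dense vectors: precompute embeddings for every table description once, then each question costs a single encode plus a cheap similarity search.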

# project/context_strategy.py
import logging

from sqlalchemy import MetaData, text as sql_text
from sqlalchemy.ext.asyncio import AsyncEngine

logger = logging.getLogger(__name__)

class SmartContextFetcher:
    """A smarter fetcher that filters tables based on the question.

    Note: async reflection requires an AsyncEngine (created with
    create_async_engine and an async driver like asyncpg), not the
    sync Engine used above.
    """
    def __init__(self, engine: AsyncEngine):
        self.engine = engine
        self.metadata = MetaData()

    async def get_relevant_context_async(self, user_question: str) -> str:
        question_keywords = set(user_question.lower().split())
        relevant_tables = []
        async with self.engine.connect() as conn:
            # Query information_schema for column names matching keywords.
            # This is a simplistic lexical match. Production = vector search.
            query = sql_text("""
                SELECT DISTINCT table_name, column_name
                FROM information_schema.columns
                WHERE table_schema = 'public'
            """)
            result = await conn.execute(query)
            for table, column in result:
                if any(kw in column.lower() for kw in question_keywords):
                    relevant_tables.append(table)
            # Now reflect only the relevant tables.
            self.metadata.clear()
            for table in set(relevant_tables):
                try:
                    await conn.run_sync(
                        lambda sync_conn, t=table: self.metadata.reflect(
                            bind=sync_conn, only=[t]
                        )
                    )
                except Exception as e:
                    logger.warning(f"Could not reflect table {table}: {e}")
        # Format only the reflected (relevant) tables.
        context_parts = ["Database Schema:"]
        for table_name, table in self.metadata.tables.items():
            columns = [f"{col.name} ({col.type})" for col in table.columns]
            context_parts.append(f"Table '{table_name}': {', '.join(columns)}")
        return "\n".join(context_parts)

The Safety Layer: Your Last Line of Defense

This is non-negotiable. The validation layer must parse the generated SQL and enforce policy. It must:

  1. Block Bad Verbs: DROP, DELETE, UPDATE, INSERT, ALTER, GRANT, etc. Allow only SELECT (and maybe WITH for CTEs).
  2. Validate Syntax: Use the database's own EXPLAIN or a SQL parser (sqlglot, sqlparse) to catch gibberish before it hits production.
  3. Limit Scope: Enforce LIMIT clauses on all queries to prevent accidental billion-row returns. A MemoryError with large DataFrames is a gentle failure mode; a frozen production database is not.

# project/validator.py
import re
from sqlparse import parse

class SQLSafetyValidator:
    """Layer 3: Validates and sanitizes generated SQL."""
    def __init__(self, max_limit: int = 1000):
        self.max_limit = max_limit
        self._dangerous_pattern = re.compile(
            r'\b(DROP|DELETE|UPDATE|INSERT|ALTER|GRANT|REVOKE|TRUNCATE)\b',
            re.IGNORECASE,
        )

    def validate_and_sanitize(self, raw_sql: str) -> str:
        """Returns safe SQL or raises a ValueError."""
        # 1. Block dangerous statements
        if self._dangerous_pattern.search(raw_sql):
            raise ValueError("Query contains prohibited SQL operations (DROP, DELETE, etc.).")

        # 2. Parse and check structure (ignore empty trailing statements)
        statements = [s for s in parse(raw_sql) if s.token_first(skip_cm=True)]
        if not statements:
            raise ValueError("No valid SQL statement found.")
        if len(statements) > 1:
            raise ValueError("Only single statements are allowed.")

        # 3. Ensure it's a SELECT-like statement. Basic check on the first token.
        first_token = statements[0].token_first(skip_cm=True)
        if first_token.value.upper() not in {'SELECT', 'WITH'}:
            raise ValueError("Only SELECT and WITH (CTE) queries are permitted.")

        # 4. Inject a LIMIT if not present, or cap an existing one.
        if 'LIMIT' not in raw_sql.upper():
            sanitized = f"{raw_sql.rstrip('; ')} LIMIT {self.max_limit};"
        else:
            # Simple regex to cap the LIMIT value. For robust parsing, use sqlglot.
            sanitized = re.sub(
                r'LIMIT\s+(\d+)',
                lambda m: f'LIMIT {min(int(m.group(1)), self.max_limit)}',
                raw_sql,
                flags=re.IGNORECASE,
            )
        return sanitized

# Example usage and error
validator = SQLSafetyValidator(max_limit=500)
bad_sql = "SELECT * FROM users; DROP TABLE users;"
try:
    safe = validator.validate_and_sanitize(bad_sql)
except ValueError as e:
    print(f"Blocked: {e}")  # "Query contains prohibited SQL operations..."

good_sql = "SELECT customer_id, churn_date FROM customers WHERE segment = 'enterprise'"
safe_sql = validator.validate_and_sanitize(good_sql)
print(safe_sql)  # "SELECT ... WHERE segment = 'enterprise' LIMIT 500;"
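Validation point 2 above, letting the database itself vet syntax, can be sketched as an EXPLAIN dry run. This example uses a throwaway in-memory SQLite engine so it's self-contained; the same pattern works on PostgreSQL, where EXPLAIN plans the query without executing it.

```python
from sqlalchemy import create_engine, text

def syntax_check(engine, sql: str) -> bool:
    """Return True if the database can plan the query, False otherwise."""
    try:
        with engine.connect() as conn:
            conn.execute(text(f"EXPLAIN {sql}"))
        return True
    except Exception:
        return False

# Throwaway in-memory database with one table, just for the demo.
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE customers (id INTEGER, churn_date TEXT)"))

print(syntax_check(engine, "SELECT id FROM customers"))  # True
print(syntax_check(engine, "SELEC id FRM customers"))    # False
```

Run this check after the regex and sqlparse gates: it catches hallucinated column names and subtle syntax errors that string-level validation can't.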

Teaching the LLM JOINs and Aggregations: Prompt Patterns That Work

Smaller, cheaper models (think Llama 3.1 8B vs. 70B) struggle with complex JOIN logic. Your prompt must scaffold this reasoning. Use few-shot examples in the system prompt that mirror your schema's relationships.

Instead of just listing tables customers and orders, show an example: "Show me customers with their total order value:" paired with the correct SQL "SELECT c.id, SUM(o.amount) FROM customers c JOIN orders o ON c.id = o.customer_id GROUP BY c.id;".

Structure your final prompt like this:

System: You are a SQL expert. Use the schema below. Only use tables/columns that exist.
Schema: {context}
Examples: {few_shot_examples}
User: {question}
Assistant: (Output only the SQL query, no explanation)

This "output only SQL" instruction is crucial for clean parsing.
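Here's a minimal sketch of assembling that prompt. The message-dict shape follows OpenAI-style chat APIs; adapt it to whatever client you use.

```python
def build_messages(context: str, few_shot_examples: str, question: str) -> list[dict]:
    # System prompt carries the schema, examples, and the output-format rule.
    system = (
        "You are a SQL expert. Use the schema below. "
        "Only use tables/columns that exist. "
        "Output only the SQL query, no explanation.\n\n"
        f"Schema:\n{context}\n\n"
        f"Examples:\n{few_shot_examples}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

msgs = build_messages(
    "Table 'customers': id (INTEGER), segment (VARCHAR)",
    "Q: count customers -> SELECT COUNT(*) FROM customers;",
    "How many customers do we have?",
)
print(msgs[1]["content"])  # How many customers do we have?
```

Keeping the schema and examples in the system message (rather than the user turn) makes it trivial to cache the static prefix across requests.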

From Rows to Answers: The Summarization Layer

Returning a JSON blob of 50 rows to a user who asked "how many churned?" is a failure. The final layer must summarize. For aggregate questions, extract the single number from the result. For list questions, summarize trends, count, and maybe show top examples.

# project/summarizer.py
import pandas as pd
from pandas import DataFrame

class ResultSummarizer:
    """Layer 4: Turns raw query results into a natural language answer."""
    def summarize(self, question: str, result_df: DataFrame) -> str:
        if result_df.empty:
            return "No data found matching your criteria."
        # Heuristic detection of question type
        question_lower = question.lower()
        if any(word in question_lower for word in ['how many', 'count', 'total', 'number']):
            # Likely wants a scalar. Find the first numeric column.
            for col in result_df.columns:
                if pd.api.types.is_numeric_dtype(result_df[col]):
                    total = result_df[col].iloc[0] if len(result_df) == 1 else result_df[col].sum()
                    return f"The total is **{total:,.0f}**."
        # For list results, provide a summary
        row_count = len(result_df)
        col_sample = ", ".join(result_df.columns.tolist()[:3])
        sample = result_df.head(3).to_dict(orient='records')
        return f"Found **{row_count}** records. Columns include: {col_sample}. Sample: {sample}"

Benchmark: Does Model Size Really Matter for SQL Accuracy?

You don't need a 70B parameter model if a 7B model with good prompting is 95% as accurate for your specific schema. The academic standard is the Spider benchmark, which tests complex cross-domain JOINs. Here's what the data typically shows:

| Model Size (Parameters) | Spider Benchmark Accuracy (Execution) | Approx. Tokens/sec (RTX 4090) | Use Case |
|---|---|---|---|
| 7B (e.g., CodeLlama 7B) | ~65-75% | ~80-100 | Good for simple, single-table queries. Fast, cheap. |
| 13B-34B (e.g., CodeLlama 34B) | ~78-85% | ~30-50 | Handles moderate JOINs well. Sweet spot for many apps. |
| 70B+ (e.g., Llama 3.1 70B) | ~85-92% | ~5-15 | Best for complex, nested queries with multiple aggregations. |

The takeaway: start small. A 7B model with excellent schema context and few-shot examples will solve 80% of internal business questions. Join the 84% of Python developers who already use pytest: build a test suite of 100 real user questions and benchmark your chosen model locally before committing to a costly API.
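A minimal, self-contained sketch of such a benchmark harness, using SQLite and comparing execution results rather than SQL strings (two syntactically different queries can be equally correct). `generate_sql` is a hypothetical stand-in for your LLM-backed generator.

```python
import sqlite3

def generate_sql(question: str) -> str:
    # Hypothetical stand-in: replace with your real LLM-backed generator.
    return "SELECT COUNT(*) FROM customers WHERE segment = 'enterprise'"

def execution_accuracy(cases, conn) -> float:
    """Share of questions whose generated SQL returns the gold result."""
    hits = 0
    for question, gold_sql in cases:
        got = conn.execute(generate_sql(question)).fetchall()
        want = conn.execute(gold_sql).fetchall()
        hits += got == want
    return hits / len(cases)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, segment TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "enterprise"), (2, "smb"), (3, "enterprise")])

cases = [("how many enterprise customers?",
          "SELECT COUNT(*) FROM customers WHERE segment = 'enterprise'")]
print(execution_accuracy(cases, conn))  # 1.0
```

Point the harness at a snapshot of your real schema with synthetic data, and the same score lets you compare a 7B model against an API model on your questions, not Spider's.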

One Interface, Multiple Databases: The Adapter Pattern

Your CEO wants data from PostgreSQL, the sales team lives in Snowflake, and legacy reports are in BigQuery. Your core pipeline should be dialect-agnostic. SQLAlchemy is your friend here. Define a base DatabaseConnector class, with subclasses handling dialect-specific schema fetching (e.g., BigQuery's INFORMATION_SCHEMA has different columns).

The key is to keep the SchemaContextFetcher, Validator, and Summarizer working on the normalized output from these adapters. Your validation may also need to be slightly dialect-aware (e.g., MySQL accepts # for comments in addition to the standard --).
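A minimal sketch of that adapter idea, with illustrative names (not a library API): each subclass owns the dialect-specific schema query, while the rest of the pipeline consumes the same (table_name, column_name) shape. The BigQuery dataset name is a placeholder.

```python
from abc import ABC, abstractmethod

class DatabaseConnector(ABC):
    """Normalizes dialect differences behind one interface."""
    @abstractmethod
    def schema_query(self) -> str:
        """Dialect-specific SQL that lists (table_name, column_name)."""

class PostgresConnector(DatabaseConnector):
    def schema_query(self) -> str:
        return ("SELECT table_name, column_name FROM information_schema.columns "
                "WHERE table_schema = 'public'")

class BigQueryConnector(DatabaseConnector):
    def schema_query(self) -> str:
        # BigQuery scopes INFORMATION_SCHEMA per dataset; `my_dataset` is a placeholder.
        return ("SELECT table_name, column_name "
                "FROM `my_dataset`.INFORMATION_SCHEMA.COLUMNS")

connectors = {"postgres": PostgresConnector(), "bigquery": BigQueryConnector()}
print(connectors["postgres"].schema_query().split()[0])  # SELECT
```

Adding a Snowflake backend then means one new subclass, not a rewrite of the fetcher, validator, or summarizer.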

Next Steps: Production-Ready and Tested

You now have the blueprint. To move from script to system:

  1. Containerize: Use a Dockerfile with uv as your installer—it's 10–100x faster than pip for cold installs. Pin your dependencies.
  2. Add Observability: Log every user question, generated SQL, validation result, and query execution time. This data is gold for improving prompts.
  3. Implement Caching: Use redis to cache the SQL generated for frequent, identical questions (e.g., "weekly active users").
  4. Build a Test Suite: Use pytest with Hypothesis to generate random, valid questions and assert the generated SQL is safe and syntactically correct. Test edge cases: ambiguous column names, non-existent tables, malicious prompts.
  5. Set up CI/CD: Use ruff to lint your code in 0.29s vs flake8's 16s and mypy/pyright to enforce type safety before deployment.
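The cache-key idea from step 3 can be sketched like this. A plain dict stands in for redis so the example is self-contained; in production you'd call r.get(key) / r.setex(key, ttl, sql) on a redis.Redis client.

```python
import hashlib

_cache: dict[str, str] = {}  # stand-in for redis

def cache_key(question: str) -> str:
    # Normalize whitespace/case so trivially different phrasings share a key.
    normalized = " ".join(question.lower().split())
    return "ttsql:" + hashlib.sha256(normalized.encode()).hexdigest()

def get_or_generate(question: str, generate) -> str:
    key = cache_key(question)
    if key in _cache:              # redis: r.get(key)
        return _cache[key]
    sql = generate(question)
    _cache[key] = sql              # redis: r.setex(key, 3600, sql)
    return sql

sql = get_or_generate("Weekly  Active Users",
                      lambda q: "SELECT COUNT(DISTINCT user_id) FROM events")
# Same question modulo case/spacing -> cache hit; the generator is not called.
assert get_or_generate("weekly active users", lambda q: "SHOULD NOT RUN") == sql
```

Cache the generated SQL, not the query results, so answers stay fresh while you still skip the slowest and priciest step: the LLM call.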

The goal isn't to replace your data team, but to amplify them. It's about turning "I need this data" from a two-hour Jira ticket into a 10-second conversation. Build the safety rails firmly, and you can give this powerful tool to anyone in the company without losing sleep. Now go make that RTX 4090 earn its keep.