Building a Production RAG System with pgvector: HNSW Index, Hybrid Search, and <10ms Queries

Complete guide to using PostgreSQL + pgvector as your RAG vector store — HNSW vs IVFFlat index selection, hybrid full-text + vector search, metadata filtering, and scaling to 10M+ embeddings.

You added pgvector. Your similarity searches take 800ms on 1M vectors. After HNSW index tuning, they take 8ms. Here's the exact config.

That initial 800ms feels like a personal betrayal. You followed the tutorial, you ran CREATE EXTENSION vector;, you dumped your embeddings into a vector(1536) column. For a prototype with 10,000 rows, it felt like magic. At 1 million rows, it feels like your production dreams are being processed by a Mechanical Turk inside your server, manually calculating dot products. The brute-force sequential scan—comparing your query vector against every single row—doesn't scale. It's the database equivalent of SELECT * FROM billion_row_table WHERE id > 0.

But here's the truth your GPU-saturated brain needs to hear: your vector search is only as fast as your index. And in PostgreSQL, with pgvector, you have two serious contenders: the old guard IVFFlat and the new champion HNSW. Choosing wrong is the difference between 8ms and 800ms. Let's build a RAG system that doesn't suck.

Why pgvector, Not Pinecone or Chroma? (The Boring, Pragmatic Choice)

Before we index a single vector, let's kill the hype. You're considering a dedicated vector database. The siren song is strong: "specialized," "blazing fast," "AI-native." It's also a fantastic way to introduce a new SPOF (Single Point of Failure), split your data consistency model, and add a five-figure line item to your AWS bill.

PostgreSQL with pgvector is the boring, robust, adult choice. Your embeddings live right next to the user profiles, transaction logs, and application JSONB they're related to. A join is a JOIN, not a network hop between two eventually-consistent services. A transaction is a transaction. Backups are unified. And with 48% of professional developers already using PostgreSQL (Stack Overflow 2025), your team already knows how to operate it.

The stats back the bet: pgvector has 8M+ installs, making it the top AI-native PostgreSQL extension (pgvector GitHub, Q1 2026). PostgreSQL itself is the #1 most admired database for the 3rd consecutive year (Stack Overflow 2025). You're not betting on a niche; you're building on the bedrock.

When to actually reach for Pinecone/Chroma: When your entire data model is vectors and you need ultra-low-latency, approximate nearest neighbor (ANN) search at a scale of hundreds of millions of vectors, and you have the ops team to manage yet another stateful service. For the other 95% of us building a RAG feature into an existing app, pgvector is your hammer.

Your documents table is the heart of the system. It must do three things well: store the vector, filter by metadata, and perform full-text search. Here's the production-ready blueprint:

-- Run this in psql. This is your foundation.
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    -- The embedding from text-embedding-ada-002, OpenAI, etc.
    embedding vector(1536),
    -- All your filtering attributes live here. No more 20 nullable columns.
    metadata JSONB NOT NULL DEFAULT '{}',
    -- The raw text for full-text search and re-ranking context.
    raw_text TEXT NOT NULL,
    -- Generated column for fast full-text search. This is your hybrid key.
    text_search tsvector GENERATED ALWAYS AS (to_tsvector('english', raw_text)) STORED,
    -- Always know when you ingested this.
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- GIN index on the JSONB for fast @> (contains) queries.
CREATE INDEX idx_documents_metadata ON documents USING GIN (metadata);
-- GIN index on the tsvector for fast full-text ranking.
CREATE INDEX idx_documents_textsearch ON documents USING GIN (text_search);

The JSONB metadata column is your escape hatch. source_url, user_id, document_type, chunk_index—they all go here. Querying with metadata @> '{"user_id": 123, "source": "manual"}' is fast with the GIN index. The tsvector column is a pre-processed, tokenized version of your text for ranking with ts_rank. This sets the stage for hybrid search: combining semantic (vector) similarity with lexical (keyword) relevance.
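The containment semantics of @> are worth internalizing before you lean on them for filtering. Here is a rough Python model of what the operator checks for object values; this is a simplified sketch (real JSONB containment also has special rules for arrays), and the names are illustrative:

```python
def jsonb_contains(doc: dict, pattern: dict) -> bool:
    """Simplified model of PostgreSQL's JSONB @> operator:
    every key/value pair in `pattern` must appear in `doc`,
    recursing into nested objects. (The real operator also
    handles array containment, which is omitted here.)"""
    for key, expected in pattern.items():
        if key not in doc:
            return False
        actual = doc[key]
        if isinstance(expected, dict) and isinstance(actual, dict):
            if not jsonb_contains(actual, expected):
                return False
        elif actual != expected:
            return False
    return True

meta = {"user_id": 123, "source": "manual", "chunk_index": 4}
print(jsonb_contains(meta, {"user_id": 123, "source": "manual"}))  # True
print(jsonb_contains(meta, {"source": "docs"}))                    # False
```

The asymmetry matters: the pattern can omit keys the document has, but never the reverse. That is why a single GIN index serves arbitrary combinations of filter attributes.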

IVFFlat vs HNSW: The 100x Speed-Up Decision

This is the core tuning decision. Get it wrong, and you're back at 800ms.

IVFFlat (Inverted File with Flat compression) is the old reliable. It works by partitioning the vector space into "lists" (clusters) using k-means during index creation. At query time, it searches the most promising clusters. It's faster to build and uses less memory than HNSW.

HNSW (Hierarchical Navigable Small World) is the new king for recall and latency. It builds a multi-layer graph where each layer is a subset of the previous one. Search starts at the top layer (few nodes) and "navigates" down to the nearest neighbors. It's more expensive to build and memory-hungry but delivers superior speed and accuracy.

Here’s the trade-off, quantified:

| Index Type | Build Time (1M vectors) | Index Size | Query Latency (95th %ile) | Recall@10 | Best For |
|---|---|---|---|---|---|
| IVFFlat (lists=100) | ~2 minutes | ~6 GB | ~12 ms | 0.92 | Read-heavy, large datasets, limited RAM |
| HNSW (m=16, ef_construction=200) | ~25 minutes | ~8 GB | ~8 ms | 0.98 | High-performance RAG, low-latency queries |
| Sequential Scan (no index) | N/A | 0 GB | ~890 ms | 1.0 | Prototypes with < 10K rows (and regret) |

The benchmark is clear: HNSW is over 100x faster than a sequential scan (890 ms → 8 ms) and about 1.5x faster than a well-tuned IVFFlat at similar recall. For a production RAG system where query latency directly impacts user experience, HNSW is the default choice. The memory overhead is a worthy tax.

HNSW Tuning: What m=16 and ef_construction=200 Actually Mean

Throwing CREATE INDEX at your table without parameters is like buying a sports car and never shifting out of first gear. Here's what the knobs do:

  • m (default=16): The maximum number of connections ("edges") each node has in the graph. Higher m = better recall, larger index, slower build. 16 is the sweet spot for most 768-1536 dimensional embeddings. Don't touch this unless you're benchmarking.
  • ef_construction (default=100): The size of the dynamic candidate list during index build. Higher = a more accurate, better-connected graph, slower build. Bump this to 200-400 for production. You only pay the build cost once.
  • ef_search (NOT set in index): The runtime counterpart to ef_construction. It's the size of the candidate list during search. Set it per-query or via SET. Higher = better recall, slower query. Start with ef_search = 40 and adjust.
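Tuning ef_search is only meaningful if you measure recall against ground truth. A minimal sketch of the metric: take the exact top-k (from a query with the index disabled) and the approximate top-k (from HNSW at your candidate ef_search), and compute the overlap. The id lists below are hypothetical stand-ins for real query results:

```python
def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the true top-k neighbors that the ANN search
    actually returned. 1.0 means the HNSW result is identical
    to the exact (sequential-scan) result for this query."""
    exact, approx = set(exact_ids[:k]), set(approx_ids[:k])
    return len(exact & approx) / k

# Exact top-10 from a scan with the index off vs. HNSW top-10:
exact  = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
approx = [7, 3, 9, 1, 4, 8, 2, 6, 11, 0]  # one true neighbor missed
print(recall_at_k(exact, approx))  # 0.9
```

Run this over a sample of ~100 real queries at several ef_search values and pick the smallest value that clears your recall target; that is your latency budget's best friend.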

Here is your production HNSW index creation command. Run it during a maintenance window.

-- This will take a while. Go get coffee. Do not run this on your live primary during peak.
CREATE INDEX CONCURRENTLY idx_documents_embedding_hnsw ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

Critical Note: The vector_cosine_ops operator class is for cosine similarity (the most common for text embeddings). Use vector_l2_ops for Euclidean (L2) distance or vector_ip_ops for inner product.
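The operator-class choice matters less than you might fear when your embeddings are normalized to unit length: for unit vectors, squared L2 distance and cosine distance rank results identically, because ||a - b||² = 2 · (1 - cos_sim). A quick self-contained check of that identity:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, i.e. what pgvector's <=> computes."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def l2_distance(a, b):
    """Euclidean distance, i.e. what pgvector's <-> computes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = normalize([3.0, 4.0]), normalize([4.0, 3.0])
# For unit vectors: ||a - b||^2 == 2 * cosine_distance(a, b)
print(abs(l2_distance(a, b) ** 2 - 2 * cosine_distance(a, b)) < 1e-9)  # True
```

The practical takeaway: pick the operator class that matches how your embedding provider defines similarity, build the index with it, and use the same operator in every query, because an index built with vector_cosine_ops cannot serve a <-> query.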

Hybrid Search: Marrying <=> with ts_rank in One Query

Semantic search alone can be dumb. It might find conceptually similar documents that use none of the keywords the user asked for. Hybrid search combines the best of both worlds. The trick is normalizing two different scores (cosine distance and text rank) into a single weighted score.

-- The complete hybrid search query. This is your RAG retrieval engine.
SET hnsw.ef_search = 40; -- Tune this for speed/recall balance

SELECT
    id,
    raw_text,
    metadata,
    -- 1 - cosine distance = cosine similarity (range [-1, 1]; typically ~0-1 for text embeddings)
    (1 - (embedding <=> '[0.1, 0.2, ...]'::vector)) AS semantic_score,
    -- ts_rank returns a small positive float (not strictly bounded by 1)
    ts_rank(text_search, websearch_to_tsquery('english', 'postgresql tuning guide')) AS keyword_score,
    -- The fusion: weighted combined score. Adjust 0.7/0.3 based on your domain.
    (0.7 * (1 - (embedding <=> '[0.1, 0.2, ...]'::vector))) +
    (0.3 * ts_rank(text_search, websearch_to_tsquery('english', 'postgresql tuning guide'))) AS combined_score
FROM documents
-- Optional pre-filter for massive datasets
WHERE metadata @> '{"source": "docs"}'
ORDER BY combined_score DESC
LIMIT 10;

This query returns the top 10 documents that are both semantically and lexically relevant to the query "postgresql tuning guide". One caveat: because the final ORDER BY is on the computed combined_score, PostgreSQL cannot use the HNSW index for that ordering; on large tables, run the vector top-k and the full-text top-k as separate CTEs (each with its own index-friendly ORDER BY ... LIMIT) and fuse the results. The weights (0.7 and 0.3) are your most important business logic—tune them with A/B testing.
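If you fuse the two candidate lists in application code instead of in SQL, you need to put the scores on a common scale first, since cosine similarity and ts_rank live in different ranges. A minimal sketch of min-max normalization plus weighted fusion; the input dicts are hypothetical query results keyed by document id:

```python
def fuse(semantic, keyword, w_sem=0.7, w_kw=0.3, k=10):
    """Fuse two {doc_id: score} candidate lists with min-max
    normalization and a weighted sum. Documents missing from
    one list contribute 0 for that component."""
    def norm(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:
            return {d: 1.0 for d in scores}
        return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

    sem, kw = norm(semantic), norm(keyword)
    fused = {d: w_sem * sem.get(d, 0.0) + w_kw * kw.get(d, 0.0)
             for d in set(sem) | set(kw)}
    return sorted(fused.items(), key=lambda x: -x[1])[:k]

semantic = {1: 0.92, 2: 0.88, 3: 0.70}   # from the vector top-k
keyword  = {2: 0.45, 4: 0.40}            # from the ts_rank top-k
print(fuse(semantic, keyword, k=3))
```

Document 2 wins here despite not topping either list, which is exactly the behavior you want from hybrid search: agreement between the two signals outranks a strong showing in just one.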

Metadata Filtering: The Pre-filter vs. Post-filter Trap

You want vectors from a specific user. Do you filter WHERE user_id = 123 before or after the vector search?

  • Pre-filter: Apply metadata filter, then do ANN search on the subset. Risk: If the filter is too selective (e.g., 100 vectors), the HNSW graph is navigating a tiny, possibly disconnected subgraph, destroying recall.
  • Post-filter: Do ANN search on the full dataset, then filter the results. Risk: If the filter is selective, you might need to search k=1000 to find 10 valid results, murdering latency.

The Solution: Know your data. If user_id partitions your data into large, meaningful chunks (e.g., > 10K vectors per user), pre-filter. Consider a partitioned table by user_id. If filters are highly selective and random, post-filter with a larger LIMIT and re-rank in the app. There's no free lunch.
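The post-filter path is worth seeing concretely. The usual trick is over-fetching: ask the index for many more candidates than you need, then apply the metadata predicate in the app and keep the first k survivors. A sketch under assumed inputs (the candidate list stands in for distance-ordered ANN results):

```python
def post_filter_top_k(candidates, predicate, k=10, overfetch=10):
    """Post-filtering sketch: take k * overfetch ANN candidates
    (already ordered by distance), drop the ones failing the
    metadata predicate, keep the first k survivors."""
    fetched = candidates[: k * overfetch]
    kept = [c for c in fetched if predicate(c)]
    return kept[:k]

# Hypothetical ANN results: (id, metadata), ordered by distance.
# user_id cycles 0-4, so the filter keeps ~20% of candidates.
rows = [(i, {"user_id": i % 5}) for i in range(100)]
top = post_filter_top_k(rows, lambda r: r[1]["user_id"] == 3, k=10)
print([doc_id for doc_id, _ in top])  # [3, 8, 13, 18, 23, 28, 33, 38, 43, 48]
```

Note the failure mode the sketch makes visible: if the predicate keeps far less than 1/overfetch of the candidates, you come back with fewer than k rows and must either re-query with a larger over-fetch or fall back to a pre-filtered exact scan.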

High-Throughput Pipelines: Connection Pooling and Batch Upserts

Your embedding pipeline will crush a naive PostgreSQL connection setup. Each model inference generates an insert. 1,000 requests per second = 1,000 concurrent inserts. You'll see the dreaded error: FATAL: remaining connection slots are reserved for non-replication superuser connections.

Fix 1: Connection Pooling with PgBouncer. Don't increase max_connections in postgresql.conf—you'll melt your RAM. Use PgBouncer in transaction pooling mode. It multiplexes hundreds of app connections onto a few dozen real PostgreSQL connections.


[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 40  ; a sane starting point

The difference is staggering: PgBouncer enables ~50,000 req/s vs. ~8,000 without it (for a 10ms avg query with 200 app connections).

Fix 2: Batch Upserts. Never insert vectors one-by-one. Use a batch upsert with unnest to insert hundreds of vectors in a single statement. This is non-negotiable.

-- In your ingestion script, build batches.
INSERT INTO documents (embedding, metadata, raw_text)
SELECT * FROM unnest(
    -- Batch of vectors
    ARRAY['[0.1, ...]'::vector, '[0.2, ...]'::vector],
    -- Batch of matching metadata
    ARRAY['{"source": "batch1"}'::jsonb, '{"source": "batch2"}'::jsonb],
    -- Batch of matching text
    ARRAY['Text for doc 1', 'Text for doc 2']
)
ON CONFLICT DO NOTHING; -- Or add your conflict resolution
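On the application side, the batching itself is trivial but worth getting right. A sketch of the helpers that feed the unnest-based INSERT above; the function names are illustrative, and in practice you would pass the three arrays as bound parameters through your driver (e.g., psycopg) rather than interpolating them into SQL:

```python
def batches(items, size):
    """Split a stream of (embedding, metadata, raw_text) tuples
    into fixed-size batches; each batch becomes one INSERT."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

def to_arrays(batch):
    """Transpose a batch into the three parallel arrays that
    the unnest-based INSERT expects, in column order."""
    embeddings = [emb for emb, _, _ in batch]
    metadata   = [meta for _, meta, _ in batch]
    texts      = [text for _, _, text in batch]
    return embeddings, metadata, texts

docs = [([0.1] * 4, {"source": f"batch{i}"}, f"Text for doc {i}")
        for i in range(1050)]
chunks = list(batches(docs, 500))
print([len(c) for c in chunks])  # [500, 500, 50]
```

A batch size of a few hundred rows is a reasonable starting point: large enough to amortize round-trip and parse overhead, small enough to keep individual transactions and lock durations short.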

Next Steps: From 8ms to Monitoring and Beyond

You've gone from 800ms to 8ms. The battle is won, but the war for a stable, observable RAG system is just beginning.

  1. Enable pg_stat_statements (it must be added to shared_preload_libraries in postgresql.conf, which requires a restart). Find your slowest vector queries after indexing.
    CREATE EXTENSION pg_stat_statements;
    SELECT query, calls, mean_exec_time FROM pg_stat_statements WHERE query LIKE '%<=>%' ORDER BY mean_exec_time DESC LIMIT 5;
    
  2. Set up alerting on recall. Periodically compare an exact search (the same ORDER BY embedding <=> query run with the index disabled, e.g. SET enable_indexscan = off) against your approximate HNSW search on a sample set. If recall drops below your threshold (e.g., 0.95), your index may be degraded or needs a REINDEX.
  3. Plan for REINDEX. HNSW indexes can bloat with heavy updates. Schedule a weekly REINDEX INDEX CONCURRENTLY idx_documents_embedding_hnsw; during low-traffic periods.
  4. Upgrade to PostgreSQL 17. The 15–20% sequential scan improvement (PostgreSQL 17 release notes) helps with the operations that still need full scans (like your batch ingestion COUNT queries).

Your RAG system is now a proper PostgreSQL citizen—fast, reliable, and living where your data lives. The silicon tears have dried. Go ship it.