Problem: You Need Vector Search and Don't Know Where to Put It
You're building a RAG pipeline, a semantic search feature, or an embedding-powered recommender. Now you need somewhere to store and query vectors — and you have to decide: stick with PostgreSQL and pgvector, or spin up a dedicated vector database like Pinecone, Weaviate, or Qdrant?
You'll learn:
- What pgvector can and can't do at scale
- When a dedicated vector DB actually pays off
- How to benchmark your own workload before committing
Time: 20 min | Level: Intermediate
Why This Happens
Vector search is fundamentally different from relational queries. Finding the 10 most similar vectors to a query embedding requires computing approximate nearest neighbors (ANN) across potentially millions of high-dimensional points — not a lookup by index key.
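For intuition, this is what exact (brute-force) search costs — the work ANN indexes exist to avoid. A minimal NumPy sketch with random stand-in vectors (real embeddings behave the same for cost purposes):
import numpy as np
# 100k stand-in embeddings at 1536 dims (~600 MB as float32)
corpus = np.random.rand(100_000, 1536).astype(np.float32)
query = np.random.rand(1536).astype(np.float32)
# Exact cosine similarity against every row: ~150M multiply-adds per query
sims = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
top10 = np.argsort(-sims)[:10]  # exact top-10; cost grows linearly with corpus size
HNSW and friends trade a little recall to skip almost all of that per-query work.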
PostgreSQL wasn't designed for this. pgvector adds the capability, but the tradeoffs matter:
Common symptoms that push teams toward dedicated DBs:
- Queries slow down past ~500k vectors even with HNSW indexes
- Memory pressure from loading vector indexes alongside relational data
- No native support for filtered ANN search that stays fast at scale
- Operational complexity when tuning work_mem and index build parameters
What pgvector Actually Gives You
pgvector (v0.7+) supports two index types: IVFFlat and HNSW. HNSW is almost always the right choice for production — better recall, faster queries at the cost of more memory and slower build times.
Step 1: Set Up pgvector with HNSW
-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create table with embedding column
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT,
metadata JSONB,
embedding vector(1536) -- Dimension must match your model (1536 = OpenAI ada-002)
);
-- Build HNSW index
-- m: connections per layer (higher = better recall, more memory)
-- ef_construction: build-time search depth (higher = better recall, slower build)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
Expected: Index builds in seconds for <100k rows. Budget ~10-15 minutes for 1M rows.
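Loading data is plain SQL. Here's a sketch of batch insertion with psycopg2, assuming you've already computed chunks as (text, metadata, embedding) tuples — the embedding step itself depends on your model:
import psycopg2
from psycopg2.extras import Json, execute_values
conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()
# chunks: [(text, metadata_dict, embedding_list), ...] — computed upstream
rows = [(text, Json(meta), emb) for text, meta, emb in chunks]
# Batch insert; the Python list is cast to pgvector's vector type server-side
execute_values(
    cur,
    "INSERT INTO documents (content, metadata, embedding) VALUES %s",
    rows,
    template="(%s, %s, %s::vector)",
)
conn.commit()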
Step 2: Query with Filtering
-- Set runtime search depth (higher = better recall, slower)
SET hnsw.ef_search = 100;
-- Semantic search with metadata filter
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE metadata->>'source' = 'internal-wiki' -- Filter (applied to index-scan results)
ORDER BY embedding <=> $1 -- Then rank by similarity
LIMIT 10;
The catch: Combining WHERE clauses with ANN search degrades recall significantly when the filter is selective. With an HNSW index scan, pgvector retrieves the nearest candidates first and applies the filter afterward — so a filter that matches only 5% of rows discards roughly 95% of the candidates, and you can come back with fewer than 10 results while true matches sit unreached in the graph. At small scale this is fine. At millions of rows with narrow filters, recall tanks. (pgvector 0.8 adds iterative index scans to mitigate this; the version tested here, 0.7.4, predates them.)
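If you filter on a handful of known values, one mitigation is a partial HNSW index per value, so the graph contains only matching rows and the filtered query scans a complete index. A sketch using the 'internal-wiki' filter from above (the index name is illustrative):
import psycopg2
conn = psycopg2.connect("postgresql://localhost/mydb")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY can't run inside a transaction
cur = conn.cursor()
# The graph holds only 'internal-wiki' rows, so no candidates get discarded
cur.execute("""
    CREATE INDEX CONCURRENTLY documents_wiki_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
    WHERE metadata->>'source' = 'internal-wiki'
""")
This only scales to a few known filter values; for arbitrary filters, raise hnsw.ef_search or reach for a database that filters during graph traversal.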
Step 3: Tune for Your Workload
-- Give HNSW index builds enough memory (build-time; set per session or in postgresql.conf)
SET maintenance_work_mem = '2GB';
-- Check index usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 10;
You should see: Index Scan using documents_embedding_idx — not Seq Scan. If you see a sequential scan, your index isn't being used — usually because the ORDER BY operator doesn't match the index's operator class (e.g., an L2 <-> query against a vector_cosine_ops index), or because the index doesn't exist yet.
HNSW index scan on the left, sequential scan on the right — same query, with and without a matching index
When pgvector Is Enough
pgvector wins in these scenarios:
You already run PostgreSQL. No new infrastructure, no new operational burden, transactions work across your relational and vector data, and your existing backup/monitoring stack covers it.
Your corpus is under 1-2M vectors. HNSW recall stays above 95% and p99 latency stays under 50ms for most workloads at this scale. Past that, you need careful tuning or a different tool.
You need hybrid queries. Joining vector similarity with relational data (user permissions, date ranges, structured metadata) is natural SQL. Replicating this in a dedicated vector DB requires syncing two systems.
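For example, scoping search results to documents a user is allowed to read is one join away. A sketch — the permissions table, its columns, and the query_vec/user_id variables are hypothetical stand-ins for your schema and app code:
import psycopg2
conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()
# Hypothetical ACL table: permissions(user_id, document_id)
cur.execute("""
    SELECT d.id, d.content, 1 - (d.embedding <=> %s::vector) AS similarity
    FROM documents d
    JOIN permissions p ON p.document_id = d.id
    WHERE p.user_id = %s
    ORDER BY d.embedding <=> %s::vector
    LIMIT 10
""", (query_vec, user_id, query_vec))
print(cur.fetchall())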
Budget matters. Pinecone's serverless tier is free for small workloads, but dedicated pod-based plans start at ~$70/month and climb quickly. Running pgvector on an existing RDS or Supabase instance costs nothing extra.
When You Need a Dedicated Vector DB
Pinecone
Best for teams that don't want to manage infrastructure. Serverless Pinecone handles scaling automatically and supports filtered search without recall degradation — it indexes metadata alongside vectors so filters don't reduce the ANN candidate pool.
import pinecone
pc = pinecone.Pinecone(api_key="YOUR_KEY")
index = pc.Index("my-index")
# Upsert vectors
index.upsert(vectors=[
{"id": "doc-1", "values": embedding, "metadata": {"source": "wiki", "year": 2025}}
])
# Query with filter — no recall penalty here
results = index.query(
vector=query_embedding,
top_k=10,
filter={"source": {"$eq": "wiki"}} # Applied natively during ANN search
)
Trade-off: Vendor lock-in, no joins with external relational data, egress costs at scale.
Qdrant
Open-source, self-hosted or cloud, and arguably the best performance-per-dollar for high-throughput workloads. Supports payload-based filtering during ANN search (not after), which is the key architectural difference from pgvector.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
client = QdrantClient("localhost", port=6333)
results = client.search(
collection_name="documents",
query_vector=query_embedding,
query_filter=Filter(
must=[FieldCondition(key="source", match=MatchValue(value="wiki"))]
),
limit=10
)
Qdrant also supports sparse vectors and sparse-dense hybrid search out of the box — useful if you want to combine BM25-style keyword relevance with semantic similarity (often outperforms either alone).
Weaviate
Strongest for teams that want a GraphQL-first API and built-in text vectorization (it can call your embedding model automatically on insert). Good for prototyping, slightly more complex operationally than Qdrant.
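A minimal query sketch with the v4 Python client, assuming a local instance and an existing Documents collection populated elsewhere:
import weaviate
from weaviate.classes.query import Filter
client = weaviate.connect_to_local()
docs = client.collections.get("Documents")
# Filtered vector search; the filter is applied during the search, not after
results = docs.query.near_vector(
    near_vector=query_embedding,
    limit=10,
    filters=Filter.by_property("source").equal("wiki"),
)
client.close()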
The Real Benchmark: Your Data
Don't trust generic benchmarks. Run this before you commit:
import time
import psycopg2
import numpy as np
conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()
# Generate a random 1536-dim query vector (swap in a real embedding for representative numbers)
query_vec = np.random.rand(1536).tolist()
# Warm up
cur.execute("SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT 10", (query_vec,))
# Measure p50 and p99 over 100 queries
latencies = []
for _ in range(100):
    start = time.perf_counter()
    cur.execute("SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT 10", (query_vec,))
    cur.fetchall()
    latencies.append((time.perf_counter() - start) * 1000)
latencies.sort()
print(f"p50: {latencies[49]:.1f}ms")  # 50th of 100 sorted samples
print(f"p99: {latencies[98]:.1f}ms")  # 99th of 100 sorted samples (index 99 is the max)
Decision thresholds (rough guidelines):
- p99 < 50ms at your expected corpus size → pgvector is fine
- p99 50-200ms → tune hnsw.ef_search first, then rebuild with higher m and ef_construction
- p99 > 200ms or recall < 90% → evaluate dedicated options
Verification
# Check recall against brute-force exact search
# If HNSW recall < 0.90, increase m and ef_construction
psql mydb <<'SQL'
SET enable_indexscan = off;  -- forces exact (sequential) search
CREATE TEMP TABLE exact AS
  SELECT id FROM documents ORDER BY embedding <=> '[...]'::vector LIMIT 10;
SET enable_indexscan = on;   -- back to the HNSW index
CREATE TEMP TABLE approx AS
  SELECT id FROM documents ORDER BY embedding <=> '[...]'::vector LIMIT 10;
SELECT COUNT(*) * 1.0 / 10 AS recall
FROM exact WHERE id IN (SELECT id FROM approx);
SQL
You should see: recall >= 0.90 for production use. Below that, bump m from 16 to 32.
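HNSW build parameters can't be altered in place — changing m means rebuilding the index. A sketch that builds the replacement concurrently so queries keep an index in the meantime:
import psycopg2
conn = psycopg2.connect("postgresql://localhost/mydb")
conn.autocommit = True  # required for CREATE/DROP INDEX CONCURRENTLY
cur = conn.cursor()
# Build the replacement before dropping the old index
cur.execute("""
    CREATE INDEX CONCURRENTLY documents_embedding_idx_v2
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 32, ef_construction = 128)
""")
cur.execute("DROP INDEX CONCURRENTLY documents_embedding_idx")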
What You Learned
- pgvector with HNSW is production-ready for corpora under ~2M vectors with loose filters
- Pre-filtering in pgvector degrades recall — dedicated DBs handle this natively
- Qdrant is the strongest self-hosted alternative; Pinecone for managed simplicity
- Benchmark with your actual data and filter selectivity before making the call
Limitation: These numbers shift with embedding dimension, hardware, and concurrent query load. A 768-dim model (e.g., nomic-embed-text) puts less pressure on pgvector than 3072-dim (OpenAI text-embedding-3-large).
When NOT to use a dedicated vector DB: If you have fewer than 500k vectors and your filters match >20% of the corpus, pgvector will almost certainly be fast enough — and one less system to operate is worth a lot.
Tested on pgvector 0.7.4, PostgreSQL 16.2, Qdrant 1.9, Pinecone serverless (us-east-1), Ubuntu 24.04 / 32GB RAM