Problem: Your PostgreSQL Data Can't Answer "Find Something Similar"
LIKE queries match exact strings. Full-text search matches keywords. Neither understands meaning.
When a user searches "affordable sedans under 30k" in your car listings database, they won't find rows containing "budget compact vehicles" — even though it's the same intent.
pgvector fixes this. It adds a vector column type to PostgreSQL, lets you store text embeddings alongside your existing data, and answers similarity questions with a single SQL query.
You'll learn:
- How to install pgvector and create your first vector column
- How to generate and store OpenAI embeddings in PostgreSQL
- How to run fast ANN (approximate nearest neighbor) queries with HNSW indexes
- How to combine semantic search with traditional SQL filters (hybrid search)
Time: 25 min | Difficulty: Intermediate
Why pgvector Instead of a Dedicated Vector DB
Pinecone, Qdrant, and Weaviate are purpose-built for vectors. They're excellent — but they're extra infrastructure, extra cost, and extra sync complexity.
If you're already on PostgreSQL, pgvector gives you 80–90% of the capability with zero new services to run. Your embeddings sit in the same database as your users, products, and orders. You get joins, transactions, backups, and row-level security for free.
When to use pgvector:
- Existing PostgreSQL stack
- < 5M vectors (scales further with partitioning)
- You want SQL-native hybrid search (vector + filters in one query)
When to use a dedicated vector DB:
- > 10M vectors with strict sub-10ms latency SLAs
- Multi-tenancy with per-tenant vector spaces at scale
- Teams without PostgreSQL expertise
How pgvector Works
pgvector stores vectors as a new column type (vector(dimensions)). Each row gets a floating-point array — the embedding of your text, image, or structured data.
At query time, pgvector computes the distance between your query vector and every stored vector. With an index, it approximates this with ANN search instead of exhaustive scan.
Text/Content ──▶ Embedding Model ──▶ [0.12, -0.45, 0.88, …] ──▶ vector column
│
Query Text ────▶ Embedding Model ──▶ query vector ──▶ <=> operator ──▶ Top-K results
Three distance operators:
- <=> — cosine distance (most common for text)
- <-> — L2 / Euclidean distance
- <#> — negative inner product (for dot-product similarity)
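For intuition, the <=> operator is just 1 minus cosine similarity. A minimal pure-Python sketch of the same calculation (pgvector computes this in C; this is for understanding only):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    # Mirrors pgvector's <=> operator: 1 - cosine similarity.
    # 0 = same direction, 1 = orthogonal, 2 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # 0.0 (identical)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 (orthogonal)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 (opposite)
```

Note that cosine distance only cares about direction, not magnitude — which is why it's the usual choice for text embeddings.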
Solution
Step 1: Install pgvector
Docker (fastest for dev):
# Postgres 16 with pgvector pre-installed
docker run -d \
--name pgvector-dev \
-e POSTGRES_PASSWORD=secret \
-p 5432:5432 \
pgvector/pgvector:pg16
Existing PostgreSQL on Ubuntu:
# Install build deps (compiler toolchain plus Postgres headers)
sudo apt install build-essential git postgresql-server-dev-16
# Clone and build pgvector
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
Enable the extension in your database:
CREATE EXTENSION IF NOT EXISTS vector;
-- Verify
SELECT * FROM pg_extension WHERE extname = 'vector';
Expected output:
extname | extversion
---------+-----------
vector | 0.8.0
If it fails:
- ERROR: could not open extension control file → pgvector isn't installed at the OS level; repeat the build step above
- ERROR: extension "vector" already exists → safe to ignore, or use IF NOT EXISTS
Step 2: Create a Table with a Vector Column
-- Example: product catalog with semantic search
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
description TEXT NOT NULL,
price NUMERIC(10, 2),
category TEXT,
-- 1536 dimensions = OpenAI text-embedding-3-small output size
embedding vector(1536)
);
The dimension count must match your embedding model exactly:
| Model | Dimensions |
|---|---|
| text-embedding-3-small | 1536 |
| text-embedding-3-large | 3072 |
| text-embedding-ada-002 | 1536 |
| nomic-embed-text (local) | 768 |
| mxbai-embed-large (local) | 1024 |
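A vector(1536) column will reject a mismatched embedding at insert time, but failing in application code gives a clearer error. A small guard you could add (the MODEL_DIMENSIONS mapping and function name are illustrative, not part of any library):

```python
# Expected output sizes for the models in the table above
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
}

def check_dimensions(model: str, embedding: list[float]) -> list[float]:
    # Fail fast if the embedding length doesn't match the column's
    # declared dimension for this model.
    expected = MODEL_DIMENSIONS[model]
    if len(embedding) != expected:
        raise ValueError(
            f"{model} should produce {expected} dims, got {len(embedding)}"
        )
    return embedding

check_dimensions("text-embedding-3-small", [0.0] * 1536)  # passes silently
```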
Step 3: Generate and Store Embeddings
Install dependencies:
pip install openai psycopg2-binary python-dotenv
# embed_products.py
import os
import psycopg2
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def get_embedding(text: str) -> list[float]:
# Embed the description — what users search for
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
return response.data[0].embedding
def store_product(conn, name: str, description: str, price: float, category: str):
embedding = get_embedding(description)
with conn.cursor() as cur:
cur.execute(
"""
INSERT INTO products (name, description, price, category, embedding)
VALUES (%s, %s, %s, %s, %s)
""",
(name, description, price, category, embedding)
)
conn.commit()
# Sample data
products = [
("Trail Runner X9", "Lightweight hiking boot for rocky mountain terrain, waterproof", 189.99, "footwear"),
("Budget Road Bike", "Affordable entry-level road bicycle for commuting and fitness", 549.00, "cycling"),
("Noise-Cancel Pro Headphones", "Over-ear headphones with active noise cancellation for travel", 299.00, "audio"),
("Ultralight Tent", "Two-person backpacking tent weighing under 1kg, easy setup", 399.00, "camping"),
("Commuter E-Bike", "Electric bicycle for city commuting, 60km range per charge", 1299.00, "cycling"),
]
conn = psycopg2.connect("postgresql://postgres:secret@localhost:5432/postgres")
for name, desc, price, category in products:
store_product(conn, name, desc, price, category)
print(f"Stored: {name}")
conn.close()
Run it:
OPENAI_API_KEY=sk-... python embed_products.py
Expected output:
Stored: Trail Runner X9
Stored: Budget Road Bike
Stored: Noise-Cancel Pro Headphones
Stored: Ultralight Tent
Stored: Commuter E-Bike
Step 4: Run a Semantic Search Query
# search.py
import os
import psycopg2
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def semantic_search(query: str, top_k: int = 3) -> list[dict]:
# Embed the user's query with the same model used at insert time
query_embedding = client.embeddings.create(
model="text-embedding-3-small",
input=query
).data[0].embedding
conn = psycopg2.connect("postgresql://postgres:secret@localhost:5432/postgres")
with conn.cursor() as cur:
cur.execute(
"""
SELECT
name,
description,
price,
category,
-- cosine distance: 0 = identical, 2 = opposite
1 - (embedding <=> %s::vector) AS similarity
FROM products
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(query_embedding, query_embedding, top_k)
)
rows = cur.fetchall()
conn.close()
return [
{"name": r[0], "description": r[1], "price": r[2], "category": r[3], "similarity": round(r[4], 4)}
for r in rows
]
results = semantic_search("bike for getting around the city")
for r in results:
print(f"{r['similarity']:.4f} {r['name']} (${r['price']})")
Expected output:
0.8821 Commuter E-Bike ($1299.00)
0.7634 Budget Road Bike ($549.00)
0.4112 Trail Runner X9 ($189.99)
The E-Bike wins even though the query never says "electric" — the model understands "getting around the city" maps to commuter use.
Step 5: Add an HNSW Index for Production Speed
Without an index, pgvector does an exact sequential scan — fine for < 50k rows, too slow beyond that.
HNSW (Hierarchical Navigable Small World) is the recommended index type. It's faster at query time than IVFFlat and doesn't require a training step.
-- Create HNSW index on cosine distance
-- m: connections per node (16 is default; higher = better recall, more memory)
-- ef_construction: build-time search width (64 default; higher = better index quality)
CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
Tune query-time recall with ef_search (per session or globally):
-- Higher ef_search = better recall, slightly slower query
SET hnsw.ef_search = 100;
-- Now run your similarity query — same syntax, index is used automatically
SELECT name, 1 - (embedding <=> '[...]'::vector) AS similarity
FROM products
ORDER BY embedding <=> '[...]'::vector
LIMIT 5;
IVFFlat vs HNSW:
| | IVFFlat | HNSW |
|---|---|---|
| Index build speed | Fast | Slower |
| Query speed | Fast | Faster |
| Recall accuracy | Good | Better |
| Memory usage | Lower | Higher |
| Requires training data | Yes (need rows first) | No |
Rule of thumb: Use HNSW for new projects. IVFFlat only if memory is tightly constrained.
Step 6: Hybrid Search — Semantic + SQL Filters
The real power of pgvector over standalone vector databases: combine vector similarity with any SQL predicate in a single query.
def hybrid_search(query: str, category: str | None = None, max_price: float | None = None, top_k: int = 5) -> list[dict]:
query_embedding = client.embeddings.create(
model="text-embedding-3-small",
input=query
).data[0].embedding
    # Build filter conditions dynamically
    conditions = []
    filter_params = []
    if category:
        conditions.append("category = %s")
        filter_params.append(category)
    if max_price is not None:
        conditions.append("price <= %s")
        filter_params.append(max_price)
    where_clause = ("WHERE " + " AND ".join(conditions)) if conditions else ""
    # Parameters must match placeholder order in the SQL:
    # similarity expression, WHERE filters, ORDER BY, LIMIT
    params = [query_embedding, *filter_params, query_embedding, top_k]
sql = f"""
SELECT name, description, price, category,
1 - (embedding <=> %s::vector) AS similarity
FROM products
{where_clause}
ORDER BY embedding <=> %s::vector
LIMIT %s
"""
conn = psycopg2.connect("postgresql://postgres:secret@localhost:5432/postgres")
with conn.cursor() as cur:
cur.execute(sql, params)
rows = cur.fetchall()
conn.close()
return [
{"name": r[0], "description": r[1], "price": r[2], "category": r[3], "similarity": round(r[4], 4)}
for r in rows
]
# "Show me cycling gear under $600"
results = hybrid_search("something to ride around town", category="cycling", max_price=600)
for r in results:
print(f"{r['similarity']:.4f} {r['name']} (${r['price']})")
Expected output:
0.7634 Budget Road Bike ($549.00)
The E-Bike is excluded by the price <= 600 filter — vector similarity still ranks within the filtered set.
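One subtle point when assembling SQL dynamically: psycopg2 binds parameters positionally, so the params list must follow the order the %s placeholders appear in the final SQL — similarity expression first, then the WHERE filters, then ORDER BY, then LIMIT. A standalone sketch of the clause builder with that ordering made explicit (the function name is illustrative):

```python
def build_hybrid_query(query_embedding, category=None, max_price=None, top_k=5):
    conditions, filter_params = [], []
    if category:
        conditions.append("category = %s")
        filter_params.append(category)
    if max_price is not None:
        conditions.append("price <= %s")
        filter_params.append(max_price)
    where_clause = ("WHERE " + " AND ".join(conditions)) if conditions else ""
    sql = f"""
        SELECT name, 1 - (embedding <=> %s::vector) AS similarity
        FROM products
        {where_clause}
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    # Positional params mirror placeholder order in the SQL above
    params = [query_embedding, *filter_params, query_embedding, top_k]
    return sql, params

sql, params = build_hybrid_query([0.1, 0.2], category="cycling", max_price=600)
print(params)  # [[0.1, 0.2], 'cycling', 600, [0.1, 0.2], 5]
```

Keeping the filter values in their own list until the end makes it hard to get the ordering wrong as you add more optional filters.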
Verification
Check your index is actually being used:
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM products
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;
Look for Index Scan using products_embedding_idx in the output. If you see Seq Scan, the index isn't being hit — usually because the table is too small (PostgreSQL skips indexes on tiny tables) or because the ORDER BY operator doesn't match the index's operator class (e.g. <-> against an index built with vector_cosine_ops).
Check vector column sizes and row counts:
SELECT
COUNT(*) AS total_rows,
COUNT(embedding) AS rows_with_embeddings,
COUNT(*) - COUNT(embedding) AS missing_embeddings
FROM products;
You should see: missing_embeddings = 0 before going to production.
Production Considerations
Batch your embedding calls. OpenAI's API accepts up to 2048 inputs per request. Embedding one row at a time is slow and expensive.
# ✅ Batch embed — 100x faster than one-by-one
texts = [row["description"] for row in rows]
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
embeddings = [item.embedding for item in response.data]
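For datasets larger than the 2048-input cap, you need to chunk before calling the API. A minimal sketch of the splitting logic (the helper name is illustrative; each yielded batch would go into one embeddings.create call):

```python
def batched(texts: list[str], batch_size: int = 2048):
    # OpenAI caps embedding requests at 2048 inputs per call;
    # yield successive slices no larger than that.
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# Illustrative: 5000 descriptions become three API calls, not 5000
all_texts = [f"description {i}" for i in range(5000)]
print([len(b) for b in list(batched(all_texts))])  # [2048, 2048, 904]
```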
Use pgvector with connection pooling. PgBouncer works fine; with transaction-mode pooling, set hnsw.ef_search via SET LOCAL inside the transaction rather than a session-level SET, since session state isn't preserved between transactions.
Keep embedding model consistent. If you switch from text-embedding-3-small to text-embedding-3-large, re-embed everything. Mixing dimensions or models in the same column produces garbage similarity scores.
Index maintenance. HNSW indexes update automatically on INSERT/UPDATE. For bulk loads (100k+ rows), insert first, then create the index — it builds much faster on existing data than incrementally.
Monitor index bloat for high-write tables:
SELECT pg_size_pretty(pg_relation_size('products_embedding_idx')) AS index_size;
What You Learned
- pgvector adds a native vector type to PostgreSQL — no new database required
- The <=> operator computes cosine distance; ORDER BY embedding <=> query_vec LIMIT k is your core query pattern
- HNSW indexes make similarity queries fast at scale; create them after bulk inserts
- Hybrid search (vector + SQL filters) is pgvector's biggest advantage over standalone vector databases — it's a single query, not two round trips
Limitation: pgvector's ANN search trades a small amount of recall accuracy for speed. For use cases requiring 100% recall (legal, medical), run exact search (SET enable_indexscan = off) on a partitioned subset rather than the full table.
Tested on pgvector 0.8.0, PostgreSQL 16.3, Python 3.12, openai SDK 1.30, Ubuntu 24.04 and macOS 14