## Problem: Keyword Search Misses What Users Actually Mean
Your users search "fast car" and get zero results because your database stores "high-performance vehicle." Keyword search fails on synonyms, paraphrases, and intent. Semantic search fixes this — but wiring up the Gemini 2.0 Embeddings API, a vector store, and a query pipeline takes real effort to get right.
You'll learn:

- How to generate text embeddings with Gemini 2.0's `text-embedding-004` model
- How to store and index embeddings in pgvector (PostgreSQL)
- How to query by semantic similarity with cosine distance

Time: 25 min | Difficulty: Intermediate
## Why Gemini 2.0 Embeddings Over Alternatives
Gemini's text-embedding-004 outputs 768-dimensional vectors and supports a task_type parameter that lets you optimize embeddings for retrieval, classification, or clustering — without retraining. OpenAI's text-embedding-3-small lacks this. That task hint meaningfully improves recall in retrieval tasks.
Model comparison:
| Model | Dimensions | Task type support | Price (per 1M tokens) |
|---|---|---|---|
| text-embedding-004 | 768 | ✅ Yes | $0.000025 |
| text-embedding-3-small | 1536 | ❌ No | $0.020 |
| nomic-embed-text (local) | 768 | ❌ No | Free |
For production retrieval at scale, text-embedding-004 is the clear cost leader.
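To make the cost gap concrete, here's a quick back-of-envelope comparison using the per-1M-token prices listed in the table above (a sketch only; verify current pricing before budgeting):

```python
def embedding_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD to embed `tokens` tokens at a given per-1M-token price."""
    return tokens / 1_000_000 * price_per_million


# Embedding a 100M-token corpus at the table's listed prices:
corpus_tokens = 100_000_000
gemini = embedding_cost(corpus_tokens, 0.000025)  # text-embedding-004
openai = embedding_cost(corpus_tokens, 0.020)     # text-embedding-3-small

print(f"text-embedding-004:     ${gemini:.4f}")
print(f"text-embedding-3-small: ${openai:.2f}")
```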
## Architecture Overview

```
Your text
    │
    ▼
Gemini Embeddings API ──▶ 768-dim float vector
    │
    ▼
pgvector (PostgreSQL) ──▶ HNSW index on vector column
    │
    ▼
cosine similarity query ──▶ top-K semantically similar results
```
## Solution

### Step 1: Install Dependencies

```shell
# Using uv (recommended) or pip
uv add google-generativeai psycopg2-binary pgvector python-dotenv

# Verify Gemini SDK
python -c "import google.generativeai as genai; print(genai.__version__)"
```

Expected: 0.8.x or higher
### Step 2: Set Up pgvector in PostgreSQL

```shell
# With Docker — fastest way to get pgvector locally
docker run -d \
  --name pgvector-db \
  -e POSTGRES_PASSWORD=secret \
  -e POSTGRES_DB=semantic_search \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```

Connect and enable the extension:

```sql
-- Run once per database
CREATE EXTENSION IF NOT EXISTS vector;

-- Gemini text-embedding-004 outputs 768 dimensions
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB,
    embedding vector(768)
);

-- HNSW index: faster queries than IVFFlat, no training step needed
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```
Why HNSW over IVFFlat: IVFFlat requires a training phase (you must INSERT data before building the index). HNSW builds incrementally — better for a pipeline where documents arrive continuously.
### Step 3: Generate Embeddings with Gemini 2.0

```python
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()
genai.configure(api_key=os.environ["GEMINI_API_KEY"])


def embed_for_storage(texts: list[str]) -> list[list[float]]:
    """Embed documents for indexing. task_type=RETRIEVAL_DOCUMENT
    tells Gemini to optimize for being retrieved, not for querying."""
    result = genai.embed_content(
        model="models/text-embedding-004",
        content=texts,
        task_type="RETRIEVAL_DOCUMENT",
    )
    return result["embedding"]


def embed_for_query(query: str) -> list[float]:
    """Embed a search query. task_type=RETRIEVAL_QUERY produces
    a vector compatible with RETRIEVAL_DOCUMENT embeddings."""
    result = genai.embed_content(
        model="models/text-embedding-004",
        content=query,
        task_type="RETRIEVAL_QUERY",
    )
    return result["embedding"]
```
The task_type distinction matters. Documents use RETRIEVAL_DOCUMENT; queries use RETRIEVAL_QUERY. Using the wrong task type for either side measurably reduces retrieval accuracy (Google reports ~5–10% recall drop in their benchmarks).
Available task types:
| task_type | Use for |
|---|---|
| `RETRIEVAL_DOCUMENT` | Chunks you store in the vector DB |
| `RETRIEVAL_QUERY` | User's search input |
| `SEMANTIC_SIMILARITY` | Comparing two pieces of text directly |
| `CLASSIFICATION` | Training classifiers on embeddings |
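For the `SEMANTIC_SIMILARITY` case you compare the two resulting vectors yourself. A minimal pure-Python cosine similarity sketch (shown with toy 3-dim vectors; in practice both inputs would be 768-dim embeddings returned by `embed_content`):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real 768-dim embeddings
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```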
### Step 4: Index Documents

```python
import psycopg2
from psycopg2.extras import Json, execute_values

DB_URL = "postgresql://postgres:secret@localhost:5432/semantic_search"


def index_documents(docs: list[dict]) -> None:
    """
    docs = [{"content": "...", "metadata": {...}}, ...]
    Batch embed then bulk insert — avoids N round-trips to the DB.
    """
    texts = [d["content"] for d in docs]
    # Gemini API allows up to 100 texts per embed_content call
    embeddings = embed_for_storage(texts)
    rows = [
        # Json() adapts the metadata dict for the jsonb column; str(emb)
        # renders the pgvector text format "[0.1, 0.2, ...]" for ::vector
        (d["content"], Json(d.get("metadata", {})), str(emb))
        for d, emb in zip(docs, embeddings)
    ]
    with psycopg2.connect(DB_URL) as conn:
        with conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO documents (content, metadata, embedding) VALUES %s",
                rows,
                template="(%s, %s::jsonb, %s::vector)",
            )
        conn.commit()


# Example usage
sample_docs = [
    {"content": "Electric vehicles reduce urban air pollution significantly.", "metadata": {"source": "env-report"}},
    {"content": "High-performance sports cars accelerate from 0 to 60 mph in under 3 seconds.", "metadata": {"source": "auto-blog"}},
    {"content": "Public transit cuts per-capita carbon emissions compared to private cars.", "metadata": {"source": "transit-study"}},
    {"content": "The latest GPU architecture doubles inference throughput for large language models.", "metadata": {"source": "tech-news"}},
]

index_documents(sample_docs)
print("Indexed", len(sample_docs), "documents")
### Step 5: Query by Semantic Similarity

```python
def semantic_search(query: str, top_k: int = 3) -> list[dict]:
    """Return top_k documents most semantically similar to query."""
    # pgvector accepts the text format "[0.1, 0.2, ...]" for ::vector
    query_vec = str(embed_for_query(query))
    with psycopg2.connect(DB_URL) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT content, metadata,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM documents
                ORDER BY embedding <=> %s::vector  -- cosine distance, ascending
                LIMIT %s
                """,
                (query_vec, query_vec, top_k),
            )
            rows = cur.fetchall()
    return [
        {"content": row[0], "metadata": row[1], "similarity": round(row[2], 4)}
        for row in rows
    ]


# Test with a query that wouldn't match keywords
results = semantic_search("fast car", top_k=2)
for r in results:
    print(f"[{r['similarity']}] {r['content'][:80]}")
```
Expected output:

```
[0.8821] High-performance sports cars accelerate from 0 to 60 mph in under 3 seconds.
[0.6103] Electric vehicles reduce urban air pollution significantly.
```
"Fast car" correctly retrieves "high-performance sports cars" even with zero keyword overlap.
### Step 6: Handle Batching for Large Document Sets

The Gemini API accepts up to 100 texts per call. For larger corpora, batch in chunks:

```python
import time


def index_large_corpus(docs: list[dict], batch_size: int = 100) -> None:
    """Process large document sets without hitting API limits."""
    for i in range(0, len(docs), batch_size):
        batch = docs[i : i + batch_size]
        index_documents(batch)
        # Gemini free tier: 1500 requests/min; paid: much higher
        # Add a small delay only if you're on the free tier
        if i + batch_size < len(docs):
            time.sleep(0.1)
        print(f"Indexed {min(i + batch_size, len(docs))}/{len(docs)} documents")
```
## Verification

Run the full pipeline end-to-end:

```shell
python -c "
from your_module import semantic_search
results = semantic_search('reducing carbon footprint transportation', top_k=3)
for r in results:
    print(r['similarity'], r['content'][:60])
"
```
You should see the transit and EV documents ranked above the GPU article — they're semantically closer to "carbon footprint transportation" even though neither document contains that exact phrase.
Check that the HNSW index is being used:

```sql
EXPLAIN ANALYZE
SELECT content
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 3;
```

Look for `Index Scan using documents_embedding_idx` in the output. If you see `Seq Scan`, the planner decided a sequential scan was cheaper — common when the table holds only a handful of rows. Run `SET enable_seqscan = off;` in the session to confirm the index itself is usable.
## Production Considerations
Chunking strategy: Don't embed entire documents. Split into ~300–500 token chunks with 50-token overlap. Embedding a 10,000-word article produces a single vector that averages out all meaning — retrieval becomes imprecise.
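A minimal sketch of such a chunker (word counts as a rough token proxy here; a production pipeline would use a tokenizer-aware splitter for precise budgets):

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks.

    Words approximate tokens; swap in a real tokenizer for exact limits.
    `overlap` words repeat between adjacent chunks so sentences that
    straddle a boundary stay retrievable from both sides.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text] if words else []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk then gets its own row (and embedding) in the `documents` table.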
Metadata filtering: Add a WHERE metadata->>'source' = 'tech-news' clause before the ORDER BY to filter before ranking. This is cheaper than post-filtering after a top-K retrieval.
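One way to sketch that is a small helper that assembles the filtered query (hypothetical function, not part of any library; the caller still supplies the query vector and LIMIT as parameters):

```python
def build_search_sql(source=None):
    """Build the similarity query, optionally filtered by metadata source.

    Returns (sql, extra_params). The WHERE clause narrows candidates
    before ranking, rather than discarding rows after a top-K retrieval.
    """
    where = ""
    extra_params = []
    if source is not None:
        where = "WHERE metadata->>'source' = %s"
        extra_params.append(source)
    sql = f"""
        SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        {where}
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    return sql, extra_params

# Usage sketch: interleave extra_params with the vector and top_k
sql, extra_params = build_search_sql("tech-news")
```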
Embedding freshness: Embeddings are tied to the model version. If you switch from text-embedding-004 to a future model, re-embed your entire corpus. Store the model name in your documents table to track this.
```sql
-- Add model version tracking to your schema
ALTER TABLE documents ADD COLUMN embedding_model TEXT DEFAULT 'text-embedding-004';
```
Rate limits: The Gemini API free tier allows 1,500 embed requests per minute. Each request can contain up to 100 texts. At 100 docs/request, that's 150,000 documents per minute — sufficient for most batch indexing jobs.
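If you do hit rate-limit errors during bulk indexing, an exponential-backoff wrapper helps. A generic sketch (the specific exception class to catch depends on the SDK version, so this version retries on any exception raised by the callable):

```python
import time


def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on failure.

    Delays grow as base_delay, 2x, 4x, ... `sleep` is injectable
    so the retry logic can be tested without real waiting.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Usage sketch (hypothetical): with_backoff(lambda: embed_for_storage(batch))
```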
## What You Learned

- `task_type` is unique to Gemini embeddings and meaningfully improves retrieval recall
- HNSW indexes in pgvector don't require pre-training, making them better for streaming ingestion than IVFFlat
- The `<=>` operator in pgvector computes cosine distance — subtract from 1 to get similarity
- Batch embedding up to 100 texts per API call to stay within rate limits efficiently

When not to use this approach: For exact lookup (IDs, SKUs, codes), stick with B-tree indexes and `=` queries. Semantic search introduces approximate matching — a similarity score of 0.95 is not a guarantee of correctness.
Tested on google-generativeai 0.8.3, pgvector 0.7.0, PostgreSQL 16, Python 3.12, Ubuntu 24.04