Problem: grep Can't Find Code by Meaning
You remember writing a function that "validates email formats" but grep fails because you named it check_email_syntax(). Traditional search needs exact keywords while you think in concepts.
You'll learn:
- Set up ChromaDB or LanceDB for semantic code search
- Index your codebase with local embeddings (no API costs)
- Query by meaning: "email validation" finds check_email_syntax()
Time: 20 min | Level: Intermediate
Why This Matters
grep searches for text patterns. Vector databases search by semantic meaning using embeddings - mathematical representations of code that capture intent, not just words.
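Embedding search boils down to comparing vector directions. A toy sketch of the idea (the 3-dimensional vectors below are made up for illustration; real models like all-MiniLM-L6-v2 emit 384 dimensions):

```python
# Vectors that point in similar directions (high cosine similarity)
# represent similar meanings - this is all "semantic search" computes.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.2]        # pretend embedding of "email validation"
check_email = [0.8, 0.2, 0.3]  # pretend embedding of check_email_syntax()
parse_json = [0.1, 0.9, 0.1]   # pretend embedding of parse_json()

# The semantically related function scores far higher than the unrelated one
print(round(cosine_similarity(query, check_email), 2))  # → 0.98
print(round(cosine_similarity(query, parse_json), 2))   # → 0.24
```

The vector database does exactly this comparison, just across thousands of stored embeddings with an index to avoid checking every one.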
Real scenarios:
- Finding similar code across microservices
- Discovering duplicate logic with different names
- Onboarding: "show me all authentication code"
- Refactoring: "find error handling patterns"
Local means: No API calls, works offline, free inference, your code stays private.
ChromaDB vs LanceDB: Quick Decision
Use ChromaDB if:
- You want the easiest setup (pure Python)
- Your codebase is under 100k files
- You need quick prototyping
Use LanceDB if:
- You need production-scale performance
- You're indexing 100k+ code snippets
- You want Apache Arrow integration
Both run locally, both are free. This guide covers both.
Solution
Step 1: Install Dependencies
# ChromaDB setup (easier)
pip install chromadb sentence-transformers --break-system-packages
# OR LanceDB setup (faster)
pip install lancedb sentence-transformers tantivy --break-system-packages
Expected: Both install in ~2 minutes. sentence-transformers downloads the ~90MB all-MiniLM-L6-v2 model on first run.
If it fails:
- M1/M2 Mac errors: install Rust first: brew install rust
- Linux missing gcc: sudo apt install build-essential
Step 2: Create the Indexer
For ChromaDB:
# index_codebase.py
import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path

# Local model - no API calls
model = SentenceTransformer('all-MiniLM-L6-v2')  # ~90MB, fast inference

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="codebase",
    metadata={"hnsw:space": "cosine"}  # Cosine similarity for code
)

def index_python_files(root_dir):
    """Walk directory and index all .py files"""
    for filepath in Path(root_dir).rglob("*.py"):
        if "venv" in str(filepath) or "__pycache__" in str(filepath):
            continue  # Skip virtual envs and bytecode caches
        with open(filepath, 'r', encoding='utf-8') as f:
            code = f.read()
        # Split into functions (naive split on 'def '; re-attach the keyword)
        chunks = code.split('\ndef ')
        functions = [chunks[0]] + ['def ' + c for c in chunks[1:]]
        for i, func in enumerate(functions):
            if not func.strip():
                continue
            # Generate embedding locally
            embedding = model.encode(func).tolist()
            collection.add(
                ids=[f"{filepath}_{i}"],
                embeddings=[embedding],
                documents=[func],
                metadatas=[{"file": str(filepath), "index": i}]
            )
    print(f"Indexed {collection.count()} code snippets")

# Index your codebase
index_python_files("./src")
Why this works: all-MiniLM-L6-v2 generates 384-dimensional vectors that capture semantic meaning. ChromaDB stores them in an HNSW index for fast similarity search.
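The 'def ' split is deliberately naive: it misses decorators and handles methods crudely. A sturdier chunker (a stdlib-only sketch, not part of the indexer above) uses the ast module to extract each function with its exact source:

```python
# Extract (name, source) pairs for every function in a module using ast,
# which correctly handles methods, async defs, and nested functions
import ast

def extract_functions(source: str):
    """Return (name, source_code) pairs for every function in a module."""
    funcs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            funcs.append((node.name, ast.get_source_segment(source, node)))
    return funcs

sample = '''
def check_email_syntax(email):
    return "@" in email

class Mailer:
    def send(self, to):
        pass
'''
print([name for name, _ in extract_functions(sample)])  # → ['check_email_syntax', 'send']
```

Swapping this in for the string split gives cleaner chunks, which in turn gives cleaner embeddings.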
For LanceDB:
# index_codebase_lance.py
import lancedb
from sentence_transformers import SentenceTransformer
from pathlib import Path

model = SentenceTransformer('all-MiniLM-L6-v2')

# LanceDB uses Apache Arrow tables under the hood
db = lancedb.connect("./lance_db")

def index_python_files(root_dir):
    data = []
    for filepath in Path(root_dir).rglob("*.py"):
        if "venv" in str(filepath) or "__pycache__" in str(filepath):
            continue
        with open(filepath, 'r', encoding='utf-8') as f:
            code = f.read()
        chunks = code.split('\ndef ')
        functions = [chunks[0]] + ['def ' + c for c in chunks[1:]]
        for i, func in enumerate(functions):
            if not func.strip():
                continue
            data.append({
                "id": f"{filepath}_{i}",
                "code": func[:2000],  # Truncate long functions
                "file": str(filepath),
                "vector": model.encode(func).tolist()
            })
    # Create table (schema is inferred from the first row)
    table = db.create_table("codebase", data=data, mode="overwrite")
    print(f"Indexed {len(data)} code snippets")

index_python_files("./src")
Why Lance is faster: it uses a columnar on-disk format (Lance, an evolution of Parquet) and native vector indices, which suits large codebases.
Step 3: Search Your Code
ChromaDB search:
# search_code.py
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("codebase")

def search(query, top_k=5):
    """Find code snippets by semantic meaning"""
    query_embedding = model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    for i, (doc, meta, dist) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    )):
        print(f"\n{i+1}. {meta['file']} (similarity: {1-dist:.2f})")
        print(doc[:200] + "...")  # Show first 200 chars

# Try it
search("email validation logic")
search("database connection pooling")
search("error handling for HTTP requests")
Expected output:
1. src/utils/validators.py (similarity: 0.84)
   def check_email_syntax(email_str):
       """Validates email format using regex...

2. src/auth/email.py (similarity: 0.78)
   def verify_email_format(address):
       pattern = r'^[\w\.-]+@[\w\.-]+...
LanceDB search:
# search_code_lance.py
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
db = lancedb.connect("./lance_db")
table = db.open_table("codebase")

def search(query, top_k=5):
    query_vector = model.encode(query).tolist()
    # LanceDB returns a pandas DataFrame
    results = table.search(query_vector).limit(top_k).to_pandas()
    for _, row in results.iterrows():
        print(f"\n{row['file']} (distance: {row['_distance']:.3f})")
        print(row['code'][:200] + "...")

search("email validation logic")
Step 4: Advanced Filtering
Filter by file path or metadata:
# ChromaDB: metadata filters are exact-match ($eq, $in, ...), so use a
# document filter for substring matching
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,
    where_document={"$contains": "auth"}  # Only snippets mentioning "auth"
)

# LanceDB supports SQL-like filters on any column, including file paths
results = table.search(query_vector) \
    .where("file LIKE '%auth%'") \
    .limit(5) \
    .to_pandas()
Why this matters: Narrow search to specific modules when you know the context.
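When a store lacks the exact filter you need, you can also over-fetch and post-filter in Python. A sketch with made-up result rows standing in for what a search would return:

```python
# Over-fetch from the vector DB, then keep only hits from the folder you care
# about. The `results` rows here are fabricated for illustration.
results = [
    {"file": "src/auth/login.py", "code": "def login(user): ...", "score": 0.91},
    {"file": "src/utils/strings.py", "code": "def slug(s): ...", "score": 0.88},
    {"file": "src/auth/tokens.py", "code": "def refresh(tok): ...", "score": 0.80},
]

auth_only = [r for r in results if "auth" in r["file"]]
print([r["file"] for r in auth_only])  # → ['src/auth/login.py', 'src/auth/tokens.py']
```

Post-filtering is slower than filtering inside the database, but it works identically with either backend.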
Step 5: Update Index Efficiently
# update_index.py - re-embed only files whose contents changed
import hashlib
import json
import os
from pathlib import Path

def hash_file(filepath):
    with open(filepath, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

# Load stored hashes from the previous run (if any)
hashes = {}
if os.path.exists("file_hashes.json"):
    with open("file_hashes.json") as f:
        hashes = json.load(f)

for filepath in Path("./src").rglob("*.py"):
    current_hash = hash_file(filepath)
    if hashes.get(str(filepath)) != current_hash:
        # Re-index only this file (delete its old entries, then add new ones)
        reindex_file(filepath)  # your wrapper around the Step 2 indexing logic
        hashes[str(filepath)] = current_hash

with open("file_hashes.json", "w") as f:
    json.dump(hashes, f)
Why incremental updates: Avoids re-embedding your entire codebase on every change.
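The change-detection idea can be demonstrated self-contained against a temporary directory instead of a real codebase (file names below are made up):

```python
# Demo: only files whose bytes changed since the last run are flagged
# for re-indexing. Runs anywhere - no real codebase required.
import hashlib
import tempfile
from pathlib import Path

def hash_file(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "app.py"
    f.write_text("def a(): pass")
    hashes = {str(f): hash_file(f)}   # snapshot after initial indexing

    f.write_text("def a(): return 1")  # simulate an edit

    changed = [p for p in Path(d).rglob("*.py")
               if hashes.get(str(p)) != hash_file(p)]
    print(len(changed))  # → 1
```

Only the edited file lands in `changed`; untouched files hash identically and are skipped.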
Verification
Test semantic search:
python search_code.py
You should see: Functions ranked by semantic similarity, not keyword matching. "Email validation" finds check_email_syntax() even though they share no words.
Performance check:
- ChromaDB: ~10ms query latency for 10k snippets
- LanceDB: ~3ms query latency for 100k snippets
What You Learned
- Vector DBs enable semantic search (meaning, not keywords)
- sentence-transformers runs locally, no API costs
- ChromaDB = easy setup, LanceDB = production scale
- Incremental indexing keeps it fast as codebase grows
Limitations:
- Embeddings miss exact identifier searches (still use grep for def myFunction)
- Model context: all-MiniLM-L6-v2 truncates input at 256 tokens by default, so split long functions
- Initial indexing takes time (100k files can take ~30 minutes)
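One way to handle functions longer than the model's context is to embed overlapping word windows instead of whole functions. A sketch with illustrative window and overlap sizes (tune both for your model's token limit):

```python
# Split long text into overlapping word windows so no chunk exceeds the
# embedding model's context. Window/overlap sizes here are illustrative.
def window_chunks(text, size=200, overlap=40):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# A fake 500-word function body
long_func = " ".join(f"tok{i}" for i in range(500))
chunks = window_chunks(long_func)
print(len(chunks))  # → 3
```

Each chunk gets its own embedding and ID; at query time, a hit on any chunk points you back to the parent function via metadata.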
When NOT to use this:
- Small projects (<50 files) - grep is faster
- Searching for specific variable names - use language servers
- Real-time search during typing - this is batch indexing
Production Tips
1. Hybrid search (combine vector + keyword):
# ChromaDB: combine semantic + text search
results = collection.query(
    query_embeddings=[query_embedding],
    where_document={"$contains": "validate"}  # Must contain keyword
)
2. Better embeddings for code:
# Code-trained model (larger, slower; CodeBERT was not trained for sentence
# similarity, so SentenceTransformer adds mean pooling automatically)
model = SentenceTransformer('microsoft/codebert-base')
3. Client settings:
# ChromaDB: disable telemetry, allow programmatic resets
from chromadb.config import Settings

client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(
        anonymized_telemetry=False,
        allow_reset=True
    )
)
4. Monitor index size:
# ChromaDB stores data in SQLite plus HNSW index files
du -sh ./chroma_db  # Expect ~1MB per 1000 snippets
# LanceDB stores data in its columnar Lance format
du -sh ./lance_db   # More compact, ~0.5MB per 1000 snippets
Quick Reference
Installation:
# ChromaDB (easiest)
pip install chromadb sentence-transformers
# LanceDB (fastest)
pip install lancedb sentence-transformers
Index code:
# ChromaDB: 3 lines
client = chromadb.PersistentClient(path="./db")
collection = client.create_collection("code")
collection.add(ids=[...], embeddings=[...], documents=[...])
# LanceDB: 2 lines
db = lancedb.connect("./db")
table = db.create_table("code", data=[{...}])
Search:
# ChromaDB
results = collection.query(query_embeddings=[...], n_results=5)
# LanceDB
results = table.search(vector).limit(5).to_pandas()
Tested with ChromaDB 0.5.7, LanceDB 0.14.0, sentence-transformers 3.3.1, Python 3.11+, macOS & Ubuntu