Problem: grep Can't Find Code by Meaning
You remember writing a function that "validates email formats" but grep fails because you named it check_email_syntax(). Traditional search needs exact keywords while you think in concepts.
You'll learn:
- Set up ChromaDB or LanceDB for semantic code search
- Index your codebase with local embeddings (no API costs)
- Query by meaning: "email validation" finds check_email_syntax()
Time: 20 min | Level: Intermediate
Why This Matters
grep searches for text patterns. Vector databases search by semantic meaning using embeddings - mathematical representations of code that capture intent, not just words.
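Embedding search boils down to comparing vector directions. A toy sketch of the idea (the 3-dimensional vectors below are made up for illustration; real models like all-MiniLM-L6-v2 emit 384 dimensions):

```python
# Vectors that point in similar directions (high cosine similarity)
# represent similar meanings - this is all "semantic search" computes.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.2]        # pretend embedding of "email validation"
check_email = [0.8, 0.2, 0.3]  # pretend embedding of check_email_syntax()
parse_json = [0.1, 0.9, 0.1]   # pretend embedding of parse_json()

# The semantically related function scores far higher than the unrelated one
print(round(cosine_similarity(query, check_email), 2))  # → 0.98
print(round(cosine_similarity(query, parse_json), 2))   # → 0.24
```

The vector database does exactly this comparison, just across thousands of stored embeddings with an index to avoid checking every one.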
Real scenarios:
- Finding similar code across microservices
- Discovering duplicate logic with different names
- Onboarding: "show me all authentication code"
- Refactoring: "find error handling patterns"
Local means: No API calls, works offline, free inference, your code stays private.
ChromaDB vs LanceDB: Quick Decision
Use ChromaDB if:
- You want the easiest setup (pure Python)
- Your codebase is under 100k files
- You need quick prototyping
Use LanceDB if:
- You need production-scale performance
- You're indexing 100k+ code snippets
- You want Apache Arrow integration
Both run locally, both are free. This guide covers both.
Solution
Step 1: Install Dependencies
# ChromaDB setup (easier)
pip install chromadb sentence-transformers --break-system-packages
# OR LanceDB setup (faster)
pip install lancedb sentence-transformers tantivy --break-system-packages
Expected: Both install in ~2 minutes. sentence-transformers downloads the ~90MB all-MiniLM-L6-v2 model on first run.
If it fails:
- M1/M2 Mac errors: install Rust first: brew install rust
- Linux missing gcc: sudo apt install build-essential
Step 2: Create the Indexer
For ChromaDB:
# index_codebase.py
import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path

# Local model - no API calls
model = SentenceTransformer('all-MiniLM-L6-v2')  # ~90MB, fast inference

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="codebase",
    metadata={"hnsw:space": "cosine"}  # Cosine similarity for code
)

def index_python_files(root_dir):
    """Walk directory and index all .py files"""
    for filepath in Path(root_dir).rglob("*.py"):
        if "venv" in str(filepath) or "__pycache__" in str(filepath):
            continue  # Skip virtual envs and bytecode caches
        with open(filepath, 'r', encoding='utf-8') as f:
            code = f.read()
        # Split into functions (naive split on 'def '; re-attach the keyword)
        chunks = code.split('\ndef ')
        functions = [chunks[0]] + ['def ' + c for c in chunks[1:]]
        for i, func in enumerate(functions):
            if not func.strip():
                continue
            # Generate embedding locally
            embedding = model.encode(func).tolist()
            collection.add(
                ids=[f"{filepath}_{i}"],
                embeddings=[embedding],
                documents=[func],
                metadatas=[{"file": str(filepath), "index": i}]
            )
    print(f"Indexed {collection.count()} code snippets")

# Index your codebase
index_python_files("./src")
Why this works: all-MiniLM-L6-v2 generates 384-dimensional vectors that capture semantic meaning. ChromaDB stores them in an HNSW index for fast similarity search.
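The 'def ' split is deliberately naive: it misses decorators and handles methods crudely. A sturdier chunker (a stdlib-only sketch, not part of the indexer above) uses the ast module to extract each function with its exact source:

```python
# Extract (name, source) pairs for every function in a module using ast,
# which correctly handles methods, async defs, and nested functions
import ast

def extract_functions(source: str):
    """Return (name, source_code) pairs for every function in a module."""
    funcs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            funcs.append((node.name, ast.get_source_segment(source, node)))
    return funcs

sample = '''
def check_email_syntax(email):
    return "@" in email

class Mailer:
    def send(self, to):
        pass
'''
print([name for name, _ in extract_functions(sample)])  # → ['check_email_syntax', 'send']
```

Swapping this in for the string split gives cleaner chunks, which in turn gives cleaner embeddings.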
For LanceDB:
# index_codebase_lance.py
import lancedb
from sentence_transformers import SentenceTransformer
from pathlib import Path

model = SentenceTransformer('all-MiniLM-L6-v2')

# LanceDB uses Apache Arrow tables under the hood
db = lancedb.connect("./lance_db")

def index_python_files(root_dir):
    data = []
    for filepath in Path(root_dir).rglob("*.py"):
        if "venv" in str(filepath) or "__pycache__" in str(filepath):
            continue
        with open(filepath, 'r', encoding='utf-8') as f:
            code = f.read()
        chunks = code.split('\ndef ')
        functions = [chunks[0]] + ['def ' + c for c in chunks[1:]]
        for i, func in enumerate(functions):
            if not func.strip():
                continue
            data.append({
                "id": f"{filepath}_{i}",
                "code": func[:2000],  # Truncate long functions
                "file": str(filepath),
                "vector": model.encode(func).tolist()
            })
    # Create table (schema is inferred from the first row)
    table = db.create_table("codebase", data=data, mode="overwrite")
    print(f"Indexed {len(data)} code snippets")

index_python_files("./src")
Why Lance is faster: it uses a columnar on-disk format (Lance, an evolution of Parquet) and native vector indices, which suits large codebases.
Step 3: Search Your Code
ChromaDB search:
# search_code.py
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("codebase")

def search(query, top_k=5):
    """Find code snippets by semantic meaning"""
    query_embedding = model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    for i, (doc, meta, dist) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    )):
        print(f"\n{i+1}. {meta['file']} (similarity: {1-dist:.2f})")
        print(doc[:200] + "...")  # Show first 200 chars

# Try it
search("email validation logic")
search("database connection pooling")
search("error handling for HTTP requests")
Expected output:
1. src/utils/validators.py (similarity: 0.84)
   def check_email_syntax(email_str):
       """Validates email format using regex...

2. src/auth/email.py (similarity: 0.78)
   def verify_email_format(address):
       pattern = r'^[\w\.-]+@[\w\.-]+...
LanceDB search:
# search_code_lance.py
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
db = lancedb.connect("./lance_db")
table = db.open_table("codebase")

def search(query, top_k=5):
    query_vector = model.encode(query).tolist()
    # LanceDB returns a pandas DataFrame
    results = table.search(query_vector).limit(top_k).to_pandas()
    for _, row in results.iterrows():
        print(f"\n{row['file']} (distance: {row['_distance']:.3f})")
        print(row['code'][:200] + "...")

search("email validation logic")
Step 4: Advanced Filtering
Filter by file path or metadata:
# ChromaDB: metadata filters are exact-match ($eq, $in, ...), so use a
# document filter for substring matching
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,
    where_document={"$contains": "auth"}  # Only snippets mentioning "auth"
)

# LanceDB supports SQL-like filters on any column, including file paths
results = table.search(query_vector) \
    .where("file LIKE '%auth%'") \
    .limit(5) \
    .to_pandas()
Why this matters: Narrow search to specific modules when you know the context.
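When a store lacks the exact filter you need, you can also over-fetch and post-filter in Python. A sketch with made-up result rows standing in for what a search would return:

```python
# Over-fetch from the vector DB, then keep only hits from the folder you care
# about. The `results` rows here are fabricated for illustration.
results = [
    {"file": "src/auth/login.py", "code": "def login(user): ...", "score": 0.91},
    {"file": "src/utils/strings.py", "code": "def slug(s): ...", "score": 0.88},
    {"file": "src/auth/tokens.py", "code": "def refresh(tok): ...", "score": 0.80},
]

auth_only = [r for r in results if "auth" in r["file"]]
print([r["file"] for r in auth_only])  # → ['src/auth/login.py', 'src/auth/tokens.py']
```

Post-filtering is slower than filtering inside the database, but it works identically with either backend.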
Step 5: Update Index Efficiently
# update_index.py - re-embed only files whose contents changed
import hashlib
import json
import os
from pathlib import Path

def hash_file(filepath):
    with open(filepath, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

# Load stored hashes from the previous run (if any)
hashes = {}
if os.path.exists("file_hashes.json"):
    with open("file_hashes.json") as f:
        hashes = json.load(f)

for filepath in Path("./src").rglob("*.py"):
    current_hash = hash_file(filepath)
    if hashes.get(str(filepath)) != current_hash:
        # Re-index only this file (delete its old entries, then add new ones)
        reindex_file(filepath)  # your wrapper around the Step 2 indexing logic
        hashes[str(filepath)] = current_hash

with open("file_hashes.json", "w") as f:
    json.dump(hashes, f)
Why incremental updates: Avoids re-embedding your entire codebase on every change.
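The change-detection idea can be demonstrated self-contained against a temporary directory instead of a real codebase (file names below are made up):

```python
# Demo: only files whose bytes changed since the last run are flagged
# for re-indexing. Runs anywhere - no real codebase required.
import hashlib
import tempfile
from pathlib import Path

def hash_file(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "app.py"
    f.write_text("def a(): pass")
    hashes = {str(f): hash_file(f)}   # snapshot after initial indexing

    f.write_text("def a(): return 1")  # simulate an edit

    changed = [p for p in Path(d).rglob("*.py")
               if hashes.get(str(p)) != hash_file(p)]
    print(len(changed))  # → 1
```

Only the edited file lands in `changed`; untouched files hash identically and are skipped.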
Verification
Test semantic search:
python search_code.py
You should see: Functions ranked by semantic similarity, not keyword matching. "Email validation" finds check_email_syntax() even though they share no words.
Performance check:
- ChromaDB: ~10ms query latency for 10k snippets
- LanceDB: ~3ms query latency for 100k snippets
What You Learned
- Vector DBs enable semantic search (meaning, not keywords)
- sentence-transformers runs locally, no API costs
- ChromaDB = easy setup, LanceDB = production scale
- Incremental indexing keeps it fast as codebase grows
Limitations:
- Embeddings miss exact identifier searches (still use grep for def myFunction)
- Model context: all-MiniLM-L6-v2 truncates input at 256 tokens by default, so split long functions
- Initial indexing takes time (100k files can take ~30 minutes)
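One way to handle functions longer than the model's context is to embed overlapping word windows instead of whole functions. A sketch with illustrative window and overlap sizes (tune both for your model's token limit):

```python
# Split long text into overlapping word windows so no chunk exceeds the
# embedding model's context. Window/overlap sizes here are illustrative.
def window_chunks(text, size=200, overlap=40):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# A fake 500-word function body
long_func = " ".join(f"tok{i}" for i in range(500))
chunks = window_chunks(long_func)
print(len(chunks))  # → 3
```

Each chunk gets its own embedding and ID; at query time, a hit on any chunk points you back to the parent function via metadata.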
When NOT to use this:
- Small projects (<50 files) - grep is faster
- Searching for specific variable names - use language servers
- Real-time search during typing - this is batch indexing
Production Tips
1. Hybrid search (combine vector + keyword):
# ChromaDB: combine semantic + text search
results = collection.query(
    query_embeddings=[query_embedding],
    where_document={"$contains": "validate"}  # Must contain keyword
)
2. Better embeddings for code:
# Code-trained model (larger, slower; CodeBERT was not trained for sentence
# similarity, so SentenceTransformer adds mean pooling automatically)
model = SentenceTransformer('microsoft/codebert-base')
3. Client settings:
# ChromaDB: disable telemetry, allow programmatic resets
from chromadb.config import Settings

client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(
        anonymized_telemetry=False,
        allow_reset=True
    )
)
4. Monitor index size:
# ChromaDB stores data in SQLite plus HNSW index files
du -sh ./chroma_db  # Expect ~1MB per 1000 snippets
# LanceDB stores data in its columnar Lance format
du -sh ./lance_db   # More compact, ~0.5MB per 1000 snippets
Quick Reference
Installation:
# ChromaDB (easiest)
pip install chromadb sentence-transformers
# LanceDB (fastest)
pip install lancedb sentence-transformers
Index code:
# ChromaDB: 3 lines
client = chromadb.PersistentClient(path="./db")
collection = client.create_collection("code")
collection.add(ids=[...], embeddings=[...], documents=[...])
# LanceDB: 2 lines
db = lancedb.connect("./db")
table = db.create_table("code", data=[{...}])
Search:
# ChromaDB
results = collection.query(query_embeddings=[...], n_results=5)
# LanceDB
results = table.search(vector).limit(5).to_pandas()
Tested with ChromaDB 0.5.7, LanceDB 0.14.0, sentence-transformers 3.3.1, Python 3.11+, macOS & Ubuntu