Problem: Your Vector Database Bill Is Out of Control
You're storing millions of embeddings and the cloud bill is climbing fast. A single 1536-dimensional float32 vector takes 6 KB (1536 dims × 4 bytes). At 10 million vectors, that's roughly 60 GB — and most managed vector DBs charge $0.10–$0.30 per GB per month.
You'll learn:
- How to apply scalar, binary, and product quantization to shrink vectors
- Which technique works best for your accuracy vs. cost tradeoff
- How to implement compression in Python with real benchmarks
Time: 20 min | Level: Intermediate
Why This Happens
Embedding models output high-precision float32 values by default. That precision is rarely necessary for approximate nearest neighbor (ANN) search — yet you're paying full price for it.
The three main culprits are oversized embeddings, uncompressed storage formats, and no tiering strategy for cold vectors.
Common symptoms:
- Cloud vector DB costs growing faster than your user base
- Storage costs exceeding compute costs for your search service
- Retrieval accuracy is fine but you're paying for precision you don't use
Solution
Step 1: Benchmark Your Baseline
Before compressing anything, measure what you're working with.
import numpy as np
import time
# Simulate 100k vectors at 1536 dims (OpenAI text-embedding-3-small)
vectors = np.random.randn(100_000, 1536).astype(np.float32)
print(f"Storage: {vectors.nbytes / 1e9:.2f} GB")
# Storage: 0.61 GB
# Cosine similarity baseline
query = np.random.randn(1536).astype(np.float32)
start = time.perf_counter()
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
elapsed = time.perf_counter() - start
print(f"Search time: {elapsed * 1000:.1f}ms")
Expected: You'll see your raw storage and search latency. Write these down — they're your baseline.
Step 2: Apply Scalar Quantization (Best Starting Point)
Scalar quantization (SQ8) converts float32 to int8, cutting storage by 4x with minimal accuracy loss. This is the safest first move.
def scalar_quantize(vectors: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""Convert float32 vectors to int8 with stored scale factors."""
vmin = vectors.min(axis=0)
vmax = vectors.max(axis=0)
# Scale to [-128, 127] range
scale = (vmax - vmin) / 255.0
scale = np.where(scale == 0, 1.0, scale) # Avoid division by zero
quantized = np.clip(
np.round((vectors - vmin) / scale - 128),
-128, 127
).astype(np.int8)
return quantized, vmin, scale
def scalar_dequantize(quantized: np.ndarray, vmin: np.ndarray, scale: np.ndarray) -> np.ndarray:
"""Reconstruct float32 from int8 for high-precision reranking."""
return (quantized.astype(np.float32) + 128) * scale + vmin
# Apply it
sq_vectors, vmin, scale = scalar_quantize(vectors)
print(f"Compressed storage: {sq_vectors.nbytes / 1e9:.2f} GB")
# Compressed storage: 0.15 GB — 4x smaller
# Measure accuracy loss
sample_idx = np.random.choice(len(vectors), 1000)
reconstructed = scalar_dequantize(sq_vectors[sample_idx], vmin, scale)
mse = np.mean((vectors[sample_idx] - reconstructed) ** 2)
print(f"Mean squared error: {mse:.6f}")
Expected: ~4x storage reduction, MSE below 0.001 for typical embeddings.
If it fails:
- High MSE (> 0.01): Your vectors may have outliers. Clip at the 1st/99th percentile before quantizing (see the sketch below).
- Scale values of zero: This happens when a dimension is constant across all vectors; the zero-division guard shown above handles it.
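A minimal sketch of that clip, assuming per-dimension percentile clipping suits your embeddings — the 1/99 cutoffs are a starting point, not a tuned value:
# Clip per-dimension outliers so a few extreme values don't stretch the int8 range
lo = np.percentile(vectors, 1, axis=0)
hi = np.percentile(vectors, 99, axis=0)
sq_vectors, vmin, scale = scalar_quantize(np.clip(vectors, lo, hi))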
Step 3: Try Binary Quantization for Aggressive Compression (32x)
Binary quantization (BQ) converts each dimension to a single bit. 32x smaller, but needs reranking to recover accuracy.
def binary_quantize(vectors: np.ndarray) -> np.ndarray:
"""Pack float32 vectors into binary (sign bit only)."""
# Positive = 1, negative = 0
return np.packbits(vectors > 0, axis=1)
def hamming_search(binary_db: np.ndarray, binary_query: np.ndarray, top_k: int = 10) -> np.ndarray:
"""Fast Hamming distance search on binary vectors."""
# XOR then popcount gives Hamming distance
distances = np.unpackbits(
np.bitwise_xor(binary_db, binary_query), axis=1
).sum(axis=1)
return np.argsort(distances)[:top_k]
bq_vectors = binary_quantize(vectors)
print(f"Binary storage: {bq_vectors.nbytes / 1e9:.3f} GB")
# Binary storage: 0.019 GB — 32x smaller
bq_query = binary_quantize(query.reshape(1, -1))
candidates = hamming_search(bq_vectors, bq_query, top_k=100)
# Rerank top-100 candidates with original float32 for accuracy
The catch: Binary alone loses recall. Always use a two-stage approach — binary search for candidates, float32 rerank for the final top-K.
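Here's a minimal sketch of that two-stage flow, reusing the helpers above; two_stage_search is a hypothetical name, and the 100-candidate pool is an assumption to tune against your recall target:
def two_stage_search(q: np.ndarray, top_k: int = 10, pool: int = 100) -> np.ndarray:
    """Stage 1: Hamming shortlist. Stage 2: exact float32 rerank of the shortlist."""
    shortlist = hamming_search(bq_vectors, binary_quantize(q.reshape(1, -1)), top_k=pool)
    exact = vectors[shortlist] @ q  # full-precision scores for 100 candidates, not 100k
    return shortlist[np.argsort(exact)[-top_k:]]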
Step 4: Use Product Quantization for the Sweet Spot
Product quantization (PQ) splits each vector into sub-vectors and quantizes them independently. You control the compression ratio vs. accuracy tradeoff.
from sklearn.cluster import MiniBatchKMeans
def train_pq(
vectors: np.ndarray,
n_subvectors: int = 32, # More = better accuracy, more storage
n_clusters: int = 256 # 256 clusters = 1 byte per sub-vector
) -> tuple[list, int]:
"""Train product quantization codebooks."""
    dim = vectors.shape[1]
    assert dim % n_subvectors == 0, "n_subvectors must divide the dimension evenly"
    sub_dim = dim // n_subvectors
codebooks = []
for i in range(n_subvectors):
sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, n_init=3)
kmeans.fit(sub)
codebooks.append(kmeans)
return codebooks, sub_dim
def pq_encode(vectors: np.ndarray, codebooks: list, sub_dim: int) -> np.ndarray:
"""Encode vectors using trained codebooks."""
codes = np.zeros((len(vectors), len(codebooks)), dtype=np.uint8)
for i, cb in enumerate(codebooks):
sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
codes[:, i] = cb.predict(sub)
return codes
# Train on a representative sample
print("Training PQ codebooks...")
codebooks, sub_dim = train_pq(vectors[:10_000]) # Train on subset
pq_codes = pq_encode(vectors, codebooks, sub_dim)
print(f"PQ storage: {pq_codes.nbytes / 1e9:.3f} GB")
# PQ storage: 0.003 GB — 192x smaller than float32
Expected: 192x compression with these settings — 32 one-byte codes per vector versus 6,144 bytes of float32. Raise n_subvectors for better recall at a lower ratio. Recall beats plain binary, at the cost of slower encode time.
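Searching PQ codes doesn't require reconstructing vectors. Below is a sketch of the standard lookup-table trick (asymmetric distance computation): precompute dot products between the query's sub-vectors and all 256 centroids, then score every stored vector with table lookups. Libraries like faiss implement this far more efficiently; pq_search here is just to show the idea:
def pq_search(q: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Approximate dot-product search via per-subvector centroid lookup tables."""
    scores = np.zeros(len(pq_codes), dtype=np.float32)
    for i, cb in enumerate(codebooks):
        sub_q = q[i * sub_dim:(i + 1) * sub_dim]
        table = cb.cluster_centers_ @ sub_q  # (256,) score of each centroid vs. this query slice
        scores += table[pq_codes[:, i]]      # one table lookup per stored vector
    return np.argsort(scores)[-top_k:]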
Verification
Run a recall benchmark to confirm compression didn't destroy search quality.
def recall_at_k(original: np.ndarray, compressed_fn, k: int = 10, n_queries: int = 100) -> float:
"""Measure what fraction of true top-K are found after compression."""
hits = 0
for _ in range(n_queries):
q = np.random.randn(original.shape[1]).astype(np.float32)
# Ground truth from float32
true_scores = original @ q
true_top = set(np.argsort(true_scores)[-k:])
# Compressed results
approx_top = set(compressed_fn(q))
hits += len(true_top & approx_top)
return hits / (n_queries * k)
# Test scalar quantization recall
sq_approx = scalar_dequantize(sq_vectors, vmin, scale)  # dequantize once; casting the query to int8 would truncate it to near-zero values
sq_recall = recall_at_k(
    vectors,
    lambda q: np.argsort(sq_approx @ q)[-10:],
    k=10
)
print(f"SQ8 Recall@10: {sq_recall:.2%}")
# Expect: 95-99%
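The same harness covers the two-stage binary pipeline — this assumes the two_stage_search sketch from Step 3:
bq_recall = recall_at_k(vectors, two_stage_search, k=10)
print(f"Binary + rerank Recall@10: {bq_recall:.2%}")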
You should see:
| Method | Compression | Recall@10 | When to use |
|---|---|---|---|
| SQ8 | 4x | 97-99% | Default choice |
| PQ32 | 192x | 90-95% | High volume, acceptable accuracy drop |
| Binary | 32x | 70-85% | With float32 reranking stage |
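To turn those ratios into dollars, a back-of-envelope sketch using the $0.10–$0.30 per GB per month range from the intro (swap in your provider's real pricing):
raw_gb = 10_000_000 * 1536 * 4 / 1e9  # ~61 GB of float32 at 10M vectors
for name, ratio in [("SQ8", 4), ("PQ32", 192), ("Binary", 32)]:
    saved_gb = raw_gb * (1 - 1 / ratio)
    print(f"{name}: frees {saved_gb:.1f} GB -> ${saved_gb * 0.10:.0f}-${saved_gb * 0.30:.0f}/month")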
What You Learned
- SQ8 is the safest default — 4x compression, negligible accuracy loss, drop-in for most vector DBs
- Binary quantization needs a reranking step to be useful in production
- Product quantization gives you a tunable dial between compression and accuracy via n_subvectors
Limitation: These techniques work on the stored vectors. Your embedding model still outputs float32 — so you don't save on inference compute, only storage and ANN index size.
When NOT to use this: If your dataset is under 1 million vectors and storage costs are under $50/month, the engineering overhead isn't worth it. Start with SQ8 when you cross that threshold.
Tested on Python 3.12, NumPy 2.0, scikit-learn 1.5 — compatible with pgvector, Pinecone, Qdrant, and Weaviate quantization APIs.