Cost Optimization: How to Compress Vectors to Save Cloud Storage Fees

Cut vector database storage costs by 4x to 192x using scalar quantization, binary compression, and product quantization without wrecking search accuracy.

Problem: Your Vector Database Bill Is Out of Control

You're storing millions of embeddings and the cloud bill is climbing fast. A single float32 vector at 1536 dimensions takes 6KB. At 10 million vectors, that's 60GB — and most managed vector DBs charge $0.10–$0.30 per GB per month.
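That arithmetic is worth sanity-checking against your own numbers. A quick sketch (the per-GB rate is an assumed mid-range example; substitute your provider's pricing):

```python
# Back-of-envelope storage cost for raw float32 embeddings.
n_vectors = 10_000_000
dims = 1536
bytes_per_float32 = 4

size_gb = n_vectors * dims * bytes_per_float32 / 1e9
price_per_gb_month = 0.20  # assumed example rate, USD; check your provider

print(f"{size_gb:.0f} GB -> ${size_gb * price_per_gb_month:,.2f}/month")
```

Managed services often bill indexed storage at a multiple of the raw bytes, so treat this as a lower bound.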

You'll learn:

  • How to apply scalar, binary, and product quantization to shrink vectors
  • Which technique works best for your accuracy vs. cost tradeoff
  • How to implement compression in Python with real benchmarks

Time: 20 min | Level: Intermediate


Why This Happens

Embedding models output high-precision float32 values by default. That precision is rarely necessary for approximate nearest neighbor (ANN) search — yet you're paying full price for it.

The three main culprits are oversized embeddings, uncompressed storage formats, and no tiering strategy for cold vectors.

Common symptoms:

  • Cloud vector DB costs growing faster than your user base
  • Storage costs exceeding compute costs for your search service
  • Retrieval accuracy is fine but you're paying for precision you don't use

Solution

Step 1: Benchmark Your Baseline

Before compressing anything, measure what you're working with.

import numpy as np
import time

# Simulate 100k vectors at 1536 dims (OpenAI text-embedding-3-small)
vectors = np.random.randn(100_000, 1536).astype(np.float32)

print(f"Storage: {vectors.nbytes / 1e9:.2f} GB")
# Storage: 0.61 GB

# Cosine similarity baseline
query = np.random.randn(1536).astype(np.float32)
start = time.perf_counter()
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
elapsed = time.perf_counter() - start
print(f"Search time: {elapsed * 1000:.1f}ms")

Expected: You'll see your raw storage and search latency. Write these down — they're your baseline.


Step 2: Apply Scalar Quantization (Best Starting Point)

Scalar quantization (SQ8) converts float32 to int8, cutting storage by 4x with minimal accuracy loss. This is the safest first move.

def scalar_quantize(vectors: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Convert float32 vectors to int8, returning per-dimension offset and scale."""
    vmin = vectors.min(axis=0)
    vmax = vectors.max(axis=0)

    # Map each dimension's [vmin, vmax] range onto [-128, 127]
    scale = (vmax - vmin) / 255.0
    scale = np.where(scale == 0, 1.0, scale)  # Avoid division by zero on constant dims

    quantized = np.clip(
        np.round((vectors - vmin) / scale - 128),
        -128, 127
    ).astype(np.int8)

    return quantized, vmin, scale


def scalar_dequantize(quantized: np.ndarray, vmin: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct float32 from int8 for high-precision reranking."""
    return (quantized.astype(np.float32) + 128) * scale + vmin


# Apply it
sq_vectors, vmin, scale = scalar_quantize(vectors)
print(f"Compressed storage: {sq_vectors.nbytes / 1e9:.2f} GB")
# Compressed storage: 0.15 GB — 4x smaller

# Measure accuracy loss
sample_idx = np.random.choice(len(vectors), 1000)
reconstructed = scalar_dequantize(sq_vectors[sample_idx], vmin, scale)
mse = np.mean((vectors[sample_idx] - reconstructed) ** 2)
print(f"Mean squared error: {mse:.6f}")

Expected: ~4x storage reduction, MSE below 0.001 for typical embeddings.

If it fails:

  • High MSE (> 0.01): Your vectors may have outliers inflating the per-dimension range. Clip at the 1st/99th percentile before quantizing.
  • Constant dimensions: When vmax equals vmin, the scale would be zero; the zero-division guard shown above handles this.
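The percentile clip can be done in a few lines. A sketch (the `clip_outliers` helper and its percentile defaults are illustrative, not part of the article's pipeline):

```python
import numpy as np

def clip_outliers(vectors: np.ndarray, lo: float = 1.0, hi: float = 99.0) -> np.ndarray:
    """Clip each dimension to its [lo, hi] percentile to tame outliers."""
    p_lo = np.percentile(vectors, lo, axis=0)
    p_hi = np.percentile(vectors, hi, axis=0)
    return np.clip(vectors, p_lo, p_hi)

# Usage: quantize the clipped vectors instead of the raw ones
# sq_vectors, vmin, scale = scalar_quantize(clip_outliers(vectors))
```

Clipping shrinks the per-dimension range, which shrinks the scale step and therefore the quantization error on the bulk of the distribution.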

Step 3: Try Binary Quantization for Maximum Compression (32x)

Binary quantization (BQ) converts each dimension to a single bit. 32x smaller, but needs reranking to recover accuracy.

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Pack float32 vectors into binary (sign bit only)."""
    # Positive = 1, negative = 0
    return np.packbits(vectors > 0, axis=1)


def hamming_search(binary_db: np.ndarray, binary_query: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Fast Hamming distance search on binary vectors."""
    # XOR then popcount gives Hamming distance
    distances = np.unpackbits(
        np.bitwise_xor(binary_db, binary_query), axis=1
    ).sum(axis=1)
    return np.argsort(distances)[:top_k]


bq_vectors = binary_quantize(vectors)
print(f"Binary storage: {bq_vectors.nbytes / 1e9:.3f} GB")
# Binary storage: 0.019 GB — 32x smaller

bq_query = binary_quantize(query.reshape(1, -1))
candidates = hamming_search(bq_vectors, bq_query, top_k=100)
# Rerank top-100 candidates with original float32 for accuracy

The catch: Binary alone loses recall. Always use a two-stage approach — binary search for candidates, float32 rerank for the final top-K.
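A minimal version of that two-stage flow, reusing the same packbits/Hamming logic as above (the `two_stage_search` helper is a sketch, not a production implementation):

```python
import numpy as np

def two_stage_search(
    query: np.ndarray,
    binary_db: np.ndarray,
    float_db: np.ndarray,
    top_k: int = 10,
    n_candidates: int = 100,
) -> np.ndarray:
    """Stage 1: Hamming search on binary codes. Stage 2: float32 rerank."""
    bq = np.packbits(query > 0).reshape(1, -1)
    # XOR then popcount gives Hamming distance to every database vector
    distances = np.unpackbits(np.bitwise_xor(binary_db, bq), axis=1).sum(axis=1)
    candidates = np.argsort(distances)[:n_candidates]
    # Rerank only the candidates at full precision
    scores = float_db[candidates] @ query
    return candidates[np.argsort(scores)[-top_k:][::-1]]
```

The rerank touches only `n_candidates` float32 rows, so you can keep the full-precision vectors in cheap cold storage and still return accurate top-K results.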


Step 4: Use Product Quantization for the Sweet Spot

Product quantization (PQ) splits each vector into sub-vectors and quantizes them independently. You control the compression ratio vs. accuracy tradeoff.

from sklearn.cluster import MiniBatchKMeans

def train_pq(
    vectors: np.ndarray,
    n_subvectors: int = 32,  # More = better accuracy, more storage
    n_clusters: int = 256    # 256 clusters = 1 byte per sub-vector
) -> tuple[list, int]:
    """Train product quantization codebooks."""
    dim = vectors.shape[1]
    sub_dim = dim // n_subvectors
    codebooks = []
    
    for i in range(n_subvectors):
        sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
        kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, n_init=3)
        kmeans.fit(sub)
        codebooks.append(kmeans)
    
    return codebooks, sub_dim


def pq_encode(vectors: np.ndarray, codebooks: list, sub_dim: int) -> np.ndarray:
    """Encode vectors using trained codebooks."""
    codes = np.zeros((len(vectors), len(codebooks)), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
        codes[:, i] = cb.predict(sub)
    return codes


# Train on a representative sample
print("Training PQ codebooks...")
codebooks, sub_dim = train_pq(vectors[:10_000])  # Train on subset
pq_codes = pq_encode(vectors, codebooks, sub_dim)

print(f"PQ storage: {pq_codes.nbytes / 1e9:.3f} GB")
# PQ storage: 0.003 GB — ~192x smaller than float32 (plus ~1.6 MB of codebooks)

Expected: 32 one-byte codes per vector (~192x compression with these defaults; raise n_subvectors for better recall at a lower ratio), with better recall than binary, at the cost of codebook training and slower encode time.
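Searching PQ codes doesn't require reconstructing full vectors. The standard trick is asymmetric distance computation (ADC): per sub-space, precompute the squared distance from the query sub-vector to all 256 centroids, then sum table lookups per database vector. A sketch against the train_pq/pq_encode outputs above (the `pq_search` helper itself is not from the article):

```python
import numpy as np

def pq_search(
    query: np.ndarray,
    codes: np.ndarray,
    codebooks: list,
    sub_dim: int,
    top_k: int = 10,
) -> np.ndarray:
    """Approximate nearest neighbors via asymmetric distance computation."""
    n_sub = len(codebooks)
    # Distance table: one row per sub-space, one column per centroid
    tables = np.stack([
        np.linalg.norm(
            cb.cluster_centers_ - query[i * sub_dim:(i + 1) * sub_dim],
            axis=1,
        ) ** 2
        for i, cb in enumerate(codebooks)
    ])  # shape: (n_subvectors, n_clusters)
    # Each vector's distance is the sum of its per-sub-space table entries
    distances = tables[np.arange(n_sub), codes].sum(axis=1)
    return np.argsort(distances)[:top_k]
```

Because the inner loop is pure table lookups, ADC search stays fast even though the codes themselves are never decoded.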


Verification

Run a recall benchmark to confirm compression didn't destroy search quality.

def recall_at_k(original: np.ndarray, compressed_fn, k: int = 10, n_queries: int = 100) -> float:
    """Measure what fraction of true top-K are found after compression."""
    hits = 0
    for _ in range(n_queries):
        q = np.random.randn(original.shape[1]).astype(np.float32)
        
        # Ground truth from float32
        true_scores = original @ q
        true_top = set(np.argsort(true_scores)[-k:])
        
        # Compressed results
        approx_top = set(compressed_fn(q))
        hits += len(true_top & approx_top)
    
    return hits / (n_queries * k)


# Test scalar quantization recall.
# Dequantize once so the test isolates pure quantization error; casting the
# raw float query to int8 would truncate it and overflow the int8 matmul.
deq_vectors = scalar_dequantize(sq_vectors, vmin, scale)
sq_recall = recall_at_k(
    vectors,
    lambda q: np.argsort(deq_vectors @ q)[-10:],
    k=10
)
print(f"SQ8 Recall@10: {sq_recall:.2%}")
# Expect: 95-99%

You should see:

  Method   Compression   Recall@10   When to use
  SQ8      4x            97-99%      Default choice
  PQ32     192x          90-95%      High volume, acceptable accuracy drop
  Binary   32x           70-85%      With float32 reranking stage

What You Learned

  • SQ8 is the safest default — 4x compression, negligible accuracy loss, drop-in for most vector DBs
  • Binary quantization needs a reranking step to be useful in production
  • Product quantization gives you a tunable dial between compression and accuracy via n_subvectors

Limitation: These techniques work on the stored vectors. Your embedding model still outputs float32 — so you don't save on inference compute, only storage and ANN index size.

When NOT to use this: If your dataset is under 1 million vectors and storage costs are under $50/month, the engineering overhead isn't worth it. Start with SQ8 when you cross that threshold.


Tested on Python 3.12, NumPy 2.0, scikit-learn 1.5 — compatible with pgvector, Pinecone, Qdrant, and Weaviate quantization APIs.