Pick the Right Embedding Model: OpenAI vs. BGE-M3

Choose between OpenAI text-embedding-3 and BGE-M3 for your RAG or search system. Covers latency, cost, multilingual needs, and deployment trade-offs.

Problem: Choosing the Wrong Embedding Model Costs You Later

You're building a RAG pipeline or semantic search system and need to pick an embedding model. OpenAI's text-embedding-3 is the easy default — but BGE-M3 from BAAI often outperforms it on retrieval benchmarks, runs locally, and costs nothing per token.

You'll learn:

  • When OpenAI embeddings are worth the API cost
  • When BGE-M3 is the better production choice
  • How to benchmark both on your actual data in under 30 minutes

Time: 12 min | Level: Intermediate


Why This Matters

Embedding model choice is sticky. Once you embed a million documents, switching models means re-embedding everything. Picking the wrong model early leads to degraded retrieval quality, unexpected API bills, or compliance issues if your data can't leave your infrastructure.

Common failure modes:

  • Defaulting to OpenAI because it's easy, then hitting rate limits at scale
  • Using BGE-M3 for a multilingual use case without testing on your target languages
  • Benchmarking on generic datasets instead of your own documents

The Core Trade-offs

OpenAI text-embedding-3-small / text-embedding-3-large

Use this when:

  • You need managed infrastructure with SLA guarantees
  • Your team can't operate a GPU server
  • Low query volume keeps costs acceptable (< ~5M tokens/day)
  • You're already deep in the OpenAI ecosystem

Watch out for:

  • Cost compounds fast at scale — at list prices, text-embedding-3-large costs roughly 6.5× more per token than small
  • Data leaves your network on every query
  • Rate limits can bottleneck high-throughput indexing jobs
  • Vendor lock-in: dimensions differ from open models, so you can't hot-swap

from openai import OpenAI

client = OpenAI()

def embed_openai(texts: list[str], model="text-embedding-3-small") -> list[list[float]]:
    response = client.embeddings.create(input=texts, model=model)
    # Returns 1536-dim vectors for small, 3072-dim for large
    return [item.embedding for item in response.data]

Expected: 1536-dim float vectors, ~200ms latency per batch of 100 texts.
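
Rate limits (the third watch-out above) bite hardest during bulk indexing jobs. A minimal retry sketch with exponential backoff — `embed_with_backoff` is a hypothetical helper, not part of the OpenAI SDK; in production you'd catch openai.RateLimitError specifically rather than a bare Exception:

```python
import time


def embed_with_backoff(embed_fn, texts: list[str], max_retries: int = 5):
    """Retry an embedding call with exponential backoff on transient errors."""
    for attempt in range(max_retries):
        try:
            return embed_fn(texts)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between attempts
```

Wrap your indexing loop with it, e.g. embed_with_backoff(embed_openai, batch), and the job survives transient 429s instead of dying mid-corpus.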


BGE-M3

BGE-M3 (from BAAI) is a single model that supports dense retrieval, sparse retrieval (like BM25), and multi-vector (ColBERT-style) retrieval — all from one checkpoint. It's trained on 100+ languages.

Use this when:

  • You need data to stay on-prem (compliance, privacy)
  • Query volume is high enough that API costs matter
  • You need multilingual retrieval across non-English corpora
  • You want hybrid dense+sparse retrieval without two separate models

Watch out for:

  • Requires GPU for production throughput (CPU works but is slow)
  • First-time model download is ~2.3GB
  • Slightly higher p99 latency vs. OpenAI on cold starts

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # fp16 halves VRAM usage

def embed_bge(texts: list[str]) -> dict:
    # Returns dense, sparse, and colbert_vecs
    return model.encode(
        texts,
        batch_size=12,
        max_length=8192,  # BGE-M3 supports up to 8192 tokens
        return_dense=True,
        return_sparse=True,   # Enable for hybrid search
        return_colbert_vecs=False  # Only for re-ranking pipelines
    )

Expected: output["dense_vecs"] is a numpy array of shape (n, 1024).

If it fails:

  • CUDA OOM: Reduce batch_size to 4 and enable use_fp16=True
  • Slow on CPU: Set TOKENIZERS_PARALLELISM=false and use batch_size=1
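
Since the sparse output is what enables hybrid search (one of the advantages listed above), here is a minimal sketch of combining the two signals. It assumes you have already called embed_bge with return_sparse=True; the hybrid_score helper and the alpha weighting are illustrative, not part of FlagEmbedding:

```python
import numpy as np


def hybrid_score(dense_q: np.ndarray, dense_d: np.ndarray,
                 sparse_q: dict, sparse_d: dict, alpha: float = 0.7) -> float:
    """Weighted mix of dense cosine similarity and sparse lexical overlap.

    sparse_q / sparse_d are BGE-M3 lexical weights: {token_id: weight} dicts.
    """
    # Dense signal: cosine similarity between the two dense vectors
    dense = float(np.dot(dense_q, dense_d)
                  / (np.linalg.norm(dense_q) * np.linalg.norm(dense_d)))
    # Sparse signal: dot product over the token ids both texts share
    sparse = sum(w * sparse_d.get(tok, 0.0) for tok, w in sparse_q.items())
    return alpha * dense + (1 - alpha) * sparse
```

Pair out["dense_vecs"][i] with the corresponding entry of the lexical-weights output from encode; tune alpha on held-out queries, since the best mix is corpus-dependent.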

Step-by-Step: Benchmark Both on Your Data

Don't trust generic leaderboards. Run this on 200–500 representative queries and documents from your own corpus.

Step 1: Set Up the Environment

pip install openai FlagEmbedding faiss-cpu numpy --break-system-packages

Step 2: Build a Simple Retrieval Test

import numpy as np
import faiss

def cosine_index(vectors: np.ndarray) -> faiss.IndexFlatIP:
    # Normalize for cosine similarity with inner product index
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def recall_at_k(index, query_vecs: np.ndarray, ground_truth: list[int], k=10) -> float:
    faiss.normalize_L2(query_vecs)
    _, indices = index.search(query_vecs, k)
    hits = sum(gt in row for gt, row in zip(ground_truth, indices))
    return hits / len(ground_truth)

Step 3: Compare Recall@10

# Embed your corpus with both models
corpus_openai = np.array(embed_openai(corpus_texts), dtype="float32")  # faiss requires float32
corpus_bge    = embed_bge(corpus_texts)["dense_vecs"].astype("float32")

# Build indexes
idx_openai = cosine_index(corpus_openai)
idx_bge    = cosine_index(corpus_bge)

# Embed queries
q_openai = np.array(embed_openai(query_texts), dtype="float32")
q_bge    = embed_bge(query_texts)["dense_vecs"].astype("float32")

print(f"OpenAI Recall@10: {recall_at_k(idx_openai, q_openai, ground_truth):.3f}")
print(f"BGE-M3  Recall@10: {recall_at_k(idx_bge, q_bge, ground_truth):.3f}")

You should see: BGE-M3 often matches or beats OpenAI on domain-specific corpora, particularly multilingual or keyword-heavy ones. The gap varies by domain and language, which is exactly why you run this on your own data rather than trusting leaderboards.
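
Recall alone doesn't settle the choice; latency matters too. A minimal wall-clock sketch (time_embed is a hypothetical helper; plug in the embed_openai and embed_bge functions defined above):

```python
import time


def time_embed(embed_fn, texts: list[str], repeats: int = 3) -> float:
    """Return best-of-N wall-clock seconds for one embedding call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        embed_fn(texts)
        timings.append(time.perf_counter() - start)
    return min(timings)  # best-of-N filters out cold-start and network jitter
```

For example: time_embed(embed_openai, query_texts) vs. time_embed(lambda t: embed_bge(t)["dense_vecs"], query_texts). Expect BGE-M3's first call to be slow (model load), which is why best-of-N matters.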


Decision Matrix

Factor                     OpenAI text-embedding-3-small    BGE-M3
-------------------------------------------------------------------------
Setup time                 5 minutes                        20–30 minutes
Cost at 100M tokens/day    ~$2/day at list price            Infrastructure only
Max context                8191 tokens                      8192 tokens
Languages                  English-dominant                 100+ languages
Data privacy               Sent to OpenAI                   Stays local
GPU required               No                               Recommended
Hybrid search              No (dense only)                  Yes (dense + sparse)
Dimension                  1536 / 3072                      1024

Verification

python -c "
from FlagEmbedding import BGEM3FlagModel
m = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
out = m.encode(['test'], return_dense=True)
print('BGE-M3 dim:', out['dense_vecs'].shape[1])  # Expected: 1024
"

You should see: BGE-M3 dim: 1024


What You Learned

  • OpenAI embeddings are optimal when managed infrastructure and simplicity matter more than cost at scale
  • BGE-M3 is the better default for multilingual, high-volume, or privacy-sensitive deployments
  • Always benchmark Recall@10 on your own data — leaderboard rankings rarely reflect domain-specific performance
  • BGE-M3's hybrid mode (dense + sparse) can close the gap on keyword-heavy queries where dense-only embeddings struggle

Limitation: BGE-M3 colbert_vecs mode significantly improves re-ranking accuracy but requires storing multi-vector representations per document — storage increases 5–10×. Only enable it if you have a re-ranking stage.


Tested on BGE-M3 v1.0, FlagEmbedding 1.2.x, OpenAI Python SDK v1.x, Python 3.12, CUDA 12.1