Benchmarking RAG retrieval: BM25 baseline keeps beating small embedding models

Question

I ran a systematic comparison on our internal docs corpus (12K chunks, mixed technical and procedural content):

- BM25 (Elasticsearch): nDCG@10 = 0.71
- text-embedding-3-small (OpenAI, 1536d): nDCG@10 = 0.64
- bge-large-en-v1.5: nDCG@10 = 0.68
- Hybrid (BM25 + bge, α=0.6): nDCG@10 = 0.76
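
For anyone who wants to compare against their own corpus, here is a minimal sketch of one common way to compute nDCG@10. The post does not state which nDCG variant was used, so the linear gain, the log2 discount, and the function names `dcg_at_k` and `ndcg_at_k` are all assumptions for illustration, not the exact evaluation code behind the numbers above.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels: Sequence[float], judged_rels: Sequence[float], k: int = 10) -> float:
    """nDCG@k for a single query.

    ranked_rels: relevance labels of the retrieved chunks, in ranked order.
    judged_rels: relevance labels of ALL judged chunks for the query, so the
                 ideal ranking also counts relevant chunks the retriever missed.
    """
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    if ideal == 0:
        return 0.0  # assumption: a query with no relevant chunks scores 0
    return dcg_at_k(ranked_rels, k) / ideal


# Hypothetical query: two relevant chunks exist, retrieved at ranks 1 and 3.
score = ndcg_at_k([1, 0, 1], [1, 1], k=10)
```

The per-query scores would then be averaged over the query set to get a single figure per retriever.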

The embedding-only models consistently underperform BM25 on keyword-heavy queries (error codes, config keys, API endpoint paths). Hybrid scores best but adds complexity and roughly 40 ms of extra latency per query.
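
To make the hybrid setup concrete, here is a rough sketch of weighted score fusion. The post only gives α=0.6, so several things here are assumptions: that α weights the dense (bge) score and 1 - α weights BM25, that both score sets are min-max normalized per query before mixing, and that a chunk missing from one result list contributes 0 from that side. The names `min_max` and `hybrid_scores` are hypothetical.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores for a single query into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tied: treat them as equally strong.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float], dense: Dict[str, float], alpha: float = 0.6
) -> Dict[str, float]:
    """Convex combination of normalized scores.

    Assumption: alpha weights the dense score, (1 - alpha) weights BM25.
    """
    b, d = min_max(bm25), min_max(dense)
    return {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in set(b) | set(d)
    }


# Hypothetical candidates for one query (chunk id -> raw score).
bm25_hits = {"a": 10.0, "b": 5.0, "c": 0.0}
dense_hits = {"a": 0.2, "b": 0.8, "d": 0.5}

fused = hybrid_scores(bm25_hits, dense_hits, alpha=0.6)
ranking = sorted(fused, key=fused.get, reverse=True)
```

Min-max normalization is sensitive to outlier scores; rank-based fusion such as reciprocal rank fusion is a common alternative that avoids score normalization altogether.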

Questions for anyone running RAG in production:
1. Did you observe the same BM25 advantage on technical docs?
2. Is the hybrid latency overhead acceptable for your use case?
3. Any experience with re-ranking (Cohere, bge-reranker) to narrow the gap without full hybrid?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback