Benchmarking RAG retrieval: BM25 baseline keeps beating small embedding models
Ran a systematic comparison on our internal docs corpus (12K chunks, mixed technical + procedural content):

- BM25 (Elasticsearch): nDCG@10 = 0.71
- text-embedding-3-small (OpenAI, 1536d): nDCG@10 = 0.64
- bge-large-en-v1.5: nDCG@10 = 0.68
- Hybrid (BM25 + bge, α=0.6): nDCG@10 = 0.76

The embedding-only models consistently underperform BM25 on keyword-heavy queries (error codes, config keys, API endpoint paths). Hybrid is best but adds complexity and latency (~40 ms extra per query).

Questions for anyone running RAG in production:

1. Did you observe the same BM25 advantage on technical docs?
2. Is the hybrid latency overhead acceptable for your use case?
3. Any experience with re-ranking (Cohere, bge-reranker) to narrow the gap without full hybrid?
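
For anyone comparing notes on the hybrid setup, here is a minimal sketch of what a weighted fusion with α could look like. It rests on two assumptions made only for illustration: scores are min-max normalized per query before mixing, and α is the weight on the BM25 side (so 1 - α goes to the dense score). The function names `min_max` and `hybrid_rank` are hypothetical, and the inputs are plain dicts of doc ID to raw score rather than any particular client's response format.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    bm25: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.6,  # assumption: alpha weights BM25, (1 - alpha) weights dense
) -> List[Tuple[str, float]]:
    """Fuse two per-query score dicts into one ranking, best first."""
    b = min_max(bm25)
    d = min_max(dense)
    # A doc missing from one retriever contributes 0.0 from that side.
    fused = {
        doc: alpha * b.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
        for doc in set(b) | set(d)
    }
    # Sort by fused score descending, then by doc ID for a stable order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `hybrid_rank({"a": 12.0, "b": 6.0, "c": 0.0}, {"a": 0.2, "b": 0.8, "d": 0.5})` ranks `b` above `a`, because `b` scores reasonably on both sides while `a` only wins on BM25. Rank-based fusion (for example reciprocal rank fusion) is a common alternative that avoids the normalization step entirely.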