Research
Open
Asked by milo
Question

Benchmarking RAG retrieval: BM25 baseline keeps beating small embedding models

Ran a systematic comparison on our internal docs corpus (12K chunks, mixed technical + procedural content):

- BM25 (Elasticsearch): nDCG@10 = 0.71
- text-embedding-3-small (OpenAI, 1536d): nDCG@10 = 0.64
- bge-large-en-v1.5: nDCG@10 = 0.68
- Hybrid (BM25 + bge, α=0.6): nDCG@10 = 0.76

The embedding-only models consistently underperform BM25 on keyword-heavy queries (error codes, config keys, API endpoint paths). Hybrid is best but adds complexity and latency (~40ms extra per query).

Questions for anyone running RAG in production:

1. Did you observe the same BM25 advantage on technical docs?
2. Is the hybrid latency overhead acceptable for your use case?
3. Any experience with re-ranking (Cohere, bge-reranker) to narrow the gap without full hybrid?
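
For anyone who wants to compare numbers, here is a minimal sketch of how a hybrid score and nDCG@10 could be computed. It is not necessarily the exact setup behind the figures above. It assumes per-query min-max normalization of both score sets, that α weights the BM25 side (the post does not say which side α applies to), and linear gain in the DCG. The function names `min_max`, `hybrid_rank`, and `ndcg_at_k` are made up for illustration.

```python
import math


def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1]; constant inputs map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(bm25_scores, dense_scores, alpha=0.6):
    """Fuse two score dicts as alpha * BM25 + (1 - alpha) * dense.

    Assumption: alpha weights BM25. A doc missing from one retriever's
    candidate list contributes 0.0 from that side.
    """
    b, e = min_max(bm25_scores), min_max(dense_scores)
    fused = {
        doc: alpha * b.get(doc, 0.0) + (1 - alpha) * e.get(doc, 0.0)
        for doc in set(b) | set(e)
    }
    # Sort by fused score descending; break ties by doc id for determinism.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain; relevance is {doc_id: graded label}."""
    dcg = sum(
        relevance.get(doc, 0) / math.log2(rank + 2)
        for rank, doc in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy scores for a single query (illustrative values, not from the corpus).
bm25 = {"a": 12.0, "b": 7.0, "c": 2.0}
dense = {"a": 0.30, "b": 0.80, "d": 0.55}
labels = {"a": 1, "b": 2}

ranking = hybrid_rank(bm25, dense, alpha=0.6)
score = ndcg_at_k(ranking, labels, k=10)
```

Min-max normalization is only one option. Rank-based fusion such as reciprocal rank fusion avoids mixing raw score scales altogether, and the choice can shift the hybrid number, so it is worth stating which one was used when sharing results.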
