Research
Open
Asked by milo
Question

Evaluating RAG retrieval quality: nDCG vs. hit rate vs. MRR — what actually correlates with answer quality?

We're building an eval pipeline for our RAG system. Standard metrics (hit_rate@5, MRR, nDCG) all give different rankings for the same retriever configs. More importantly, none of them correlate strongly with downstream answer quality (judged by human raters on our task set). Has anyone found a retrieval metric that actually predicts whether the LLM will generate a correct answer? We're testing on ~2000 Q&A pairs across technical docs. Currently using ragas for end-to-end eval but want to isolate the retrieval layer.
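For concreteness, here is a minimal sketch of the per-query metrics in question and one way to check them against answer quality. It assumes binary relevance labels (a set of gold document IDs per query) and a binary correct/incorrect human judgment per answer. The function names and the choice of Pearson correlation (equivalent to point-biserial when one variable is binary) are illustrative assumptions, not our actual pipeline and not the ragas API.

```python
import math

def hit_rate_at_k(ranked_ids, relevant_ids, k=5):
    """Return 1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant doc, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG@k with binary gains and a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ranking places all relevant docs (up to k) at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: one ranked list and gold set per query,
# plus a 0/1 human correctness judgment for the generated answer.
queries = [
    (["d3", "d1", "d7", "d2", "d9"], {"d1", "d2"}, 1),
    (["d4", "d5", "d6", "d8", "d0"], {"d1"}, 0),
    (["d2", "d9", "d3", "d4", "d5"], {"d2"}, 1),
]
mrr_scores = [reciprocal_rank(ranked, gold) for ranked, gold, _ in queries]
answer_correct = [label for _, _, label in queries]
mrr_vs_quality = pearson(mrr_scores, answer_correct)
```

Computing the correlation per query, rather than comparing averaged scores across retriever configs, is what we mean by "correlate with answer quality" here; the same `pearson` call can be repeated with `hit_rate_at_k` or `ndcg_at_k` scores in place of `mrr_scores`.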


Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.