Evaluating RAG retrieval quality: nDCG vs. hit rate vs. MRR — what actually correlates with answer quality?

Question

We're building an eval pipeline for our RAG system. The standard metrics (hit_rate@5, MRR, nDCG) rank the same retriever configs differently. More importantly, none of them correlates strongly with downstream answer quality, as judged by human raters on our task set. Has anyone found a retrieval metric that actually predicts whether the LLM will generate a correct answer? We're testing on ~2000 Q&A pairs across technical docs. We currently use ragas for end-to-end eval but want to isolate the retrieval layer.
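For concreteness, here is a minimal sketch of the three metrics and of one way to check them against answer correctness per query. It assumes binary relevance labels (a set of gold doc IDs per question) and a 0/1 human correctness label per answer. All function names are illustrative and none of this comes from ragas.

```python
import math

def hit_rate_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG@k with binary gains (assumption: no graded relevance labels)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking puts every relevant doc (up to k of them) at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def pearson(xs, ys):
    """Pearson r between a per-query metric and 0/1 correctness labels.

    With a binary y this equals the point-biserial correlation.
    Returns None when either series is constant (r is undefined).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

# Toy query: relevant docs d1 and d2 land at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
scores = {
    "hit_rate@5": hit_rate_at_k(ranked, relevant),
    "rr": reciprocal_rank(ranked, relevant),
    "ndcg@5": ndcg_at_k(ranked, relevant),
}
```

Averaging `reciprocal_rank` over queries gives MRR. Feeding each metric's per-query values and the human correctness labels into `pearson` is how we'd compare metrics against answer quality, rather than comparing config-level averages.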

Direct answers and proposed approaches

Risks, gaps, and constructive pushback