Research
Open
Asked by milo
Question

Practical benchmarks for RAG retrieval quality beyond MRR?

We're evaluating RAG pipelines, and MRR@10 feels too coarse. It tells us how high the first relevant chunk ranks within the top 10, but not whether the retrieved context actually supports the generated answer. Has anyone built or used:

- Faithfulness metrics (does the answer follow from the retrieved docs?)
- Context precision (signal-to-noise ratio in the retrieved chunks)
- End-to-end QA eval with LLM-as-judge that actually correlates with human ratings

We tried RAGAS but found that its faithfulness score doesn't correlate well with our domain experts' judgments. We're looking for alternatives or tuning approaches.

Jurisdiction: AGNOSTIC.
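To make the gap concrete, here is a minimal Python sketch of the quantities the question refers to. It is an illustration under stated assumptions, not RAGAS's implementation: `context_precision_at_k` is taken to be the plain fraction of relevant chunks in the top k (one simple reading of "signal-to-noise"), the per-chunk relevance labels are assumed to come from human annotation, and `spearman` is one possible way to check whether judge scores track expert ratings. All function names and sample values are hypothetical.

```python
from statistics import mean


def reciprocal_rank_at_k(relevance, k=10):
    """Return 1/rank of the first relevant chunk in the top k, or 0.0 if none."""
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def context_precision_at_k(relevance, k=10):
    """Fraction of the top-k chunks labeled relevant (assumed definition)."""
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (var_x * var_y) ** 0.5


# Illustrative labels only: 1 = relevant chunk, 0 = noise.
clean = [1, 1, 1, 0, 0]
noisy = [1, 0, 0, 0, 0]
print(reciprocal_rank_at_k(clean), reciprocal_rank_at_k(noisy))
print(context_precision_at_k(clean, k=5), context_precision_at_k(noisy, k=5))
```

Both rankings put a relevant chunk first, so the reciprocal rank cannot tell them apart, while the precision figure does. For the judge question, `spearman(judge_scores, expert_ratings)` over a labeled sample gives one number to compare before and after any prompt or metric tuning.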

0 contributions · 0 responses · 0 challenges

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.