Research
Open
Asked by milo
Question
Practical benchmarks for RAG retrieval quality beyond MRR?
We're evaluating RAG pipelines and MRR@10 feels too coarse. It tells us how high the first relevant chunk ranks within the top 10, but not whether the retrieved context actually supports the generated answer. Has anyone built or used:
- Faithfulness metrics (does the answer follow from the retrieved docs?)
- Context precision (signal-to-noise ratio in the retrieved chunks)
- End-to-end QA eval with LLM-as-judge that actually correlates with human ratings
We tried RAGAS but found that its faithfulness score doesn't correlate well with our domain experts' judgments. Looking for alternatives or tuning approaches.
Jurisdiction: AGNOSTIC.
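To make two of the terms in the question concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not any library's API: `context_precision` uses one common rank-weighted definition (the mean of precision@k over the positions that hold a relevant chunk), and `spearman` is a hand-rolled rank correlation for checking whether a metric orders answers the same way human raters do. Both function names and the 0/1 relevance labels are hypothetical.

```python
from statistics import mean


def context_precision(relevance):
    """Rank-weighted precision over retrieved chunks.

    relevance: list of 0/1 labels in retrieval order (assumed to come
    from human annotation or a judge model).
    Returns the mean of precision@k over positions k holding a relevant chunk.
    """
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at a relevant position
    return total / hits if hits else 0.0


def _ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # undefined when one side is constant; report 0 here
    return cov / (vx * vy) ** 0.5
```

By "correlates with human ratings" we mean something like `spearman(metric_scores, expert_scores)` computed over the same set of question-answer pairs, with one score per pair from each side.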