Practical benchmarks for RAG retrieval quality beyond MRR?

Question

We're evaluating RAG pipelines and MRR@10 feels too coarse. It only captures how high the first relevant chunk ranks within the top 10, not whether the retrieved context actually supports the generated answer.
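For reference, this is roughly what we mean by MRR@10, as a minimal sketch. The function names and the `(ranked_ids, relevant_ids)` input shape are our own illustration, not any library's API. The score depends only on the first relevant hit, so everything else in the retrieved context is invisible to it.

```python
from typing import Hashable, Sequence, Set, Tuple

Query = Tuple[Sequence[Hashable], Set[Hashable]]  # (ranked chunk ids, relevant chunk ids)


def reciprocal_rank_at_k(ranked_ids: Sequence[Hashable],
                         relevant_ids: Set[Hashable],
                         k: int = 10) -> float:
    """Return 1/rank of the first relevant id in the top k, or 0.0 if none appears."""
    for rank, chunk_id in enumerate(ranked_ids[:k], start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mrr_at_k(queries: Sequence[Query], k: int = 10) -> float:
    """Mean reciprocal rank over all queries."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in queries) / len(queries)


def hit_rate_at_k(queries: Sequence[Query], k: int = 10) -> float:
    """Fraction of queries with at least one relevant id in the top k."""
    if not queries:
        raise ValueError("need at least one query")
    hits = sum(1 for r, rel in queries if any(c in rel for c in r[:k]))
    return hits / len(queries)
```

Reporting hit rate next to MRR at least separates "found it at all" from "found it near the top", but neither says anything about the answer itself.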

Has anyone built or used:
- Faithfulness metrics (does the answer follow from retrieved docs?)
- Context precision (signal-to-noise ratio in retrieved chunks?)
- End-to-end QA eval with LLM-as-judge that actually correlates with human ratings?
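To make the context precision bullet concrete, here is a small sketch of two variants we have in mind. It assumes each retrieved chunk already carries a binary relevance label (from a human or a judge model). The function names are hypothetical, and these are simplified definitions, not necessarily what any particular library computes.

```python
from typing import Sequence


def context_precision(relevance: Sequence[bool]) -> float:
    """Unweighted signal-to-noise: fraction of retrieved chunks labeled relevant."""
    if not relevance:
        return 0.0
    return sum(relevance) / len(relevance)


def rank_weighted_precision(relevance: Sequence[bool]) -> float:
    """Average of precision@i over the positions i that hold a relevant chunk.

    This is the usual average-precision formula, so relevant chunks
    placed early in the context score higher than the same chunks placed late.
    """
    hits = 0
    total = 0.0
    for i, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / i  # precision at position i
    return total / hits if hits else 0.0
```

With labels `[True, False, True, False]` and `[False, False, True, True]` the unweighted score is the same, while the rank-weighted one prefers the first ordering. That difference matters if the generator pays more attention to early context.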

We tried RAGAS but found the faithfulness score doesn't correlate well with our domain experts' judgments. Looking for alternatives or tuning approaches.
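For context on what "correlate" means here: a sketch of the kind of check we run, Spearman rank correlation between an automated metric's per-example scores and expert ratings on the same examples. It is written in plain Python to stay self-contained, and the function names are our own. In practice a stats library would do the same job.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(metric_scores: Sequence[float], expert_ratings: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(expert_ratings) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(metric_scores), average_ranks(expert_ratings))
```

Rank correlation seems like the fairer comparison when the metric is a 0 to 1 score and the experts use a coarse ordinal scale, since it only asks whether the two orderings agree.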

Direct answers and proposed approaches

Risks, gaps, and constructive pushback