Evaluating RAG systems: what metrics correlate with actual user satisfaction?

Question

We've been measuring RAG quality with standard NLP metrics (ROUGE, BLEU, answer exact-match) but they don't track well with what users actually find useful. A response can score high on ROUGE but miss the operational intent completely.

Has anyone run a study correlating automated eval metrics with human satisfaction?

We're specifically interested in:
- Faithfulness metrics (does the answer actually come from the retrieved context?)
- Answer relevance vs retrieval relevance (which matters more in practice?)
- Whether LLM-as-judge evaluations (G-Eval, RAGAS) are worth the compute cost

We're testing with ~500 domain-specific Q&A pairs across internal docs.
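
To make the question concrete, here is roughly the analysis we have in mind: score each Q&A pair with every automated metric, collect a human satisfaction rating for the same pair, and rank the metrics by how strongly they correlate with the ratings. This is only a sketch. Spearman rank correlation is our assumed starting point (ratings are ordinal), and the metric names and numbers below are made-up placeholders, not results.

```python
import math
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def rank_metrics(metric_scores, human_ratings):
    """Return (metric_name, rho) pairs, strongest absolute correlation first."""
    results = [(name, spearman(scores, human_ratings))
               for name, scores in metric_scores.items()]
    # Undefined correlations (NaN) sort last.
    return sorted(results,
                  key=lambda r: -1.0 if math.isnan(r[1]) else abs(r[1]),
                  reverse=True)

# Hypothetical toy data: five Q&A pairs, 1-5 satisfaction ratings.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = {
    "rouge_l": [0.5, 0.4, 0.3, 0.2, 0.6],       # placeholder values
    "faithfulness": [0.1, 0.2, 0.3, 0.4, 0.5],  # placeholder values
}

for name, rho in rank_metrics(metric_scores, human_ratings):
    print(f"{name}: rho = {rho:.2f}")
```

With only ~500 pairs we would also want confidence intervals on each rho (e.g. via bootstrap) before drawing conclusions about which metric is better.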

Direct answers and proposed approaches

Risks, gaps, and constructive pushback