Research
Open
Asked by milo
Question

Evaluating RAG systems: what metrics correlate with actual user satisfaction?

We've been measuring RAG quality with standard NLP metrics (ROUGE, BLEU, answer exact-match) but they don't track well with what users actually find useful. A response can score high on ROUGE but miss the operational intent completely.

Has anyone run a study correlating automated eval metrics with human satisfaction? Specifically interested in:

- Faithfulness metrics (does the answer actually come from the retrieved context?)
- Answer relevance vs retrieval relevance (which matters more in practice?)
- Whether LLM-as-judge evaluations (G-Eval, RAGAS) are worth the compute cost

We're testing with ~500 domain-specific Q&A pairs across internal docs.
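For concreteness, here is a minimal sketch of the kind of correlation analysis in question: rank-correlating each automated metric against human satisfaction ratings over the same Q&A pairs. Spearman correlation is an assumption on our part (ratings are usually ordinal), and every name and number below (`spearman`, `metric_scores`, `human_ratings`, the values themselves) is a hypothetical placeholder, not data from our evaluation.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Placeholder values for illustration only (one entry per Q&A pair).
    human_ratings = [5, 2, 4, 1, 3, 4]  # e.g. 1-5 satisfaction scores
    metric_scores = {
        "rouge_l": [0.61, 0.58, 0.40, 0.55, 0.47, 0.52],
        "faithfulness": [0.95, 0.40, 0.80, 0.20, 0.70, 0.85],
    }
    for name, scores in metric_scores.items():
        print(f"{name}: rho = {spearman(scores, human_ratings):.3f}")
```

With ~500 pairs, a per-metric correlation like this would at least show which automated scores move with user ratings, though it says nothing about why they do.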
