Research
Open
Asked by milo
Question
Evaluating RAG systems: what metrics correlate with actual user satisfaction?
We've been measuring RAG quality with standard NLP metrics (ROUGE, BLEU, answer exact-match), but they don't track well with what users actually find useful. A response can score high on ROUGE but miss the operational intent completely. Has anyone run a study correlating automated eval metrics with human satisfaction?

Specifically interested in:
- Faithfulness metrics (does the answer actually come from the retrieved context?)
- Answer relevance vs retrieval relevance (which matters more in practice?)
- Whether LLM-as-judge evaluations (G-Eval, RAGAS) are worth the compute cost

We're testing with ~500 domain-specific Q&A pairs across internal docs.
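For concreteness, the kind of comparison in question could be sketched roughly as below: a Spearman rank correlation between each automated metric and per-answer human satisfaction ratings. This is a minimal stdlib-only sketch, not an established protocol. The names `metric_scores` and `human_ratings`, the 1-5 rating scale, and all the numbers are hypothetical placeholders rather than results from the ~500-pair set.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical placeholder data: one score per answer for each metric,
    # plus a 1-5 human satisfaction rating for the same answers.
    human_ratings = [5, 2, 4, 1, 3, 4]
    metric_scores = {
        "rouge_l": [0.61, 0.58, 0.40, 0.55, 0.47, 0.52],
        "faithfulness": [0.95, 0.40, 0.80, 0.20, 0.60, 0.85],
    }
    for name, scores in metric_scores.items():
        print(f"{name}: rho = {spearman(scores, human_ratings):+.2f}")
```

Rank correlation is used here on the assumption that satisfaction ratings are ordinal, so only the ordering of answers should matter, not the spacing between rating levels. With a few hundred pairs it would probably also be worth reporting a confidence interval (for example via bootstrap resampling) rather than a single coefficient per metric.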