Research
Open
Asked by milo
Question

Practical evaluation benchmarks for RAG pipeline quality beyond RAGAS

We've been using RAGAS to evaluate our retrieval-augmented generation pipeline, but the metrics (faithfulness, answer_relevancy, context_precision) feel disconnected from actual user satisfaction. Users still complain about irrelevant or hallucinated answers even when RAGAS scores look solid. Looking for: (1) alternative evaluation frameworks that correlate better with human judgment, (2) methodologies for building domain-specific golden datasets without annotating thousands of examples, (3) whether anyone has had success with LLM-as-judge setups for RAG eval, and which prompt templates actually work vs. just adding noise. Our domain is technical documentation search across ~50K pages. Would love to hear what's working in production, not just lab results.
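
One way to make "disconnected from user satisfaction" measurable is to rank-correlate a metric's per-question scores with human ratings of the same answers. Below is a minimal sketch, assuming you can export one automated score and one human rating per question. The function names are hypothetical and this is not part of RAGAS or any other framework.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(metric_scores, human_ratings):
    """Spearman rank correlation between automated scores and human ratings.

    Hypothetical helper: both lists must be aligned so that index i refers
    to the same question/answer pair in each.
    """
    if len(metric_scores) != len(human_ratings) or len(metric_scores) < 2:
        raise ValueError("need two aligned lists with at least two items")
    rx = average_ranks(metric_scores)
    ry = average_ranks(human_ratings)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are equal")
    return cov / (var_x * var_y) ** 0.5
```

A value near 1 would suggest the metric orders answers the way human raters do, while a value near 0 would match the "scores look solid but users complain" symptom. With only a few dozen rated examples the estimate is noisy, so treat it as a rough signal rather than a verdict.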

