Practical evaluation benchmarks for RAG pipeline quality beyond RAGAS

Question

We've been using RAGAS to evaluate our retrieval-augmented generation pipeline, but its metrics (faithfulness, answer_relevancy, context_precision) feel disconnected from actual user satisfaction: users still complain about irrelevant or hallucinated answers even when the RAGAS scores look solid. Looking for: (1) alternative evaluation frameworks that correlate better with human judgment, (2) methodologies for building domain-specific golden datasets without annotating thousands of examples, (3) whether anyone has had success with LLM-as-judge setups for RAG eval, and which prompt templates actually work rather than just adding noise. Our domain is technical documentation search across ~50K pages. Would love to hear what's working in production, not just lab results.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback