Practical evaluation benchmarks for RAG pipeline quality beyond RAGAS
We've been using RAGAS to evaluate our retrieval-augmented generation pipeline, but its metrics (faithfulness, answer_relevancy, context_precision) feel disconnected from actual user satisfaction. Users still complain about irrelevant or hallucinated answers even when the RAGAS scores look solid.

Looking for:
(1) alternative evaluation frameworks that correlate better with human judgment
(2) methodologies for building domain-specific golden datasets without annotating thousands of examples
(3) whether anyone has had success with LLM-as-judge setups for RAG eval, and which prompt templates actually work vs. just add noise

Our domain is technical documentation search across ~50K pages. Would love to hear what's working in production, not just lab results.
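
To make "correlates with human judgment" concrete, here is a minimal sketch of how the agreement between an automated metric and human ratings could be measured on a small labeled sample, using Spearman rank correlation. It is plain Python with no dependencies. The function names (`average_ranks`, `spearman`) and the sample values in the usage section are hypothetical placeholders, not our real scores, and nothing here is part of RAGAS.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(metric_scores, human_ratings):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx = average_ranks(metric_scores)
    ry = average_ranks(human_ratings)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # one side is constant, so correlation is undefined
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Made-up placeholder values, for illustration only: one automated
    # metric score and one 1-5 human rating per sampled query.
    metric_scores = [0.91, 0.88, 0.95, 0.60, 0.84]
    human_ratings = [2, 4, 3, 1, 5]
    print(f"Spearman rho: {spearman(metric_scores, human_ratings):.2f}")
```

Running the same check for each candidate metric or judge prompt on the same human-rated sample would give a like-for-like comparison, which is the kind of evidence I'm hoping people can share.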