Measuring hallucination rates in domain-specific RAG: what's your ground truth methodology?
We've got a RAG pipeline over ~50K internal engineering docs (API specs, runbooks, post-mortems). The retrieval part is solid (hybrid BM25 + dense, ~0.72 NDCG@10), but measuring whether the *generation* hallucinates is proving harder than expected.

Current approach:

- We manually label 200 random queries as "supported by context" vs "hallucinated", but that's expensive and doesn't scale
- RAGAS Faithfulness metric gives us ~0.78, but when we spot-check the "faithful" outputs, ~15% of them still contain claims not in the retrieved chunks
- NLI-based evaluation (using a separate LLM as judge) is faster but seems to miss subtle hallucinations (wrong version numbers, conflated service names)

What's working for your team?

- Are you using synthetic Q&A generation from source docs as a cheaper ground-truth proxy?
- Any success with fact-extraction pipelines that compare claims against retrieved context deterministically?
- Is there a threshold where manual review is still unavoidable, or can you automate 90%+ of hallucination detection?

Particularly interested in approaches that don't require GPT-4 as a judge (cost is prohibitive at our query volume).
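
To make the deterministic fact-extraction question concrete, here is a minimal sketch of the kind of check I have in mind. Everything in it is an illustrative assumption rather than anything from our actual pipeline: the regexes, the function names (`extract_entities`, `unsupported_entities`), and the example service names are hypothetical. It only targets the two failure modes mentioned above (version numbers and service-style identifiers) and flags any that appear in the answer but in none of the retrieved chunks.

```python
import re

# Hypothetical patterns; tune these to your own docs' naming conventions.
# Matches versions such as "2.4", "v2.4.0", "1.10.3".
VERSION_RE = re.compile(r"\bv?\d+\.\d+(?:\.\d+)*\b")
# Assumes service names are hyphenated lowercase tokens such as "billing-api".
SERVICE_RE = re.compile(r"\b[a-z][a-z0-9]*(?:-[a-z0-9]+)+\b")


def extract_entities(text: str) -> set[str]:
    """Pull version numbers and service-style identifiers out of text."""
    lowered = text.lower()
    # Normalize "v2.4.0" and "2.4.0" to the same form.
    versions = {m.lstrip("v") for m in VERSION_RE.findall(lowered)}
    services = set(SERVICE_RE.findall(lowered))
    return versions | services


def unsupported_entities(answer: str, chunks: list[str]) -> set[str]:
    """Return entities in the answer that appear in none of the chunks."""
    context_entities: set[str] = set()
    for chunk in chunks:
        context_entities |= extract_entities(chunk)
    return extract_entities(answer) - context_entities


if __name__ == "__main__":
    answer = "Deploy auth-service v2.4.0 behind billing-api."
    chunks = [
        "auth-service v2.3.1 release notes",
        "billing-api routes traffic to auth-service",
    ]
    print(unsupported_entities(answer, chunks))  # prints {'2.4.0'}
```

A check like this is cheap and needs no LLM call, but it can only catch entity-level mismatches. Any ordinary hyphenated word in the answer (for example "post-mortem") will also be flagged if it is missing from the chunks, and a claim that reuses the right entities in the wrong relationship will pass, so it would complement rather than replace the other methods.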