Measuring hallucination rates in domain-specific RAG: what's your ground truth methodology?

Question

We've got a RAG pipeline over ~50K internal engineering docs (API specs, runbooks, post-mortems). The retrieval part is solid (hybrid BM25 + dense, ~0.72 NDCG@10), but measuring whether the *generation* hallucinates is proving harder than expected.

Current approach:
- We manually label 200 random queries as "supported by context" or "hallucinated", but that's expensive and doesn't scale
- RAGAS Faithfulness metric gives us ~0.78, but when we spot-check the "faithful" outputs, ~15% of them still contain claims not in the retrieved chunks
- NLI-style entailment checking (with a separate LLM as the judge) is faster, but it seems to miss subtle hallucinations such as wrong version numbers and conflated service names
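
For a sense of how much a 200-query labeled sample can actually tell us, here is a rough sketch of a Wilson score interval on an observed hallucination rate. The function name `wilson_interval` and the count of 30 flagged answers are made up for illustration and are not one of our measurements; only the sample size of 200 comes from our setup.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical count: 30 of 200 labeled answers marked "hallucinated".
low, high = wilson_interval(hits=30, n=200)
print(f"hallucination rate between {low:.3f} and {high:.3f}")
```

With numbers in this range the interval is about ten percentage points wide, which is part of why the manual set feels too small to track regressions between pipeline versions.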

What's working for your team?
- Are you using synthetic Q&A generation from source docs as a cheaper ground-truth proxy?
- Any success with fact-extraction pipelines that compare claims against retrieved context deterministically?
- Is there a threshold where manual review is still unavoidable, or can you automate 90%+ of hallucination detection?
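
To make the fact-extraction question concrete, this is a minimal sketch of the kind of deterministic check we have in mind, not something we run in production. It pulls version strings and service-like identifiers out of the answer with regexes and flags any that never appear in the retrieved chunks. Both patterns and the function names are assumptions: in particular, the service pattern assumes lowercase hyphenated names (for example `billing-api`) and will also match ordinary hyphenated words.

```python
import re

# Assumed pattern: dotted versions such as 2.4.1 or v2.4.1.
VERSION_RE = re.compile(r"\bv?\d+\.\d+(?:\.\d+)*\b")
# Assumed naming convention: lowercase, hyphenated service names.
SERVICE_RE = re.compile(r"\b[a-z][a-z0-9]*(?:-[a-z0-9]+)+\b")


def extract_claims(text: str) -> set[str]:
    """Return version strings (leading 'v' stripped) and service-like names."""
    versions = {m.group(0).lstrip("v") for m in VERSION_RE.finditer(text)}
    services = set(SERVICE_RE.findall(text))
    return versions | services


def unsupported_claims(answer: str, chunks: list[str]) -> set[str]:
    """Return extracted items in the answer that appear in no retrieved chunk."""
    supported: set[str] = set()
    for chunk in chunks:
        supported |= extract_claims(chunk)
    return extract_claims(answer) - supported


# Made-up example data, only to show the shape of the check.
chunks = [
    "Deploy auth-service 2.4.1 behind the gateway.",
    "billing-api depends on auth-service.",
]
answer = "billing-api requires auth-service v2.5.0."
print(unsupported_claims(answer, chunks))
```

A check like this can only catch entities that are absent from the context. It would not catch a conflation where both service names are present in the chunks but the answer attaches a fact to the wrong one, which is exactly the case we are least sure how to automate.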

Particularly interested in approaches that don't require GPT-4 as a judge (cost is prohibitive at our query volume).

Direct answers and proposed approaches

Risks, gaps, and constructive pushback