Measuring hallucination rates in RAG pipelines — benchmark approach?

Question

I'm building an evaluation harness for our RAG pipeline and struggling to quantify hallucination rates in a reproducible way.

Current approach:
- Ground truth: curated set of 200 doc snippets with known answers
- Generation: GPT-4o-mini + Claude Haiku via LiteLLM proxy
- Eval: LLM-as-judge comparing generated answer against ground truth

Problem: the LLM-as-judge itself hallucinates. It produces false positives, marking answers as 'supported' when the cited text doesn't actually contain the claim. Cross-checking with a second judge model helps but doubles the evaluation cost.
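
For reference, a rough sketch of how the judge's false-positive rate could be measured. It assumes a small hand-labeled audit sample, which isn't part of the setup above; the function name and the labels below are made up purely for illustration.

```python
def judge_false_positive_rate(judge_verdicts, human_labels):
    """Share of human-labeled 'unsupported' answers that the judge marked 'supported'."""
    # Keep only the judge verdicts for answers a human marked as unsupported.
    on_unsupported = [
        verdict
        for verdict, label in zip(judge_verdicts, human_labels)
        if label == "unsupported"
    ]
    if not on_unsupported:
        return 0.0  # nothing to be wrong about
    wrong = sum(verdict == "supported" for verdict in on_unsupported)
    return wrong / len(on_unsupported)


# Hypothetical audit sample, not real results.
judge = ["supported", "supported", "unsupported", "supported"]
human = ["supported", "unsupported", "unsupported", "unsupported"]
print(judge_false_positive_rate(judge, human))
```

Tracking this number per judge model would at least show whether the second judge is worth the extra cost.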

Has anyone built a more deterministic evaluation? I'm considering embedding-similarity thresholds between cited passages and generated claims, but I'm unsure whether that captures semantic hallucination well enough.
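
To make the threshold idea concrete, here is a minimal sketch. The `embed` function is a toy bag-of-words stand-in for a real embedding model, the 0.8 threshold is arbitrary, and the example sentences are invented; none of this is from an existing tool.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_supported(claim, passage, threshold=0.8):
    """Deterministic check: the claim counts as supported if similarity clears the threshold."""
    return cosine(embed(claim), embed(passage)) >= threshold


passage = "The service retries failed requests three times"
faithful = "The service retries failed requests three times"
negated = "The service never retries failed requests three times"
unrelated = "The cache expires after ten minutes"

for claim in (faithful, negated, unrelated):
    print(claim, "->", is_supported(claim, passage))
```

With this toy vectorizer the negated claim still clears the threshold, which is exactly the kind of semantic hallucination I'm worried a pure similarity score would miss. A real embedding model may behave differently, so treat this as an illustration of the concern rather than evidence.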

Open to tool recommendations (RAGAS, DeepEval, custom) or methodological pointers.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback