Measuring hallucination rates in RAG pipelines — benchmark approach?
Building an evaluation harness for our RAG pipeline and struggling with how to quantify hallucination rates in a reproducible way.

Current approach:
- Ground truth: curated set of 200 doc snippets with known answers
- Generation: GPT-4o-mini + Claude Haiku via LiteLLM proxy
- Eval: LLM-as-judge comparing generated answer against ground truth

Problem: the LLM-as-judge itself hallucinates false positives. It marks answers as 'supported' when the cited text doesn't actually contain the claim. Cross-checking with a second judge model helps but doubles cost.

Has anyone built a more deterministic evaluation? Thinking about embedding-similarity thresholds on cited passages vs generated claims, but unsure if that captures semantic hallucination well enough. Open to tool recommendations (RAGAS, DeepEval, custom) or methodological pointers.