Practical ways to evaluate hallucination rate in production RAG pipelines

Question

We've got a production RAG system serving ~50k queries/day across internal docs and ticket data. We know hallucinations happen; the question is how to measure them at scale without manually reviewing every response.

Current approach: random 1% sample reviewed by a human weekly. Obviously insufficient.
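
To make the sampling baseline concrete, here is a minimal sketch of what a weekly sample can say about the overall rate. It assumes the 1% is drawn uniformly from a full week of traffic (roughly 3,500 responses at ~50k queries/day) and uses a Wilson score interval. The flagged count is a made-up placeholder, not a measured result, and the function and variable names are illustrative.

```python
import math


def wilson_interval(flagged: int, reviewed: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    p = flagged / reviewed
    denom = 1 + z * z / reviewed
    center = (p + z * z / (2 * reviewed)) / denom
    half = z * math.sqrt(p * (1 - p) / reviewed + z * z / (4 * reviewed * reviewed)) / denom
    return max(0.0, center - half), min(1.0, center + half)


QUERIES_PER_DAY = 50_000   # from the post (approximate)
SAMPLE_FRACTION = 0.01     # the 1% random sample
reviewed = round(QUERIES_PER_DAY * 7 * SAMPLE_FRACTION)  # assumed weekly sample size

flagged = 140  # HYPOTHETICAL count of responses the reviewer marked as hallucinated

low, high = wilson_interval(flagged, reviewed)
print(f"reviewed={reviewed}, observed rate={flagged / reviewed:.3%}")
print(f"approx. 95% interval: {low:.3%} to {high:.3%}")
```

This only bounds the aggregate rate. Any per-slice estimate (for example docs versus ticket data) rests on far fewer reviewed items, so its interval is correspondingly wider.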

What's worked for you?
- LLM-as-judge with a reference corpus? We tried GPT-4o as evaluator and it over-flagged edge cases as hallucinations.
- Citation-grounded scoring (claim-by-claim match to retrieved chunks)? We built a prototype, but it produces a high rate of false positives on paraphrased content.
- User feedback signals (thumbs down + follow-up queries)? Noisy but cheap.
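
The paraphrase problem with claim-by-claim matching can be illustrated with a minimal sketch. This is not the prototype mentioned above: the lexical-overlap scorer, the function names, the 0.7 threshold, and the example text are all assumptions chosen to show the failure mode with nothing but the standard library.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; deliberately naive."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_claims(answer: str) -> list[str]:
    """Treat each sentence as one claim (a crude stand-in for claim extraction)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def support_score(claim: str, chunks: list[str]) -> float:
    """Best fraction of the claim's tokens found in any single retrieved chunk."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return max(
        (len(claim_tokens & tokens(chunk)) / len(claim_tokens) for chunk in chunks),
        default=0.0,
    )


def unsupported_claims(answer: str, chunks: list[str], threshold: float = 0.7) -> list[str]:
    """Claims whose best overlap score falls below the (assumed) threshold."""
    return [c for c in split_claims(answer) if support_score(c, chunks) < threshold]


# Hypothetical retrieved chunk and answer, for illustration only.
chunks = ["Password resets require approval from the account owner's manager."]
answer = (
    "Password resets require approval from the account owner's manager. "
    "A supervisor must sign off before credentials are changed."
)

flagged_claims = unsupported_claims(answer, chunks)
print(flagged_claims)
```

The second sentence restates the chunk faithfully but shares no tokens with it, so a purely lexical matcher flags it as unsupported. Swapping `support_score` for an embedding-similarity or entailment-based scorer is the usual direction for reducing this, at higher compute cost.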

Looking for approaches that balance precision with operational cost. Bonus points if you've done this with open-source models rather than sending everything to a proprietary API.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback