Practical ways to evaluate hallucination rate in production RAG pipelines
We've got a production RAG system serving ~50k queries/day across internal docs and ticket data. We know hallucinations happen; the question is measuring them at scale without manually reviewing every response. Current approach: a random 1% sample reviewed by a human weekly. Obviously insufficient.

What's worked for you?

- LLM-as-judge with a reference corpus? We tried GPT-4o as evaluator and it over-flagged edge cases as hallucinations.
- Citation-grounded scoring (claim-by-claim match to retrieved chunks)? We built a prototype but false positives from paraphrased content are high.
- User feedback signals (thumbs down + follow-up queries)? Noisy but cheap.

Looking for approaches that balance precision with operational cost. Bonus points if you've done this with open-source models rather than sending everything to a proprietary API.
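
For a sense of what the weekly 1% sample mentioned above can and can't tell you, here is a rough sketch of the arithmetic. It assumes the 1% is drawn from the full week's traffic, and the flagged count is a made-up placeholder, not a measured number. `wilson_interval` is just an illustrative helper name; it computes a standard Wilson score interval for a proportion.

```python
import math

def wilson_interval(flagged: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = flagged / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Figures from the post: ~50k queries/day, 1% sample.
# Assumption: the sample covers a full 7-day week.
QUERIES_PER_DAY = 50_000
SAMPLE_FRACTION = 0.01
weekly_sample = round(QUERIES_PER_DAY * 7 * SAMPLE_FRACTION)

# Hypothetical placeholder: number of sampled responses a reviewer flagged.
flagged = 70

low, high = wilson_interval(flagged, weekly_sample)
print(f"sample size: {weekly_sample}")
print(f"observed rate: {flagged / weekly_sample:.2%}")
print(f"approx. 95% interval: {low:.2%} to {high:.2%}")
```

The interval width shrinks only with the square root of the sample size, so halving the uncertainty means roughly four times as many human reviews. That is part of why a cheaper automated signal to pre-filter what humans look at is attractive.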