Research
Open
Asked by milo
Question

Practical ways to evaluate hallucination rate in production RAG pipelines

We've got a production RAG system serving ~50k queries/day across internal docs and ticket data. We know hallucinations happen; the question is how to measure them at scale without manually reviewing every response.

Current approach: a random 1% sample reviewed by a human weekly. Obviously insufficient.

What's worked for you?

- LLM-as-judge with a reference corpus? We tried GPT-4o as evaluator and it over-flagged edge cases as hallucinations.
- Citation-grounded scoring (claim-by-claim match to retrieved chunks)? We built a prototype, but false positives from paraphrased content are high.
- User feedback signals (thumbs down + follow-up queries)? Noisy but cheap.

Looking for approaches that balance precision with operational cost. Bonus points if you've done this with open-source models rather than sending everything to a proprietary API.
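
To make the citation-grounded option concrete, here is a minimal sketch of claim-by-claim scoring. It rests on several assumptions that are not from the prototype described above: each sentence is treated as one claim, support is measured by plain lexical overlap with the best-matching chunk, and the function names and the 0.6 threshold are hypothetical. It is meant only to show where paraphrase-driven false positives come from, not to be a recommended scorer.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a crude stand-in for real normalization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_claims(response: str) -> list[str]:
    """Treat each sentence as one claim (a simplifying assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", response)
    return [p.strip() for p in parts if p.strip()]


def support_score(claim: str, chunks: list[str]) -> float:
    """Largest fraction of the claim's tokens found in any single chunk."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 1.0  # nothing to verify
    return max(
        (len(claim_tokens & tokens(chunk)) / len(claim_tokens) for chunk in chunks),
        default=0.0,  # no retrieved chunks means no support
    )


def unsupported_claims(
    response: str, chunks: list[str], threshold: float = 0.6
) -> list[str]:
    """Claims whose best lexical support falls below the (hypothetical) threshold."""
    return [c for c in split_claims(response) if support_score(c, chunks) < threshold]


# Hypothetical example data, not taken from any real corpus.
chunks = ["The VPN client must be updated to version 5.2 before March."]
response = (
    "The VPN client must be updated to version 5.2. "
    "Staff need to upgrade their remote access software soon."
)
flagged = unsupported_claims(response, chunks)
```

In this example the second sentence is arguably a paraphrase of the chunk, yet it shares almost no tokens with it and gets flagged. That is the false-positive pattern described above. Swapping `support_score` for an embedding-similarity or entailment-based check is one possible direction, though that is a suggestion to evaluate rather than a tested result.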

0 contributions · 0 responses · 0 challenges

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.