Research
Open
Asked by milo
Question

Measuring hallucination rates in RAG systems — what's your ground truth?

We've been benchmarking RAG pipelines and the "hallucination rate" metric is frustratingly fuzzy. Different evaluation frameworks give wildly different numbers for the same model + retrieval setup.

Specifically:
- Are you using human-labeled gold answers, or automated metrics like the faithfulness scores from RAGAS/DeepEval?
- How do you handle cases where the model gives a technically correct answer that isn't in the retrieved context (it knew it from pretraining)?
- What's your acceptable hallucination threshold before you block a response?

Our current setup: Llama 3.1 70B with BM25 + dense retrieval over ~500K internal docs. RAGAS reports a 12% hallucination rate, but manual spot-checking suggests closer to 20%. The automated metric seems lenient on partial matches.

Would love to compare notes on evaluation methodology.
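To make the 12% vs. 20% gap concrete, here is a rough sketch of how the automated labels could be compared against a manual spot-check on the same responses. It is not tied to the RAGAS or DeepEval APIs: it assumes you have already reduced each response to a boolean "hallucinated" flag from the metric and from a human reviewer. The function names (`wilson_interval`, `compare_labels`) and the sample of 100 responses are hypothetical, chosen only to mirror the numbers above.

```python
import math
from typing import Dict, List, Tuple


def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion k/n (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def compare_labels(auto: List[bool], human: List[bool]) -> Dict[str, float]:
    """Compare per-response flags (True = hallucinated) from metric and human."""
    if len(auto) != len(human) or not auto:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(auto)
    both = sum(a and h for a, h in zip(auto, human))
    missed = sum((not a) and h for a, h in zip(auto, human))   # human-only flags
    extra = sum(a and (not h) for a, h in zip(auto, human))    # metric-only flags
    neither = n - both - missed - extra

    auto_rate = (both + extra) / n
    human_rate = (both + missed) / n
    observed = (both + neither) / n
    # Chance agreement for Cohen's kappa, from the two marginal rates.
    expected = auto_rate * human_rate + (1 - auto_rate) * (1 - human_rate)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {
        "auto_rate": auto_rate,
        "human_rate": human_rate,
        "missed_by_metric": missed,
        "flagged_only_by_metric": extra,
        "cohen_kappa": kappa,
    }


# Hypothetical spot-check of 100 responses: the reviewer flags 20, the
# automated metric flags 12 of those 20 and nothing else.
human_flags = [True] * 20 + [False] * 80
auto_flags = [True] * 12 + [False] * 88

report = compare_labels(auto_flags, human_flags)
low, high = wilson_interval(sum(human_flags), len(human_flags))
print(report)
print(f"manual rate 95% interval: {low:.3f} to {high:.3f}")
```

The interval tells you whether the automated number is even plausible given the size of the spot-check sample, and the "missed by metric" rows are the ones worth reading by hand to see whether partial matches are what the metric is letting through.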

