Evaluating RAG retrieval quality: beyond hit-rate metrics
We've been measuring RAG pipeline quality with standard hit-rate@k and MRR, but these don't capture whether the retrieved chunks are actually useful for generation. A chunk can be semantically close (high embedding similarity) but contain noise or tangential info that degrades the final answer.

What I'm curious about:

- Are teams using LLM-as-judge for retrieval evaluation (e.g., "does this chunk contain information relevant to the question?")? How do you control for judge bias?
- Have you had success with Faithfulness/Answer Relevance metrics from RAGAS or similar frameworks in production?
- Is there a practical way to measure retrieval quality end-to-end without manually labeling hundreds of query-chunk pairs?

We're on LangChain + Pinecone, ~2M document chunks. Manual labeling is not scalable.
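
For concreteness, here is a minimal sketch of the two label-based metrics mentioned above, next to a judge-based precision@k that needs no gold labels. It is illustrative only and not tied to LangChain or Pinecone: the function names, the toy data, and the `judge(question, chunk_text) -> bool` callable are all assumptions. In practice the judge would wrap an LLM call; here it is a keyword stub so the example runs on its own.

```python
from typing import Callable, Sequence, Set


def hit_rate_at_k(retrieved: Sequence[Sequence[str]],
                  relevant: Sequence[Set[str]], k: int) -> float:
    """Fraction of queries with at least one labeled-relevant chunk ID in the top k."""
    hits = sum(
        1 for ids, rel in zip(retrieved, relevant)
        if any(chunk_id in rel for chunk_id in ids[:k])
    )
    return hits / len(retrieved)


def mrr(retrieved: Sequence[Sequence[str]],
        relevant: Sequence[Set[str]]) -> float:
    """Mean reciprocal rank of the first labeled-relevant chunk (0 if none is retrieved)."""
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ids, start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)


def judged_precision_at_k(question: str, chunks: Sequence[str],
                          judge: Callable[[str, str], bool], k: int) -> float:
    """Share of the top-k chunk texts that the judge marks as relevant to the question."""
    top = chunks[:k]
    if not top:
        return 0.0
    return sum(1 for chunk in top if judge(question, chunk)) / len(top)


# Toy data (hypothetical): retrieved chunk IDs per query and gold labels per query.
retrieved_ids = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
gold_labels = [{"b"}, {"d"}, {"z"}]

print("hit-rate@3:", hit_rate_at_k(retrieved_ids, gold_labels, k=3))
print("MRR:", mrr(retrieved_ids, gold_labels))


# Stand-in for an LLM judge (assumption): a real one would prompt a model
# with the question and the chunk and parse a yes/no verdict.
def stub_judge(question: str, chunk_text: str) -> bool:
    return "refund" in chunk_text.lower()


chunk_texts = [
    "Refunds are accepted within 30 days.",
    "Our office is closed on holidays.",
    "Refund requests need a receipt.",
    "Shipping takes 5 days.",
]
print("judged precision@3:",
      judged_precision_at_k("What is the refund window?", chunk_texts, stub_judge, k=3))
```

The first two functions depend on labeled query-chunk pairs, which is exactly the part that doesn't scale. The third only needs the question and the retrieved text, so its cost moves from labeling to judge calls, and its reliability depends entirely on how well the judge is calibrated.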