Evaluating RAG retrieval quality: beyond hit-rate metrics

Question

We've been measuring RAG pipeline quality with standard hit-rate@k and MRR, but these don't capture whether the retrieved chunks are actually useful for generation. A chunk can be semantically close (high embedding similarity) but contain noise or tangential info that degrades the final answer.
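For reference, this is roughly how we compute the two metrics today. It is a minimal sketch, not our actual pipeline code: the function names, chunk IDs, and toy data are made up for illustration, and it assumes each query has a set of labeled-relevant chunk IDs.

```python
from typing import Sequence, Set


def hit_rate_at_k(retrieved: Sequence[Sequence[str]],
                  relevant: Sequence[Set[str]],
                  k: int) -> float:
    """Fraction of queries with at least one labeled-relevant chunk in the top k."""
    hits = sum(
        1 for ids, rel in zip(retrieved, relevant)
        if any(chunk_id in rel for chunk_id in ids[:k])
    )
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: Sequence[Sequence[str]],
                         relevant: Sequence[Set[str]]) -> float:
    """Mean of 1/rank of the first labeled-relevant chunk (0 if none is retrieved)."""
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ids, start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Toy data (hypothetical): ranked chunk IDs per query, and the labeled-relevant IDs.
retrieved = [["c1", "c7", "c3"], ["c9", "c2", "c4"], ["c5", "c6", "c8"]]
relevant = [{"c7"}, {"c9"}, {"c0"}]

print(hit_rate_at_k(retrieved, relevant, k=2))
print(mean_reciprocal_rank(retrieved, relevant))
```

Both functions only check whether a labeled ID appears in the ranking. Neither looks at the chunk text, which is why a noisy or tangential chunk scores the same as a clean one.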

What I'm curious about:
- Are teams using LLM-as-judge for retrieval evaluation (e.g., "does this chunk contain information relevant to the question?")? How do you control for judge bias?
- Have you had success with Faithfulness/Answer Relevance metrics from RAGAS or similar frameworks in production?
- Is there a practical way to measure retrieval quality end-to-end without manually labeling hundreds of query-chunk pairs?
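
To make the first bullet concrete, the LLM-as-judge setup I have in mind looks roughly like the sketch below. Everything in it is an assumption for illustration: the prompt wording, the `judged_precision_at_k` name, and especially `stub_judge`, which is a keyword-matching placeholder standing in for a real model call so the example runs offline.

```python
from typing import Callable, List

# Hypothetical prompt; the exact wording is an assumption, not a tested template.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Chunk: {chunk}\n"
    "Does this chunk contain information needed to answer the question? "
    "Answer YES or NO."
)


def judged_precision_at_k(question: str,
                          chunks: List[str],
                          judge: Callable[[str], str],
                          k: int) -> float:
    """Share of the top-k chunks that the judge marks as relevant."""
    top = chunks[:k]
    if not top:
        return 0.0
    votes = [
        judge(JUDGE_PROMPT.format(question=question, chunk=chunk))
        .strip().upper().startswith("YES")
        for chunk in top
    ]
    return sum(votes) / len(top)


def stub_judge(prompt: str) -> str:
    """Placeholder judge: says YES when the chunk line mentions 'refund'.

    A real setup would send `prompt` to an LLM here instead.
    """
    chunk_text = prompt.split("Chunk: ", 1)[1].split("\n", 1)[0]
    return "YES" if "refund" in chunk_text.lower() else "NO"


chunks = [
    "Refunds are accepted within 30 days.",
    "Our office is in Berlin.",
    "Refund requests need a receipt.",
]
score = judged_precision_at_k("What is the refund window?", chunks, stub_judge, k=3)
print(score)
```

Keeping the judge as a plain callable would let us swap models or prompts and compare their verdicts on the same chunks, which is one way we could start probing for judge bias.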

We're on LangChain + Pinecone with ~2M document chunks, so manual labeling doesn't scale.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback