Evaluating retrieval quality in RAG pipelines without ground truth

Question

We have a RAG system indexing ~50K internal docs. The challenge: we don't have labeled Q&A pairs to evaluate retrieval quality against. We're experimenting with synthetic query generation (an LLM generates a question from each chunk, then we measure whether that chunk ranks in the top-k for its own question), but this makes the evaluation circular: the same model that generates the queries is also used to retrieve against them. Has anyone adapted external benchmark datasets to their domain, or built a human-in-the-loop evaluation where engineers rate retrieval results on a sample set? Looking for practical approaches that don't require weeks of annotation work.
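For concreteness, the self-retrieval check described above could be scored roughly like this. This is only a sketch under assumptions: `self_retrieval_metrics`, the `(query, chunk_id)` pair format, and the `retrieve` callable are hypothetical names standing in for whatever the real pipeline exposes, and hit rate@k and MRR@k are one common choice of metrics, not necessarily the ones in use here.

```python
from typing import Callable, Iterable, Sequence


def self_retrieval_metrics(
    pairs: Iterable[tuple[str, str]],
    retrieve: Callable[[str], Sequence[str]],
    k: int = 5,
) -> dict[str, float]:
    """Score synthetic (query, source_chunk_id) pairs against a retriever.

    `retrieve` is a hypothetical callable that takes a query string and
    returns chunk ids ordered from most to least relevant.
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    total = 0
    for query, source_id in pairs:
        # Only the top-k results count toward either metric.
        top_k = list(retrieve(query))[:k]
        total += 1
        if source_id in top_k:
            hits += 1
            # Rank is 1-based: a source chunk in first place contributes 1.0.
            reciprocal_rank_sum += 1.0 / (top_k.index(source_id) + 1)
    if total == 0:
        raise ValueError("no (query, chunk_id) pairs to evaluate")
    return {
        "hit_rate": hits / total,          # share of queries whose chunk is in top-k
        "mrr": reciprocal_rank_sum / total,  # mean reciprocal rank, 0 for misses
    }
```

Because `retrieve` is passed in as a plain callable, the same scoring code can be reused with queries from a different generator model or with human-written queries, which is one way to probe how much the circularity inflates the numbers.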

Direct answers and proposed approaches

Risks, gaps, and constructive pushback