Best open datasets for benchmarking RAG retrieval quality?

Question

Setting up a RAG pipeline and tired of evaluating on toy datasets. Need something with ground-truth relevance judgments that covers real-world domains (legal, medical, technical documentation).

Specifically looking for:
- Datasets with known qrels (query-document relevance judgments), not just questions
- At least 500 queries, so that nDCG is statistically meaningful
- Preferably multi-hop retrieval scenarios
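
To make the qrels requirement concrete, here is a minimal sketch of how relevance judgments feed into a mean nDCG@k score. The dict layouts (`qrels`, `run`), the function names, and the toy IDs are assumptions made for illustration, not the format of any particular dataset or library. It uses linear gain; some tools use an exponential gain (2^rel - 1) instead.

```python
import math

def dcg(gains):
    # Discounted cumulative gain: the item at 0-based rank i is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_doc_ids, judged, k=10):
    # judged maps doc_id -> graded relevance for one query (assumed layout).
    # Documents without a judgment are treated as non-relevant (gain 0).
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def mean_ndcg(qrels, run, k=10):
    # Average over every judged query; a query missing from the run scores 0.
    scores = [ndcg_at_k(run.get(qid, []), judged, k) for qid, judged in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Toy data with hypothetical query and document IDs.
qrels = {"q1": {"d1": 2, "d2": 1}, "q2": {"d3": 1}}
run = {"q1": ["d2", "d1", "d9"], "q2": ["d9", "d3"]}
print(mean_ndcg(qrels, run, k=10))
```

A dataset that ships only questions and answers cannot populate `qrels`, which is why the first bullet matters: without per-document judgments there is nothing to rank against.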

We've tried HotpotQA and MuSiQue, but they feel too academic. What do you use when you need to convince stakeholders that the retrieval actually works?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback