What's the actual signal-to-noise ratio in automated literature review tools

Question

I'm trialing a pipeline that ingests arXiv and PubMed abstracts for a specific domain (adversarial ML defenses), clusters them by topic, and produces ranked summaries. It uses a mix of SBERT embeddings and LLM summarization.

Initial results on a 2023-2024 corpus (847 papers):
- Clustering finds obvious groups (transfer attacks, certified robustness, defensive distillation)
- LLM summaries are readable but miss nuance: they flatten "we achieve X under Y constraints" into "method achieves X", dropping the conditions that qualify the result
- The ranking by novelty (embedding distance from prior work) produces interesting but sometimes nonsensical results
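
To make the novelty point concrete: one way to read "embedding distance from prior work" is the distance from a paper to its nearest earlier paper. Below is a minimal sketch of that reading, assuming cosine distance and tiny hand-made vectors in place of real SBERT embeddings. The function names and paper IDs are hypothetical, and this is not necessarily how any given pipeline scores novelty.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def novelty_score(candidate, prior):
    """Distance from a candidate embedding to its nearest prior embedding."""
    if not prior:
        return 1.0  # assumption: with no prior work, everything is novel
    return min(cosine_distance(candidate, p) for p in prior)

def rank_by_novelty(candidates, prior):
    """Sort (paper_id, score) pairs from most to least novel."""
    scored = [(pid, novelty_score(vec, prior)) for pid, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-d vectors standing in for SBERT embeddings (hypothetical data).
prior_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
candidate_embeddings = {
    "paper_a": [1.0, 0.05, 0.0],  # near-duplicate of prior work
    "paper_b": [0.0, 1.0, 0.0],   # different direction, some overlap
    "paper_c": [0.0, 0.0, 1.0],   # orthogonal to everything seen so far
}

ranking = rank_by_novelty(candidate_embeddings, prior_embeddings)
for paper_id, score in ranking:
    print(f"{paper_id}: {score:.3f}")
```

In this toy setup the vector orthogonal to all prior work ranks first. That may be one source of the nonsensical results: a score like this cannot tell a genuinely new idea apart from an off-topic paper or a badly extracted abstract, since all three sit far from the existing corpus.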

What I'm trying to figure out: is there a point where automated review is actually more useful than manual, or is it only good as a triage layer?

Specifically:
- Has anyone validated automated summaries against human-written ones for technical accuracy?
- What's a realistic precision/recall for "this paper is relevant to my query"?
- Do you trust embedding-based novelty scoring at all, or is it just a heuristic for serendipity?
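
For the precision/recall question, this is the measurement I have in mind: compare what the pipeline flags as relevant against a hand-labeled subset. A minimal sketch follows, assuming set-based relevance judgments. The paper IDs and labels are made up for illustration and are not results from my corpus.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall of retrieved IDs against labeled relevant IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: the pipeline returns 4 papers, a human marked 5 as relevant.
retrieved_ids = ["p1", "p2", "p3", "p4"]
relevant_ids = ["p2", "p3", "p5", "p6", "p7"]

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Recall is the hard part to estimate honestly, because it needs a labeled sample drawn from the whole corpus and not just from what the tool surfaced.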

I'm not asking for tool recommendations; I'm asking about the actual quality ceiling of this approach.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback