Practical experience with DSPy vs manual prompt engineering for RAG pipelines?

Question

We have a RAG pipeline that takes user questions, retrieves from ~50K internal documents, and generates answers. Currently the prompt is hand-tuned — about 200 tokens of system instructions with explicit formatting rules, few-shot examples, and a structured output template. It works well enough but degrades on edge cases (questions spanning multiple docs, conflicting sources).

I've been reading about DSPy's approach: declarative modules whose prompts are tuned automatically by an optimizer that bootstraps few-shot demonstrations and searches over instructions against a metric (not by backpropagation; no gradients are involved). The claim is that DSPy can beat hand-written prompts with less effort. But the examples I've seen are mostly on benchmark datasets (HotpotQA, GSM8K), not production RAG with messy internal docs.
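
To check that I understand the mechanism, here is a toy sketch of metric-driven demo selection as I understand the idea. It is not DSPy's API or implementation: `select_demos`, `build_prompt`, and `exact_match` are names I made up, `generate` stands in for whatever LM call you use, and the exhaustive search over demo subsets is only practical for a handful of candidates (real optimizers sample or search more cleverly).

```python
from itertools import combinations
from typing import Callable, Sequence

Example = tuple[str, str]  # (question, gold answer)


def exact_match(pred: str, gold: str) -> float:
    """Return 1.0 if the prediction equals the gold answer, ignoring case and outer whitespace."""
    return float(pred.strip().lower() == gold.strip().lower())


def build_prompt(instructions: str, demos: Sequence[Example], question: str) -> str:
    """Assemble instructions, few-shot demos, and the new question into one prompt string."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{instructions}\n{shots}\nQ: {question}\nA:"


def select_demos(
    generate: Callable[[str], str],   # hypothetical LM call: prompt in, answer out
    instructions: str,
    candidates: Sequence[Example],    # pool of possible few-shot demos
    devset: Sequence[Example],        # labeled examples used only for scoring
    k: int = 2,
) -> tuple[list[Example], float]:
    """Try every k-sized subset of candidates and keep the one with the best mean metric."""
    if not devset:
        raise ValueError("devset must not be empty")
    best_demos: tuple[Example, ...] = ()
    best_score = -1.0
    for demos in combinations(candidates, k):
        total = 0.0
        for question, gold in devset:
            pred = generate(build_prompt(instructions, demos, question))
            total += exact_match(pred, gold)
        score = total / len(devset)
        if score > best_score:  # strict comparison keeps the first subset on ties
            best_demos, best_score = demos, score
    return list(best_demos), best_score
```

If that picture is roughly right, the "optimization" is a search over prompt contents scored by a metric on labeled data, which is why the questions below about training data and re-optimization matter to us.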

Questions for anyone who's tried this in production:
- Did DSPy actually improve your RAG quality, or just shift where the complexity lives (now you debug the optimizer instead of the prompt)?
- How do you handle the training data requirement? DSPy optimizers need a set of examples plus a metric to optimize against. Did you generate the examples with GPT-4 as a teacher, or do you have human-labeled QA pairs?
- Is the optimization stable across document set updates? If you add 10K new docs, do you need to re-optimize?
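
For the last question, the kind of check I have in mind is sketched below: score the pipeline on a fixed labeled set before and after a document update, and only re-optimize when the score drops by more than some tolerance. Everything here is hypothetical (`LabeledQA`, `accuracy`, `needs_reoptimization`, the substring metric, and the 0.02 default tolerance are placeholders I picked), and `pipeline` is just any callable that maps a question to an answer string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LabeledQA:
    question: str
    gold_answer: str


def answer_contains_gold(pred: str, gold: str) -> bool:
    """Lenient metric: the gold answer appears somewhere in the prediction (case-insensitive)."""
    return gold.strip().lower() in pred.strip().lower()


def accuracy(pipeline: Callable[[str], str], evalset: list[LabeledQA]) -> float:
    """Fraction of eval questions whose prediction contains the gold answer."""
    if not evalset:
        raise ValueError("evalset must not be empty")
    hits = sum(answer_contains_gold(pipeline(ex.question), ex.gold_answer) for ex in evalset)
    return hits / len(evalset)


def needs_reoptimization(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """Flag re-optimization only when accuracy dropped by more than the tolerance."""
    return (baseline - current) > tolerance
```

Usage would be something like computing `accuracy` once when the prompt is frozen, then again after each re-index, and calling `needs_reoptimization(baseline, current)`. I'd like to hear whether people find a check like this sufficient, or whether the eval set itself goes stale as the corpus grows.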

We're at the point where manual prompt tweaking gives diminishing returns. Looking for real production experience, not tutorial results.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback