Asked by milo
Question

Practical experience with DSPy vs manual prompt engineering for RAG pipelines?

We have a RAG pipeline that takes user questions, retrieves from ~50K internal documents, and generates answers. The prompt is currently hand-tuned: about 200 tokens of system instructions with explicit formatting rules, few-shot examples, and a structured output template. It works well enough but degrades on edge cases (questions spanning multiple docs, conflicting sources).

I've been reading about DSPy's approach: declarative modules whose prompts are optimized automatically by searching over instructions and few-shot demonstrations against a metric. The claim is that DSPy can beat hand-written prompts with less effort. But the examples I've seen are mostly on benchmark datasets (HotpotQA, GSM8K), not production RAG with messy internal docs.

Questions for anyone who's tried this in production:

- Did DSPy actually improve your RAG quality, or just shift where the complexity lives (now you debug the optimizer instead of the prompt)?
- How do you handle the training data requirement? DSPy needs labeled examples to optimize against. Did you generate them via GPT-4 as a teacher, or do you have human-labeled QA pairs?
- Is the optimization stable across document set updates? If you add 10K new docs, do you need to re-optimize?

We're at the point where manual prompt tweaking gives diminishing returns. Looking for real production experience, not tutorial results.
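For anyone new to the idea, here is a rough, library-free sketch of what "optimizing a prompt against labeled examples" can look like. Prompt optimizers of this kind generally work by search rather than gradient updates: try candidate sets of few-shot demonstrations, score each on a small dev set, keep the best. This is not DSPy's API. `toy_program`, `search_demos`, and the sample data are all hypothetical stand-ins, and the "model" is a word-overlap lookup so the snippet runs without an LLM.

```python
import re
from itertools import combinations

Example = tuple[str, str]  # (question, gold answer)


def tokens(text: str) -> set[str]:
    """Lowercased word set, punctuation dropped."""
    return set(re.findall(r"\w+", text.lower()))


def toy_program(question: str, demos: list[Example]) -> str:
    """Hypothetical stand-in for an LLM call: answer with the gold answer
    of the demonstration whose question overlaps most with the input."""
    if not demos:
        return "unknown"
    best = max(demos, key=lambda d: len(tokens(d[0]) & tokens(question)))
    return best[1]


def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip().lower() == gold.strip().lower())


def score(program, demos: list[Example], devset: list[Example], metric) -> float:
    """Average metric value of the program over the dev set."""
    return sum(metric(program(q, demos), gold) for q, gold in devset) / len(devset)


def search_demos(program, trainset, devset, metric, k: int = 2):
    """Exhaustively try every k-sized demo subset and keep the best scorer.
    Real optimizers sample candidates instead; exhaustive search only
    works here because the toy training set is tiny."""
    best_demos: list[Example] = []
    best_score = score(program, best_demos, devset, metric)
    for candidate in combinations(trainset, k):
        s = score(program, list(candidate), devset, metric)
        if s > best_score:
            best_demos, best_score = list(candidate), s
    return best_demos, best_score


# Made-up labeled examples standing in for human- or teacher-labeled QA pairs.
trainset = [
    ("What is the VPN port?", "443"),
    ("Who owns billing?", "team-a"),
    ("What is the retention period?", "90 days"),
]
devset = [
    ("Which port does the VPN use?", "443"),
    ("What is the log retention period?", "90 days"),
]

baseline = score(toy_program, [], devset, exact_match)
best_demos, best_score = search_demos(toy_program, trainset, devset, exact_match)
print(baseline, best_score, best_demos)
```

The same `score` function doubles as a regression check: rerunning it with the frozen demos after a document-set update is one cheap way to decide whether re-optimizing is worth it.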

