Does DSPy actually beat hand-tuned prompts for multi-label classification, or does it depend on dataset size?

Question

I've been reading the DSPy papers and the claims about automatic prompt optimization are compelling. But I'm skeptical about the generalizability.

My use case: multi-label classification of support tickets into ~15 categories (billing, auth, feature-request, bug-report, etc.). Dataset is ~8k labeled examples, 70/15/15 split.

I ran a quick comparison:
- Hand-tuned few-shot (5 examples, carefully selected): 82.3% micro-F1 on Claude 3.5 Sonnet
- DSPy BootstrapFewShot with the same evaluator: 79.1% micro-F1
- DSPy with MIPRO optimizer (50 trials): 83.7% micro-F1
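For anyone comparing numbers, here is a minimal sketch of how micro-F1 is usually computed for multi-label predictions: true positives, false positives, and false negatives are pooled across all tickets and labels before F1 is taken. It assumes gold and predicted labels are Python sets per ticket. The function name `micro_f1` and the toy tickets are illustrative, not the evaluator behind the scores above.

```python
from typing import Iterable, Set


def micro_f1(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over every (ticket, label) pair."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical tickets, not from the real dataset)
gold = [{"billing", "auth"}, {"bug-report"}, {"feature-request"}]
pred = [{"billing"}, {"bug-report", "auth"}, {"feature-request"}]
score = micro_f1(gold, pred)
print(f"micro-F1: {score:.3f}")
```

Because counts are pooled, frequent categories dominate the score; rare categories would need macro-F1 or per-label F1 to show up.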

So MIPRO did edge out the hand-tuned version, but only by 1.4 points, and it cost ~4k extra API calls to optimize. The hand-tuned prompt was also more interpretable: I could explain why it made certain decisions.

Questions for anyone who's done this at scale:
1. Does DSPy's advantage grow with dataset size, or does it plateau?
2. Is the optimizer's output stable across model versions, or do you need to re-optimize when the underlying LLM changes?
3. Has anyone tried DSPy for multi-label (not single-label) specifically? The BootstrapFewShot teleprompter seems designed for single-answer tasks.
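To make question 3 concrete, below is a rough sketch of a set-based metric in the `(example, pred, trace=None)` shape that DSPy metrics follow. It gives partial credit for overlapping label sets instead of requiring an exact match. The field name `labels`, the helper `parse_labels`, and the 0.8 threshold are assumptions for illustration, and `SimpleNamespace` stands in for DSPy's example and prediction objects so the snippet runs without the library.

```python
from types import SimpleNamespace


def parse_labels(raw) -> set:
    """Normalize a comma-separated string or a list into a set of lowercase labels."""
    if isinstance(raw, str):
        raw = raw.split(",")
    return {item.strip().lower() for item in raw if item.strip()}


def multilabel_metric(example, pred, trace=None, threshold=0.8):
    """Per-example F1 between gold and predicted label sets.

    Assumes both objects expose a `labels` attribute (hypothetical field name).
    """
    gold = parse_labels(example.labels)
    guess = parse_labels(pred.labels)
    if not gold and not guess:
        f1 = 1.0  # both empty: treat as a perfect match
    else:
        tp = len(gold & guess)
        f1 = 2 * tp / (len(gold) + len(guess))
    # When a trace is passed (as during bootstrapping), return a strict
    # pass/fail so only near-perfect demonstrations are kept.
    if trace is not None:
        return f1 >= threshold
    return f1


# Stand-ins for a DSPy example and prediction (illustrative only)
example = SimpleNamespace(labels=["billing", "auth"])
pred = SimpleNamespace(labels="billing, bug-report")
partial = multilabel_metric(example, pred)
print(partial)
```

Averaging this score over a dev set gives example-based F1, not micro-F1, so the two numbers are not directly comparable. It is still a usable optimization signal, since an exact-match metric would reject every partially correct demonstration.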

Using DSPy 2.5, Claude 3.5 Sonnet, Python 3.11.


Direct answers and proposed approaches

Risks, gaps, and constructive pushback