Research
Open
Asked by milo
Question

Reproducing the 'chain-of-thought distillation' results from the Wei et al. paper — anyone got stable runs?

I'm trying to reproduce the instruction-tuning + CoT distillation pipeline described in the 2022 Wei et al. work (training a smaller model on CoT outputs from a larger one).

Setup: Llama-3-8B as student, generating rationales via a 70B teacher on GSM8K.

Problem: After 3 epochs of SFT on ~50K CoT examples, the student model's accuracy on the held-out set plateaus at ~58%, well below the reported ~72%. Hyperparameters match the paper (lr=2e-5, batch=128, max_len=1024).

Possible issues I've identified:

1. Teacher model temperature: the paper says 'low temp' but doesn't specify a value. I'm using 0.3; maybe it needs to be 0.1?
2. Rationale filtering: the paper mentions discarding 'incorrect rationales' but the threshold isn't clear.
3. Training data overlap: the 50K set may have train/test contamination.

Has anyone run this distillation flow successfully? What teacher temperature and rationale quality threshold did you use?
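For anyone who wants to check the rationale filtering and contamination concerns concretely, here is a minimal sketch, not the paper's method. It assumes a rationale counts as "correct" when the last number it contains equals the GSM8K reference answer (the value after `####` in the reference solution), and it only catches contamination where a training question matches a test question verbatim after normalization. All function names (`last_number`, `gold_answer`, `keep_rationale`, `contaminated`) are hypothetical helpers made up for illustration.

```python
import re

# Matches integers and decimals, allowing thousands separators like 1,234.
_NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = _NUM.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gold_answer(solution):
    """GSM8K reference solutions end with '#### <answer>'; parse that value."""
    return last_number(solution.split("####")[-1])


def keep_rationale(rationale, solution, tol=1e-6):
    """Assumed filter: keep a teacher rationale only if its last number
    matches the reference answer. This is one possible reading of
    'discarding incorrect rationales', not a confirmed threshold."""
    pred = last_number(rationale)
    gold = gold_answer(solution)
    return pred is not None and gold is not None and abs(pred - gold) <= tol


def normalize(question):
    """Lowercase and keep only alphanumeric tokens, so spacing and
    punctuation differences do not hide a duplicate."""
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))


def contaminated(train_questions, test_questions):
    """Return training questions that also appear in the test set
    after normalization (exact-match check only)."""
    test_set = {normalize(q) for q in test_questions}
    return [q for q in train_questions if normalize(q) in test_set]
```

Running `contaminated(train_qs, test_qs)` before SFT gives a lower bound on overlap; paraphrased duplicates would need a fuzzier check such as n-gram overlap. Comparing the kept fraction from `keep_rationale` at temperature 0.3 versus 0.1 is one way to see whether teacher temperature changes rationale quality much.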

