Reproducing the 'chain-of-thought distillation' results from the Wei et al. paper — anyone got stable runs?

Question

I'm trying to reproduce the instruction-tuning + CoT distillation pipeline described in the 2022 Wei et al. work (training a smaller model on CoT outputs from a larger one). Setup: Llama-3-8B as the student, with rationales generated by a 70B teacher on GSM8K.

Problem: after 3 epochs of SFT on ~50K CoT examples, the student's accuracy on the held-out set plateaus at ~58%, about 14 points below the reported ~72%. Hyperparameters match the paper (lr=2e-5, batch=128, max_len=1024).

Possible issues I've identified:
1. Teacher model temperature: the paper says 'low temp' but doesn't give a value. I'm using 0.3; maybe it needs to be 0.1?
2. Rationale filtering: the paper mentions discarding 'incorrect rationales' but doesn't say how correctness is judged or what threshold applies.
3. Training data overlap: the 50K set may be contaminated with problems from the held-out test set.
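
To make item 2 concrete, here is a minimal sketch of one common reading of 'incorrect rationales': keep a teacher rationale only if its final number matches the GSM8K gold answer. This is an assumption about what the filter could be, not the paper's stated method. The function names are hypothetical, and it assumes the teacher ends each rationale with the final numeric answer (GSM8K reference solutions end with `#### <number>`).

```python
import re

# Matches integers and decimals, with optional sign and thousands separators.
_NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number appearing in text as a float, or None."""
    matches = _NUM.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def gold_answer(gsm8k_answer):
    """GSM8K reference solutions end with '#### <number>'."""
    return extract_final_number(gsm8k_answer.split("####")[-1])

def keep_rationale(rationale, gsm8k_answer, tol=1e-6):
    """Assumed filter: keep only rationales whose final number equals the gold answer."""
    pred = extract_final_number(rationale)
    gold = gold_answer(gsm8k_answer)
    return pred is not None and gold is not None and abs(pred - gold) <= tol
```

Note that this only checks the final answer. A rationale can reach the right number through wrong intermediate steps and still pass, so a stricter filter would need a separate check on the reasoning itself.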

Has anyone run this distillation flow successfully? What teacher temperature and rationale quality threshold did you use?
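
For item 3 in the list above, a rough sketch of an overlap check between training questions and held-out questions, using shared word n-grams after normalization. The function names and the choice of n=8 are assumptions for illustration, not something taken from the paper.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())

def ngrams(text, n=8):
    """Set of word n-grams; empty if the text has fewer than n words."""
    toks = normalize(text).split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def find_contaminated(train_questions, test_questions, n=8):
    """Return indices of training questions sharing any n-gram with a test question."""
    test_grams = set()
    for q in test_questions:
        test_grams |= ngrams(q, n)
    return [i for i, q in enumerate(train_questions) if ngrams(q, n) & test_grams]
```

Questions shorter than n words are never flagged by this check, and paraphrased duplicates will slip through, so treat an empty result as weak evidence rather than proof of a clean split.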

Direct answers and proposed approaches

Risks, gaps, and constructive pushback