Reproducing the 'chain-of-thought distillation' results from the Wei et al. paper — anyone got stable runs?
Trying to reproduce the instruction-tuning + CoT distillation pipeline described in the 2022 Wei et al. work (training a smaller model on CoT outputs from a larger one).

Setup: Llama-3-8B as student, generating rationales via a 70B teacher on GSM8K.

Problem: After 3 epochs of SFT on ~50K CoT examples, the student model's accuracy on the held-out set plateaus at ~58%, well below the reported ~72%. Hyperparameters match the paper (lr=2e-5, batch=128, max_len=1024).

Possible issues I've identified:

1. Teacher model temperature: paper says 'low temp' but doesn't specify. Using 0.3, maybe need 0.1?
2. Rationale filtering: paper mentions discarding 'incorrect rationales' but the threshold isn't clear.
3. Training data overlap: the 50K set may have train/test contamination.

Has anyone run this distillation flow successfully? What teacher temperature and rationale quality threshold did you use?
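For issues 2 and 3, here is a minimal sketch of the checks I have in mind, in case it helps compare setups. It is not the paper's filtering method, only one plausible reading of it: a rationale counts as "incorrect" when its last number differs from the gold answer, and a training question counts as contaminated when it shares any word 8-gram with a held-out question. The field names (`question`, `rationale`, `gold`), the helper names, and the n-gram size of 8 are all my own assumptions; the only thing taken from the dataset is that GSM8K gold solutions end with `#### <number>`.

```python
import re
from fractions import Fraction

NUMBER = re.compile(r"(?<!\d)-?\d[\d,]*(?:\.\d+)?")


def extract_final_answer(text):
    """Return the final number in `text` as a Fraction, or None if there is none.

    If the text contains the GSM8K '####' marker, only the part after the last
    marker is searched; otherwise the last number anywhere in the text is used.
    """
    parts = text.rsplit("####", 1)
    tail = parts[1] if len(parts) == 2 else text
    numbers = NUMBER.findall(tail)
    if not numbers:
        return None
    # Strip thousands separators such as '1,000' before parsing.
    return Fraction(numbers[-1].replace(",", ""))


def normalize(question):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", question.lower()).split())


def ngrams(text, n):
    """Return the set of word n-grams; short texts yield one gram (the whole text)."""
    tokens = text.split()
    if not tokens:
        return set()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def filter_examples(examples, test_questions, n=8):
    """Drop examples with a wrong final answer or n-gram overlap with the test set.

    `examples` is a list of dicts with hypothetical keys 'question',
    'rationale' (teacher output) and 'gold' (reference solution).
    Returns the kept examples and a count of drops per reason.
    """
    test_grams = set()
    for q in test_questions:
        test_grams |= ngrams(normalize(q), n)

    kept = []
    stats = {"wrong_answer": 0, "contaminated": 0}
    for ex in examples:
        predicted = extract_final_answer(ex["rationale"])
        gold = extract_final_answer(ex["gold"])
        if predicted is None or predicted != gold:
            stats["wrong_answer"] += 1
            continue
        if ngrams(normalize(ex["question"]), n) & test_grams:
            stats["contaminated"] += 1
            continue
        kept.append(ex)
    return kept, stats
```

Running `kept, stats = filter_examples(train_examples, test_questions)` and looking at `stats` would at least show how much of the 50K set each filter removes. Answer matching is a weak proxy for rationale quality, since a rationale can reach the right number through wrong steps, so a stricter check may still be needed.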