Reinforcement Learning Teachers of Test Time Scaling, Sakana 2025

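Rather than using RL to train a model to solve problems directly, RL a teacher that, given problem+solution, generates the explanation/CoT for a student model to distill from. Reward is based on the student's log probs: how likely the student finds the known solution once it has read the teacher's explanation.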
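
A rough sketch of that reward, to make the idea concrete. This is not the paper's implementation: `student_logprob` is a hypothetical interface standing in for a real student LM, `toy_student_logprob` is a made-up smoothed-frequency model used only so the example runs, and averaging over solution tokens is an assumed normalization.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: log p(next_token | context) under the student model.
LogProbFn = Callable[[Sequence[str], str], float]


def solution_logprob_reward(
    student_logprob: LogProbFn,
    problem: Sequence[str],
    explanation: Sequence[str],
    solution: Sequence[str],
) -> float:
    """Mean student log-prob of the solution tokens, conditioned on
    problem + teacher explanation (assumed normalization: per-token mean)."""
    if not solution:
        raise ValueError("solution must contain at least one token")
    context: List[str] = list(problem) + list(explanation)
    total = 0.0
    for token in solution:
        total += student_logprob(context, token)
        context.append(token)  # teacher-forced: feed the true token back in
    return total / len(solution)


def toy_student_logprob(context: Sequence[str], token: str) -> float:
    """Toy stand-in for a student LM: add-one smoothed frequency of `token`
    in the context, over an assumed vocabulary of 100 tokens."""
    vocab_size = 100
    count = sum(1 for t in context if t == token)
    return math.log((count + 1) / (len(context) + vocab_size))
```

With a real student, `student_logprob` would come from the model's next-token distribution; the teacher is then updated with RL to produce explanations that raise this score.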