I'll explain the core technique of this paper in concrete terms.
The Core Technique: Temporal Sampling
The Problem They Discovered
The researchers found that during fine-tuning of large language models, something counterintuitive happens:
- Models "forget" correct answers: A model might correctly solve a problem at checkpoint 3 during training, but by the final checkpoint 8, it gets that same problem wrong.
- This happens a lot: between 6.4% and 56.1% of the problems the final model gets wrong were solved correctly at some earlier checkpoint during training.
The Key Insight
Instead of using only the final trained model (the standard practice), why not draw samples from multiple checkpoints saved during training? Each checkpoint has different strengths: some problems that checkpoint 8 fails on might be solved reliably by checkpoint 3 or 5.
How Temporal Sampling Works
Here's the concrete implementation:
Traditional approach:
- You have a math problem
- You generate 64 different attempts at solving it, all from the final model checkpoint
- You pick the best answer (by majority vote or other methods)
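The traditional best-of-N pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `sample_from_final_model` is a hypothetical stand-in for decoding one solution from the final checkpoint, and the answers are simulated.

```python
from collections import Counter
import random

def majority_vote(answers):
    """Pick the most common final answer among the sampled attempts."""
    return Counter(answers).most_common(1)[0][0]

def sample_from_final_model(problem, rng):
    # Hypothetical stand-in for the LLM: simulates a model that answers
    # "42" correctly 70% of the time and "41" otherwise.
    return "42" if rng.random() < 0.7 else "41"

rng = random.Random(0)
attempts = [sample_from_final_model("some math problem", rng) for _ in range(64)]
print(majority_vote(attempts))  # the most frequent answer wins
```

All 64 attempts come from the same (final) checkpoint, so if that checkpoint has "forgotten" a problem, no amount of extra sampling from it will help.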
Temporal Sampling approach:
- You have the same math problem
- You saved 8 checkpoints during training (checkpoint 1, 2, 3... 8)
- Instead of generating all 64 attempts from checkpoint 8, you distribute them round-robin:
  - 8 attempts from checkpoint 1
  - 8 attempts from checkpoint 2
  - ... and so on, up to
  - 8 attempts from checkpoint 8
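The round-robin allocation above can be sketched as follows. This is a toy model, assuming details not in the paper: each checkpoint is represented as a hypothetical function mapping a problem to an answer, made deterministic here so the pooling behavior is easy to see.

```python
from collections import Counter

def temporal_sampling(problem, checkpoints, total_budget=64):
    """Spread a fixed sampling budget round-robin across checkpoints,
    then pool every attempt into a single majority vote."""
    attempts = []
    for i in range(total_budget):
        ckpt = checkpoints[i % len(checkpoints)]  # cycle through checkpoints
        attempts.append(ckpt(problem))
    return Counter(attempts).most_common(1)[0][0]

# Hypothetical stand-in checkpoints: the mid-training checkpoints still
# answer "42", while early and late checkpoints have drifted to "41".
checkpoints = [
    (lambda p, a=ans: a)
    for ans in ["41", "42", "42", "42", "42", "42", "41", "41"]
]
print(temporal_sampling("some math problem", checkpoints))  # prints "42"
```

With 64 total attempts and 8 checkpoints, each checkpoint contributes 8 attempts, so the correct answer held by the mid-training checkpoints (40 votes) outvotes the final checkpoint's mistake (24 votes), even though checkpoint 8 alone would get the problem wrong.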