
I'll explain the core technique of this paper in concrete terms.

The Core Technique: Temporal Sampling

The Problem They Discovered

The researchers found that during training (fine-tuning) of large language models, something counterintuitive happens:

  1. Models "forget" correct answers: A model might correctly solve a problem at checkpoint 3 during training, but by the final checkpoint 8, it gets that same problem wrong.
  2. This happens A LOT: Between 6.4% and 56.1% of the problems the final model gets wrong were actually solved correctly at some point during training.
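The measurement behind that second point can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper: assume `correct[c][p]` records whether checkpoint `c` solved problem `p`, and compute what fraction of the final checkpoint's failures were solved by some earlier checkpoint.

```python
def forgetting_rate(correct):
    """Fraction of problems the FINAL checkpoint gets wrong that some
    earlier checkpoint had solved correctly (the "forgotten" problems)."""
    final = correct[-1]
    wrong_at_final = [p for p, ok in enumerate(final) if not ok]
    if not wrong_at_final:
        return 0.0
    forgotten = sum(
        1 for p in wrong_at_final
        if any(ckpt[p] for ckpt in correct[:-1])
    )
    return forgotten / len(wrong_at_final)

# Toy run: 3 checkpoints, 4 problems.
history = [
    [True,  False, True,  False],  # checkpoint 1
    [True,  True,  False, False],  # checkpoint 2
    [True,  False, False, False],  # final checkpoint
]
print(round(forgetting_rate(history), 3))  # → 0.667
```

In the toy run, the final checkpoint misses problems 1, 2, and 3; problems 1 and 2 were each solved by an earlier checkpoint, so 2 of the 3 failures count as forgotten.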

The Key Insight

Instead of using only the final trained model (which is what everyone normally does), why not use multiple checkpoints from the training process? Each checkpoint has different strengths: some problems that checkpoint 8 fails on might be solved correctly by checkpoint 3 or 5.

How Temporal Sampling Works

Here's the concrete implementation:

Traditional approach: draw the entire inference budget (say, k samples) from the final checkpoint only, then aggregate the samples (for example, by majority vote).

Temporal Sampling approach: spread the same k samples across several checkpoints from the training run, so that problems the final checkpoint "forgot" can still be recovered from an earlier checkpoint that solved them.
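A minimal sketch of that sampling loop, under stated assumptions: checkpoints are stand-in callables mapping a problem to an answer, and the budget is allocated round-robin from the newest checkpoint backwards (one plausible allocation scheme; the names here are illustrative, not from the paper).

```python
from itertools import cycle

def temporal_sampling(checkpoints, problem, budget):
    """Distribute a fixed sampling budget across training checkpoints
    round-robin, newest first, instead of drawing every sample from
    the final checkpoint alone."""
    answers = []
    for ckpt in cycle(reversed(checkpoints)):
        if len(answers) >= budget:
            break
        answers.append(ckpt(problem))
    return answers

# Mock checkpoints: each is a callable problem -> answer (stand-ins for models).
ckpt3 = lambda p: "42"  # earlier checkpoint happens to solve this problem
ckpt8 = lambda p: "41"  # final checkpoint gets it wrong
samples = temporal_sampling([ckpt3, ckpt8], "problem-1", budget=4)
print(samples)  # → ['41', '42', '41', '42']
```

With the budget split this way, a majority vote or best-of-n selection over `samples` can still surface the answer that only the earlier checkpoint produces.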