I'll explain the core technique of this paper in concrete terms.
The Core Technique: Temporal Sampling
The Problem They Discovered
The researchers found that during fine-tuning of large language models, something counterintuitive happens:
- Models "forget" correct answers: A model might correctly solve a problem at checkpoint 3 during training, but by the final checkpoint 8, it gets that same problem wrong.
- This happens a lot: between 6.4% and 56.1% of the problems the final model gets wrong were solved correctly at some earlier checkpoint during training.
The Key Insight
Instead of using only the final trained model (the standard practice), why not draw samples from multiple checkpoints saved during training? Each checkpoint has different strengths: some problems that checkpoint 8 fails on might be solved reliably by checkpoint 3 or 5.
How Temporal Sampling Works
Here's the concrete implementation:
Traditional approach:
- You have a math problem
- You generate 64 different attempts at solving it, all from the final model checkpoint
- You pick the best answer (by majority vote or other methods)
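The traditional best-of-N pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `sample_from_final_model` is a hypothetical stand-in for decoding one solution from the final checkpoint, and the answers are simulated.

```python
from collections import Counter
import random

def majority_vote(answers):
    """Pick the most common final answer among the sampled attempts."""
    return Counter(answers).most_common(1)[0][0]

def sample_from_final_model(problem, rng):
    # Hypothetical stand-in for the LLM: simulates a model that answers
    # "42" correctly 70% of the time and "41" otherwise.
    return "42" if rng.random() < 0.7 else "41"

rng = random.Random(0)
attempts = [sample_from_final_model("some math problem", rng) for _ in range(64)]
print(majority_vote(attempts))  # the most frequent answer wins
```

All 64 attempts come from the same (final) checkpoint, so if that checkpoint has "forgotten" a problem, no amount of extra sampling from it will help.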
Temporal Sampling approach:
- You have the same math problem
- You saved 8 checkpoints during training (checkpoint 1, 2, 3... 8)
- Instead of generating all 64 attempts from checkpoint 8, you distribute them round-robin:
  - 8 attempts from checkpoint 1
  - 8 attempts from checkpoint 2
  - ... and so on, up to
  - 8 attempts from checkpoint 8
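The round-robin allocation above can be sketched as follows. This is a toy model, assuming details not in the paper: each checkpoint is represented as a hypothetical function mapping a problem to an answer, made deterministic here so the pooling behavior is easy to see.

```python
from collections import Counter

def temporal_sampling(problem, checkpoints, total_budget=64):
    """Spread a fixed sampling budget round-robin across checkpoints,
    then pool every attempt into a single majority vote."""
    attempts = []
    for i in range(total_budget):
        ckpt = checkpoints[i % len(checkpoints)]  # cycle through checkpoints
        attempts.append(ckpt(problem))
    return Counter(attempts).most_common(1)[0][0]

# Hypothetical stand-in checkpoints: the mid-training checkpoints still
# answer "42", while early and late checkpoints have drifted to "41".
checkpoints = [
    (lambda p, a=ans: a)
    for ans in ["41", "42", "42", "42", "42", "42", "41", "41"]
]
print(temporal_sampling("some math problem", checkpoints))  # prints "42"
```

With 64 total attempts and 8 checkpoints, each checkpoint contributes 8 attempts, so the correct answer held by the mid-training checkpoints (40 votes) outvotes the final checkpoint's mistake (24 votes), even though checkpoint 8 alone would get the problem wrong.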