The Core Problem

Standard LLM training is a two-stage pipeline: first, pretrain on next-token prediction (NTP) over trillions of tokens, then bolt on reasoning ability via RL post-training (RLHF/RLVR) on a small curated dataset. The authors ask: why wait until the end? Can we teach the model to think during pretraining itself?

The Mechanism (what actually happens during training)

Imagine a training document contains: "Photosynthesis is the process plants use to make food using ___" where the next token is "sunlight."

Standard NTP: Model sees the prefix → predicts next token → cross-entropy loss. Done.
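As a point of comparison, here is a minimal sketch of the standard NTP loss on the toy example above. The vocabulary, probabilities, and function name are illustrative, not from the paper:

```python
import math

def ntp_loss(probs, target):
    """Standard next-token prediction: cross-entropy on the true next token.

    `probs` is a hypothetical model's distribution over the vocabulary
    given the prefix; `target` is the index of the true next token.
    """
    return -math.log(probs[target])

# Toy distribution over candidate next tokens for "... make food using ___"
vocab = {"sunlight": 0, "water": 1, "soil": 2}
probs = [0.7, 0.2, 0.1]

loss = ntp_loss(probs, vocab["sunlight"])  # -log(0.7) ≈ 0.357
```

The loss rewards the model only for the probability it assigns to the observed token; there is no intermediate reasoning step.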

RLP: The model does something extra at a randomly chosen position:

  1. Sample a thought. Given the prefix, the model generates a short chain-of-thought like "The sentence is about how plants make food. This requires energy from the sun. So the next token is probably 'sunlight.'"
  2. Compute two log-probabilities for the true next token: $S_{\text{pred}} = \log p_\theta(x_t \mid \text{prefix}, c_t)$, the model's prediction conditioned on the sampled thought, and $S_{\text{EMA}} = \log p_\phi(x_t \mid \text{prefix})$, the prediction of a slowly updated EMA copy of the model with no thought.
  3. The reward is the difference: $$r(c_t) = S_{\text{pred}} - S_{\text{EMA}}$$ This is literally: "how much did the thought help you predict the actual next token?" Positive → the thought was useful. Negative → it hurt.
  4. Update only the thought tokens using a GRPO-style clipped surrogate. They sample G=16 thoughts per position, compute group-relative advantages (subtract the group mean, rescale by the group standard deviation), and backprop only through the thought generation, not through the next-token prediction itself.
  5. Update the EMA baseline slowly: $\phi \leftarrow 0.999\phi + 0.001\theta$.
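The reward and advantage computation in steps 2-5 can be sketched with toy numbers. This is a simplified illustration, not the paper's implementation: it uses G=4 thoughts instead of 16, scalar log-probabilities instead of model forward passes, and an elementwise EMA update:

```python
import math

def rlp_rewards(logp_with_thought, logp_ema):
    """Per-thought reward r(c_t) = S_pred - S_EMA: how much each sampled
    thought improved the log-probability of the true next token over the
    no-thought EMA baseline."""
    return [s_pred - logp_ema for s_pred in logp_with_thought]

def group_relative_advantages(rewards):
    """GRPO-style normalization: subtract the group mean and rescale by
    the group standard deviation. Only the thought tokens are updated
    with these advantages."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) if var > 0 else 1.0
    return [(r - mean) / std for r in rewards]

def ema_update(phi, theta, decay=0.999):
    """Slow EMA update of the baseline: phi <- 0.999*phi + 0.001*theta."""
    return [decay * p + (1 - decay) * t for p, t in zip(phi, theta)]

# Toy group of G=4 thoughts at one position (paper uses G=16).
# log p(true token | prefix, thought) for each sampled thought:
logp_with_thought = [-0.3, -0.9, -1.2, -0.5]
logp_ema = -0.8  # log p_EMA(true token | prefix), no thought

rewards = rlp_rewards(logp_with_thought, logp_ema)  # [0.5, -0.1, -0.4, 0.3]
advs = group_relative_advantages(rewards)
```

Note that a thought whose prediction is worse than the baseline (e.g. the third one here) gets a negative reward, so the update actively discourages unhelpful thoughts rather than merely failing to reinforce them.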

Why This is Clever

Practical Details

Results (the headline numbers)