The Core Problem
Standard LLM training is a two-stage pipeline: first, pretrain on next-token prediction (NTP) over trillions of tokens, then bolt on reasoning ability via RL post-training (RLHF/RLVR) on a small curated dataset. The authors ask: why wait until the end? Can we teach the model to think during pretraining itself?
The Mechanism (what actually happens during training)
Imagine a training document contains: "Photosynthesis is the process plants use to make food using ___" where the next token is "sunlight."
Standard NTP: Model sees the prefix → predicts next token → cross-entropy loss. Done.
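That per-position loss is just the negative log-probability of the true next token. A minimal sketch with a toy next-token distribution (the probabilities are made up for illustration, not from any real model):

```python
import math

def ntp_loss(next_token_probs: dict, true_token: str) -> float:
    """Cross-entropy at one position: -log p(true next token | prefix)."""
    return -math.log(next_token_probs[true_token])

# Toy distribution over candidate next tokens for the photosynthesis prefix.
probs = {"sunlight": 0.6, "water": 0.3, "soil": 0.1}
loss = ntp_loss(probs, "sunlight")  # -log(0.6) ≈ 0.511
```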
RLP: The model does something extra at a randomly chosen position:
- Sample a thought. Given the prefix, the model generates a short chain-of-thought like "The sentence is about how plants make food. This requires energy from the sun. So the next token is probably 'sunlight.'"
- Compute two log-probabilities for the true next token:
- $S_{\text{pred}}$ = log-probability of "sunlight" given (prefix + thought) — the reasoned scorer
- $S_{\text{EMA}}$ = log-probability of "sunlight" given (prefix alone) — the no-think baseline, computed by a slowly updating EMA copy of the model
- The reward is the difference:
$$r(c_t) = S_{\text{pred}} - S_{\text{EMA}}$$
This is literally: "how much did the thought help you predict the actual next token?" Positive → the thought was useful. Negative → it hurt.
- Update only the thought tokens using a GRPO-style clipped surrogate. They sample G=16 thoughts per position, compute group-relative advantages (subtract the group mean, normalize by the group standard deviation), and backprop through the thought generation, not through the prediction heads.
- Update the EMA baseline slowly: $\phi \leftarrow 0.999\phi + 0.001\theta$.
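The steps above can be sketched numerically. Everything here is a toy: the log-probabilities are invented, the helper names (`rlp_rewards`, `group_advantages`, `ema_update`) are mine, and the actual clipped-surrogate backprop through the thought tokens is not shown:

```python
def rlp_rewards(logp_with_thought, logp_ema_no_think):
    """r(c_t) = S_pred - S_EMA for each sampled thought at one position."""
    return [s - logp_ema_no_think for s in logp_with_thought]

def group_advantages(rewards):
    """GRPO-style group-relative advantages: subtract the group mean,
    normalize by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def ema_update(phi, theta, decay=0.999):
    """Slow baseline update: phi <- decay*phi + (1-decay)*theta."""
    return [decay * p + (1 - decay) * t for p, t in zip(phi, theta)]

# Toy numbers: log p("sunlight" | prefix + thought_i) for G = 4 sampled
# thoughts (the paper uses G = 16), vs. the EMA model's no-think baseline.
S_pred = [-0.4, -0.7, -1.2, -0.5]
S_ema = -1.0
rewards = rlp_rewards(S_pred, S_ema)   # positive where the thought helped
advs = group_advantages(rewards)       # centered within the group
```

Thought 3 scores below the no-think baseline, so it gets a negative reward and a negative advantage; the others are rewarded for beating it.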
Why This is Clever
- Verifier-free. No answer checkers, no human labels, no curated datasets. The reward is just a log-likelihood ratio computed from the model itself. Works on raw web text.
- Dense reward. Every position gives a scalar signal, unlike RPT (prior work), which gives only a sparse binary reward for predicting the exact next token.
- The EMA is the trick that prevents hacking. With a frozen baseline, the reward would inflate as the student improves, whether or not the thoughts actually helped. With the current model as its own baseline, the reward would collapse to zero. The 0.999-lagged EMA stays just behind the student, so the model must keep finding thoughts that genuinely improve prediction.
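A quick way to see why the 0.999 decay keeps the baseline "just behind": with decay d, the EMA closes a gap with time constant roughly 1/(1-d) = 1000 steps. A scalar illustration (toy numbers, not from the paper):

```python
# If the student parameter jumps from 0.0 to 1.0 and then stays there,
# the EMA baseline approaches it only geometrically: after n steps,
# phi = 1 - 0.999**n, so the gap shrinks by a factor of e every ~1000 steps.
decay = 0.999
phi, theta = 0.0, 1.0
for step in range(1000):
    phi = decay * phi + (1 - decay) * theta
# After 1000 steps, phi ≈ 1 - 0.999**1000 ≈ 0.63: still noticeably behind.
```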
Practical Details
- They apply RLP to one random token per document, not every token (otherwise compute would explode).
- Hyperparameters: 16 rollouts, thought length 2048, no KL penalty.
- Wall-clock: ~2.25× slower per step than SFT, but far more data-efficient.
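These details can be collected into an illustrative config object. The field names are mine, not from the paper's code; the values are the ones stated above:

```python
from dataclasses import dataclass

@dataclass
class RLPConfig:
    """Hypothetical hyperparameter bundle mirroring the summary above."""
    group_size: int = 16          # G rollouts (thoughts) per chosen position
    max_thought_tokens: int = 2048
    ema_decay: float = 0.999      # phi <- 0.999*phi + 0.001*theta
    kl_coef: float = 0.0          # no KL penalty
    positions_per_doc: int = 1    # one random token per document

cfg = RLPConfig()
```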
Results (the headline numbers)