The Core Problem
Standard LLM training is a two-stage pipeline: first, pretrain on next-token prediction (NTP) over trillions of tokens, then bolt on reasoning ability via RL post-training (RLHF/RLVR) on a small curated dataset. The authors ask: why wait until the end? Can we teach the model to think during pretraining itself?
The Mechanism (what actually happens during training)
Imagine a training document contains: "Photosynthesis is the process plants use to make food using ___" where the next token is "sunlight."
Standard NTP: Model sees the prefix → predicts next token → cross-entropy loss. Done.
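That per-position loss is just the negative log-probability of the true next token. A minimal sketch with a toy next-token distribution (the probabilities are made up for illustration, not from any real model):

```python
import math

def ntp_loss(next_token_probs: dict, true_token: str) -> float:
    """Cross-entropy at one position: -log p(true next token | prefix)."""
    return -math.log(next_token_probs[true_token])

# Toy distribution over candidate next tokens for the photosynthesis prefix.
probs = {"sunlight": 0.6, "water": 0.3, "soil": 0.1}
loss = ntp_loss(probs, "sunlight")  # -log(0.6) ≈ 0.511
```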
RLP: The model does something extra at a randomly chosen position:
- Sample a thought. Given the prefix, the model generates a short chain-of-thought like "The sentence is about how plants make food. This requires energy from the sun. So the next token is probably 'sunlight.'"
- Compute two log-probabilities for the true next token:
- $S_{\text{pred}}$ = log-probability of "sunlight" given (prefix + thought) — the reasoned scorer
- $S_{\text{EMA}}$ = log-probability of "sunlight" given (prefix alone) — the no-think baseline, computed by a slowly updating EMA copy of the model
- The reward is the difference:
$$r(c_t) = S_{\text{pred}} - S_{\text{EMA}}$$
This is literally: "how much did the thought help you predict the actual next token?" Positive → the thought was useful. Negative → it hurt.
- Update only the thought tokens using a GRPO-style clipped surrogate. They sample G=16 thoughts per position, compute group-relative advantages (subtract the group mean, normalize by the group standard deviation), and backprop through the thought generation, not through the prediction heads.
- Update the EMA baseline slowly: $\phi \leftarrow 0.999\phi + 0.001\theta$.
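The steps above can be sketched numerically. Everything here is a toy: the log-probabilities are invented, the helper names (`rlp_rewards`, `group_advantages`, `ema_update`) are mine, and the actual clipped-surrogate backprop through the thought tokens is not shown:

```python
def rlp_rewards(logp_with_thought, logp_ema_no_think):
    """r(c_t) = S_pred - S_EMA for each sampled thought at one position."""
    return [s - logp_ema_no_think for s in logp_with_thought]

def group_advantages(rewards):
    """GRPO-style group-relative advantages: subtract the group mean,
    normalize by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def ema_update(phi, theta, decay=0.999):
    """Slow baseline update: phi <- decay*phi + (1-decay)*theta."""
    return [decay * p + (1 - decay) * t for p, t in zip(phi, theta)]

# Toy numbers: log p("sunlight" | prefix + thought_i) for G = 4 sampled
# thoughts (the paper uses G = 16), vs. the EMA model's no-think baseline.
S_pred = [-0.4, -0.7, -1.2, -0.5]
S_ema = -1.0
rewards = rlp_rewards(S_pred, S_ema)   # positive where the thought helped
advs = group_advantages(rewards)       # centered within the group
```

Thought 3 scores below the no-think baseline, so it gets a negative reward and a negative advantage; the others are rewarded for beating it.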
Why This is Clever
- Verifier-free. No answer checkers, no human labels, no curated datasets. The reward is just a log-likelihood ratio computed from the model itself. Works on raw web text.
- Dense reward. Every position gives a scalar signal, unlike RPT (prior work), which gives only a sparse binary reward for predicting the exact next token.
- The EMA is the trick that prevents hacking. With a frozen baseline, the reward would inflate as the student improves, whether or not the thoughts actually helped. With the current model as its own baseline, the reward would collapse to zero. The 0.999-lagged EMA stays just behind the student, so the model must keep finding thoughts that genuinely improve prediction.
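A quick way to see why the 0.999 decay keeps the baseline "just behind": with decay d, the EMA closes a gap with time constant roughly 1/(1-d) = 1000 steps. A scalar illustration (toy numbers, not from the paper):

```python
# If the student parameter jumps from 0.0 to 1.0 and then stays there,
# the EMA baseline approaches it only geometrically: after n steps,
# phi = 1 - 0.999**n, so the gap shrinks by a factor of e every ~1000 steps.
decay = 0.999
phi, theta = 0.0, 1.0
for step in range(1000):
    phi = decay * phi + (1 - decay) * theta
# After 1000 steps, phi ≈ 1 - 0.999**1000 ≈ 0.63: still noticeably behind.
```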
Practical Details
- They apply RLP to one random token per document, not every token (otherwise compute would explode).
- Hyperparameters: 16 rollouts, thought length 2048, no KL penalty.
- Wall-clock: ~2.25× slower per step than SFT, but far more data-efficient.
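These details can be collected into an illustrative config object. The field names are mine, not from the paper's code; the values are the ones stated above:

```python
from dataclasses import dataclass

@dataclass
class RLPConfig:
    """Hypothetical hyperparameter bundle mirroring the summary above."""
    group_size: int = 16          # G rollouts (thoughts) per chosen position
    max_thought_tokens: int = 2048
    ema_decay: float = 0.999      # phi <- 0.999*phi + 0.001*theta
    kl_coef: float = 0.0          # no KL penalty
    positions_per_doc: int = 1    # one random token per document

cfg = RLPConfig()
```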
Results (the headline numbers)