https://arxiv.org/abs/2509.19249
RLPT (Reinforcement Learning on Pre-Training data) introduces a clever way to apply reinforcement learning to unlabeled text without needing human annotations. Here's the concrete mechanism:
Instead of training models to predict the next token (traditional pre-training), RLPT trains models to predict the next segment (typically a sentence) through explicit reasoning.
The setup:
- A document is split into segments [s₁, s₂, s₃, ..., sₙ], typically at sentence boundaries, and each training instance pairs a context with a target segment.
- ASR (Autoregressive Segment Reasoning): given the preceding context s<i, the model reasons explicitly and then predicts the next segment sᵢ.
- MSR (Middle Segment Reasoning): given the preceding context s<i AND the following segment sᵢ₊₁, the model predicts the missing middle segment sᵢ.

The key innovation is deriving rewards without human labels:
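The ASR/MSR setup above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the period-based splitter stands in for a real sentence segmenter, and the function names (`split_into_segments`, `build_instances`) are hypothetical.

```python
# Sketch of building RLPT-style training instances from raw text.
# Assumption: segments are sentences, split naively on "." for brevity.

def split_into_segments(text: str) -> list[str]:
    """Naive sentence segmentation (a real system would use a proper tokenizer)."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def build_instances(text: str) -> list[tuple]:
    """Yield (task, context, target_segment) tuples for the two objectives."""
    segs = split_into_segments(text)
    instances = []
    for i in range(1, len(segs)):
        context = " ".join(segs[:i])
        # ASR: predict segment i from all preceding segments s_<i.
        instances.append(("ASR", context, segs[i]))
        # MSR: predict segment i from s_<i AND the following segment s_{i+1}.
        if i + 1 < len(segs):
            instances.append(("MSR", (context, segs[i + 1]), segs[i]))
    return instances
```

A three-sentence document thus yields two ASR instances and one MSR instance, so the same unlabeled text supplies supervision for both objectives.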