https://arxiv.org/abs/2509.19249

Core Technique of RLPT

RLPT (Reinforcement Learning on Pre-Training data) introduces a clever way to apply reinforcement learning to unlabeled text without needing human annotations. Here's the concrete mechanism:

The Central Idea: Next-Segment Reasoning

Instead of training models to predict the next token (traditional pre-training), RLPT trains models to predict the next segment (typically a sentence) through explicit reasoning.

The setup:

  1. Take any text and split it into segments: [s₁, s₂, s₃, ..., sₙ]
  2. Create training examples of the form: (context, target_segment, next_segment)
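The two steps above can be sketched in a few lines. The segmentation rule, the context window size `k`, and the field names are illustrative assumptions, not the paper's exact preprocessing:

```python
import re

def split_segments(text):
    # Naive sentence segmentation; RLPT segments are typically sentence-level.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def make_examples(text, k=2):
    """Build (context, target, following) examples from raw text.

    `k` (a hypothetical choice) is how many preceding segments form the context;
    `following` is the segment after the target, available for MSR-style tasks.
    """
    segments = split_segments(text)
    examples = []
    for i in range(1, len(segments)):
        examples.append({
            "context": " ".join(segments[max(0, i - k):i]),
            "target": segments[i],
            "following": segments[i + 1] if i + 1 < len(segments) else None,
        })
    return examples

exs = make_examples("RL is useful. It needs rewards. Rewards can be self-supervised.")
# Each example pairs a context with the ground-truth segment to predict.
```

Because the targets come straight from the corpus, every example carries its own supervision signal with no annotation step.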

Two Complementary Tasks

ASR (Autoregressive Segment Reasoning): given the preceding context, the model reasons explicitly and then predicts the next segment.

MSR (Middle Segment Reasoning): given the preceding context and the segment that follows, the model reconstructs the masked middle segment.
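Both tasks can be driven from the same segment split; only the prompt differs. The templates below are hypothetical illustrations of that difference (the paper's actual prompt wording may differ):

```python
# Hypothetical prompt templates: ASR sees only the preceding context,
# while MSR additionally sees the segment that follows the masked one.
ASR_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Reason step by step, then predict the next sentence."
)
MSR_TEMPLATE = (
    "Context:\n{context}\n\n"
    "The sentence after the missing one is:\n{following}\n\n"
    "Reason step by step, then reconstruct the missing sentence."
)

def build_prompt(task, context, following=None):
    if task == "ASR":
        return ASR_TEMPLATE.format(context=context)
    if task == "MSR":
        return MSR_TEMPLATE.format(context=context, following=following)
    raise ValueError(f"unknown task: {task}")

asr_prompt = build_prompt("ASR", "RL is useful.")
msr_prompt = build_prompt("MSR", "RL is useful.",
                          following="Rewards can be self-supervised.")
```

MSR gives the model bidirectional evidence (fill-in-the-middle style), which complements the purely left-to-right ASR objective.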

The Reward Mechanism

The key innovation is deriving rewards without human labels:
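Since the ground-truth segment is already in the corpus, a reward can be computed by comparing the model's prediction against it. The paper scores semantic consistency with a reward model; the lexical token-overlap proxy below is only a cheap stand-in to show the shape of a label-free reward:

```python
def overlap_f1(pred, gold):
    """Token-overlap F1 between prediction and ground-truth segment.

    A lexical proxy for illustration only; RLPT judges semantic
    consistency with a reward model rather than string overlap.
    """
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def segment_reward(pred, gold, threshold=0.7):
    # Binary reward: 1.0 if the predicted segment is close enough to
    # the ground truth pulled from the corpus, else 0.0. The threshold
    # value here is an arbitrary illustrative choice.
    return 1.0 if overlap_f1(pred, gold) >= threshold else 0.0
```

The key point is that `gold` comes from the pre-training text itself, so the reward requires no human annotation.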