https://arxiv.org/abs/2509.19249

Core Technique of RLPT

RLPT (Reinforcement Learning on Pre-Training data) introduces a clever way to apply reinforcement learning to unlabeled text without needing human annotations. Here's the concrete mechanism:

The Central Idea: Next-Segment Reasoning

Instead of training models to predict the next token (traditional pre-training), RLPT trains models to predict the next segment (typically a sentence) through explicit reasoning.

The setup:

  1. Take any text and split it into segments: [s₁, s₂, s₃, ..., sₙ]
  2. Create training examples of the form: (context, target_segment, next_segment)
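The two steps above can be sketched in a few lines. The segmentation rule, the context window size `k`, and the field names are illustrative assumptions, not the paper's exact preprocessing:

```python
import re

def split_segments(text):
    # Naive sentence segmentation; RLPT segments are typically sentence-level.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def make_examples(text, k=2):
    """Build (context, target, following) examples from raw text.

    `k` (a hypothetical choice) is how many preceding segments form the context;
    `following` is the segment after the target, available for MSR-style tasks.
    """
    segments = split_segments(text)
    examples = []
    for i in range(1, len(segments)):
        examples.append({
            "context": " ".join(segments[max(0, i - k):i]),
            "target": segments[i],
            "following": segments[i + 1] if i + 1 < len(segments) else None,
        })
    return examples

exs = make_examples("RL is useful. It needs rewards. Rewards can be self-supervised.")
# Each example pairs a context with the ground-truth segment to predict.
```

Because the targets come straight from the corpus, every example carries its own supervision signal with no annotation step.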

Two Complementary Tasks

ASR (Autoregressive Segment Reasoning): given the preceding context, the model reasons explicitly and then predicts the next segment.

MSR (Middle Segment Reasoning): given the preceding context and the segment that follows, the model reconstructs the masked middle segment.
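Both tasks can be driven from the same segment split; only the prompt differs. The templates below are hypothetical illustrations of that difference (the paper's actual prompt wording may differ):

```python
# Hypothetical prompt templates: ASR sees only the preceding context,
# while MSR additionally sees the segment that follows the masked one.
ASR_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Reason step by step, then predict the next sentence."
)
MSR_TEMPLATE = (
    "Context:\n{context}\n\n"
    "The sentence after the missing one is:\n{following}\n\n"
    "Reason step by step, then reconstruct the missing sentence."
)

def build_prompt(task, context, following=None):
    if task == "ASR":
        return ASR_TEMPLATE.format(context=context)
    if task == "MSR":
        return MSR_TEMPLATE.format(context=context, following=following)
    raise ValueError(f"unknown task: {task}")

asr_prompt = build_prompt("ASR", "RL is useful.")
msr_prompt = build_prompt("MSR", "RL is useful.",
                          following="Rewards can be self-supervised.")
```

MSR gives the model bidirectional evidence (fill-in-the-middle style), which complements the purely left-to-right ASR objective.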

The Reward Mechanism

The key innovation is deriving rewards without human labels:
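Since the ground-truth segment is already in the corpus, a reward can be computed by comparing the model's prediction against it. The paper scores semantic consistency with a reward model; the lexical token-overlap proxy below is only a cheap stand-in to show the shape of a label-free reward:

```python
def overlap_f1(pred, gold):
    """Token-overlap F1 between prediction and ground-truth segment.

    A lexical proxy for illustration only; RLPT judges semantic
    consistency with a reward model rather than string overlap.
    """
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def segment_reward(pred, gold, threshold=0.7):
    # Binary reward: 1.0 if the predicted segment is close enough to
    # the ground truth pulled from the corpus, else 0.0. The threshold
    # value here is an arbitrary illustrative choice.
    return 1.0 if overlap_f1(pred, gold) >= threshold else 0.0
```

The key point is that `gold` comes from the pre-training text itself, so the reward requires no human annotation.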