
Core Technique of PretrainZero: Reinforcement Active Pretraining

PretrainZero's central innovation is a coupled adversarial learning framework with two interacting tasks that share a single LLM. Here's the concrete mechanism:

The Two-Task Setup

Task 1: Mask Generation

Given a Wikipedia paragraph, the model generates a word span to mask. For example:

Input: "Ralph Hanover won seventeen additional stakes events..."
Output: \mask{stakes}
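To make the interface concrete, here is a minimal sketch of how a generator output in this format could be parsed and applied to the source paragraph. The \mask{...} tag comes from the example above; the regex, the helper name apply_generated_mask, and the "[mask]" placeholder token are assumptions, not the paper's exact implementation.

```python
import re

# "[mask]" is assumed as the placeholder token; the paper's exact token may differ.
MASK_TOKEN = "[mask]"

def apply_generated_mask(paragraph: str, generator_output: str):
    """Parse the \\mask{...} span from the generator's output and hide it
    in the paragraph, returning (masked_paragraph, ground_truth_span)."""
    match = re.search(r"\\mask\{(.+?)\}", generator_output)
    if match is None:
        return None  # malformed generation; skip this sample
    span = match.group(1)
    if span not in paragraph:
        return None  # the proposed span must occur verbatim in the paragraph
    masked = paragraph.replace(span, MASK_TOKEN, 1)  # hide only the first occurrence
    return masked, span

paragraph = "Ralph Hanover won seventeen additional stakes events..."
masked, answer = apply_generated_mask(paragraph, r"\mask{stakes}")
print(masked)  # Ralph Hanover won seventeen additional [mask] events...
print(answer)  # stakes
```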

Task 2: Mask Prediction

Given the masked text, the model produces chain-of-thought reasoning to predict the masked content:

Input: "Ralph Hanover won seventeen additional [mask] events..."
Output: [Thinking] Analyzing context... racing events... → \boxed{stakes}
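The predictor's final answer is then scored against the span the generator hid. A minimal sketch of that scoring step, assuming the answer is read from the last \boxed{...} in the chain of thought and that the reward is a simple exact-match check (the normalization details are assumptions):

```python
import re

def extract_boxed_answer(prediction: str):
    """Return the content of the last \\boxed{...} in the model's output, if any."""
    matches = re.findall(r"\\boxed\{(.+?)\}", prediction)
    return matches[-1] if matches else None

def prediction_reward(prediction: str, ground_truth_span: str) -> float:
    """Binary reward: 1.0 if the predicted span matches the masked span, else 0.0."""
    answer = extract_boxed_answer(prediction)
    if answer is None:
        return 0.0
    return float(answer.strip().lower() == ground_truth_span.strip().lower())

output = "[Thinking] The context suggests racing events... \\boxed{stakes}"
print(prediction_reward(output, "stakes"))  # 1.0
```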

The Min-Max Adversarial Objective

The key insight is formulating this as a game between generator and predictor:

$$
\omega^* = \arg\min_{\omega'} \max_{\omega} \; \mathbb{E}_{s \sim D}\!\left[\mathbb{E}_{m \sim \pi_{\omega'}(\cdot \mid s),\, \hat{x} \sim \psi_{\omega}(\cdot \mid m, s)}\big[R(s, m, \hat{x})\big]\right]
$$

Where:

- $s \sim D$: a paragraph sampled from the pretraining corpus (here, Wikipedia)
- $m \sim \pi_{\omega'}(\cdot \mid s)$: the mask (word span to hide) proposed by the generator policy
- $\hat{x} \sim \psi_{\omega}(\cdot \mid m, s)$: the predictor's reconstruction of the masked content
- $R(s, m, \hat{x})$: the reward, scoring whether the prediction matches the hidden span

This creates opposing incentives: the generator learns to propose spans the predictor struggles to recover, while the predictor learns to recover them.
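One way to see how the min-max objective plays out in a single training round: the predictor is reinforced with the reward $R$, while the generator is reinforced with its complement, so the shared model is pushed in opposite directions for the two roles. The sketch below is schematic, not the paper's algorithm; sample_mask, predict_masked, and update are hypothetical placeholders, and the binary reward and the $1 - R$ generator signal are assumptions about how the opposing incentives could be implemented.

```python
import random

def adversarial_step(paragraph: str, sample_mask, predict_masked, update) -> float:
    """One schematic round of reinforcement active pretraining."""
    span = sample_mask(paragraph)                 # Task 1: generator picks a span to hide
    masked = paragraph.replace(span, "[mask]", 1)
    guess = predict_masked(masked)                # Task 2: predictor reconstructs the span
    reward = float(guess == span)                 # R(s, m, x_hat): exact-match check
    update("predictor", reward)                   # max over omega: predictor reinforced by R
    update("generator", 1.0 - reward)             # min over omega': generator rewarded when predictor fails
    return reward

# Toy stand-ins so the sketch runs end to end; the real policies are the shared LLM.
paragraph = "Ralph Hanover won seventeen additional stakes events..."
words = paragraph.rstrip(".").split()
reward = adversarial_step(
    paragraph,
    sample_mask=lambda s: random.choice(words),
    predict_masked=lambda masked: "stakes",
    update=lambda role, r: print(f"{role}: reward {r}"),
)
print("predictor reward:", reward)
```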