- KL Regularized RL is designed to mode collapse, 2025
- RLTF (Reinforcement Learning from Text Feedback), 2026
- Learning from Negative Feedback, or Positive Feedback, or Both (PMPO), GDM 2024

- Privileged Information Distillation for Language Models (π-Distill)
- Reinforcement Learning via Self-Distillation
- POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration, 2026
  - Provides prefixes of golden solutions as privileged hints during exploration (sketch below)
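A minimal sketch of the prefix idea; `generate` is an assumed policy-sampling callable, and the character-level prefix fractions are illustrative rather than the paper's schedule:

```python
def sample_with_privileged_prefix(generate, problem, golden_solution,
                                  fractions=(0.25, 0.5, 0.75)):
    """Seed rollouts with prefixes of a golden solution so that hard
    problems still yield some successful on-policy trajectories."""
    rollouts = []
    for frac in fractions:
        # Reveal the first `frac` of the reference solution as a hint.
        cut = int(len(golden_solution) * frac)
        hint = golden_solution[:cut]
        # The policy completes on-policy from the hinted prompt.
        completion = generate(problem + "\n" + hint)
        rollouts.append({"problem": problem, "hint": hint,
                         "completion": completion, "prefix_frac": frac})
    return rollouts
```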
- Self-distilled reasoner, 2025
- Self-Distillation Enables Continual Learning
- Tapered Off-Policy REINFORCE (TOPR)
  - Applies the clipped importance ratio only to negative samples; positives are treated as on-policy (sketch below)
  - Keeps the positive share of each batch at 10-20%
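A minimal sketch of the asymmetric weighting, assuming per-sequence log probs and signed advantages; the exact TOPR objective may differ in details such as baselines:

```python
import torch

def topr_loss(logp_new, logp_old, advantages):
    """REINFORCE with asymmetric off-policy correction: negatives get an
    importance ratio clipped at 1, positives are treated as on-policy."""
    ratio = torch.exp(logp_new - logp_old)  # pi_theta / pi_behavior
    weight = torch.where(advantages > 0,
                         torch.ones_like(ratio),   # positives: weight 1
                         ratio.clamp(max=1.0))     # negatives: clipped ratio
    # The weight only rescales the REINFORCE term, so keep it out of autograd.
    return -(weight.detach() * advantages * logp_new).mean()
```

The 10-20% positive share reads as a batch-composition choice upstream of this loss (how many positive-reward sequences to include per batch), not part of the objective itself.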
- PretrainZero
- RLP: Reinforcement Learning as a Pretraining Objective, 2026
  - Randomly pick tokens to predict, sample a thought, and reward by how much the thought improves their log prob; no problem set needed, just pretraining data (sketch below)
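A sketch of the reward, assuming a helper `logprob_of(prefix, targets)` that returns the summed log prob of `targets` given `prefix`; whether the no-thought baseline uses the same model or a frozen one is a detail this sketch glosses over:

```python
def rlp_reward(logprob_of, context, targets, thought):
    """Score a sampled thought by the log-prob gain it gives the next
    tokens of ordinary pretraining text."""
    without_thought = logprob_of(context, targets)
    with_thought = logprob_of(context + thought, targets)
    # Positive iff conditioning on the thought made the tokens more likely.
    return with_thought - without_thought
```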
- RLPT: Reinforcement Learning on Pre-Training Data, 2025
  - Predict held-out segments; reward = 1 iff the prediction is semantically equivalent to the ground truth (sketch below)
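A sketch of one step, with `generate` (policy sampler) and `is_equivalent` (semantic-equivalence grader, e.g. an LLM judge) as assumed interfaces:

```python
def rlpt_step(generate, is_equivalent, document, split_at):
    """Hide the tail of a pretraining document, predict it, and pay a
    binary reward for a semantically equivalent prediction."""
    prefix, segment = document[:split_at], document[split_at:]
    prediction = generate(prefix)
    reward = 1.0 if is_equivalent(prediction, segment) else 0.0
    return prediction, reward
```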
- AGRO
- Asymmetric REINFORCE, FAIR 2025
- Group Sequence Policy Optimization (GSPO), Qwen (Alibaba) 2025