- Surveys
- Random notes
- RLOO = REINFORCE leave-one-out = like GRPO, but each sample's baseline is the group mean computed with the current sample left out
- RL challenges
- Sparse rewards: common to get reward only at the end of the level or game, many steps after the actions that earned it
- Model builders commonly do reward shaping, adding small intermediate rewards for states known to be good along the way, but these are human-designed heuristics and require domain-specific expertise
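The RLOO vs. GRPO distinction in the note above, as a minimal numpy sketch (function names are mine; GRPO's std normalization is omitted to isolate the baseline difference):

```python
import numpy as np

def rloo_advantages(rewards):
    # leave-one-out baseline: mean reward of the OTHER samples in the group
    rewards = np.asarray(rewards, dtype=float)
    baseline = (rewards.sum() - rewards) / (len(rewards) - 1)
    return rewards - baseline

def grpo_mean_advantages(rewards):
    # GRPO-style baseline: the group mean INCLUDES the current sample
    # (full GRPO also divides by the group std; omitted here for contrast)
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

group = [1.0, 0.0, 0.0, 1.0]            # e.g. pass/fail rewards for 4 samples
adv_rloo = rloo_advantages(group)       # [ 2/3, -2/3, -2/3,  2/3]
adv_grpo = grpo_mean_advantages(group)  # [ 0.5, -0.5, -0.5,  0.5]
```

With n samples per group, the RLOO advantage is exactly n/(n-1) times the mean-including advantage; excluding the current sample keeps the baseline independent of it.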

Papers
- Privileged Information Distillation for Language Models (π-Distill)
- Reinforcement Learning via Self-Distillation
- POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration, 2026
- Provide prefixes of golden solutions
- Self-distilled reasoner, 2025
- Self-Distillation Enables Continual Learning
- Tapered Off-Policy REINFORCE
- Apply the clipped importance ratio only to negatives, not positives
- And keep the positive ratio at 10-20%
- PretrainZero
- RLP: Reinforcement Learning as a Pretraining Objective, 2026
- Randomly choose tokens to predict; sample a thought first and reward it by how much it improves the token's log prob. No problem set needed, just use pretraining data
- RLPT: Reinforcement Learning on Pre-Training Data, 2025
- Predict segments of pretraining text; reward = 1 iff the prediction is semantically equivalent to the reference
- AGRO
- Asymmetric REINFORCE, FAIR 2025
- Group Sequence Policy Optimization, Bytedance 2025
- Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
- On a few pitfalls in KL divergence gradient estimation for RL, 2025
- All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning, 2025
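The asymmetric clipping in the Tapered Off-Policy REINFORCE note above can be sketched as follows (function name, signature, and the exact clip range are my assumptions, not the paper's):

```python
import numpy as np

def tapered_reinforce_weights(logp_new, logp_old, advantages, clip_max=1.0):
    # Per the note: the clipped importance ratio is applied only where the
    # advantage is negative; positive-advantage tokens get plain REINFORCE
    # (weight 1, no off-policy correction).
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages, dtype=float)
    weights = np.ones_like(ratio)
    neg = adv < 0
    weights[neg] = np.clip(ratio[neg], 0.0, clip_max)
    return weights * adv  # multiplies the per-token log-prob gradient

w = tapered_reinforce_weights(
    logp_new=[-1.0, -1.0], logp_old=[-1.5, -0.5], advantages=[1.0, -1.0]
)
# positive token keeps weight 1; negative token is scaled by its clipped ratio
```

Clipping only the negatives bounds how hard stale samples can push probability mass down, while positives are reinforced at full strength.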
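The RLP reward in the note above reduces to a log-prob improvement from conditioning on the sampled thought; a toy sketch (the interface and names are my assumptions):

```python
import math

def rlp_style_reward(p_with_thought, p_without_thought):
    # Information-gain style reward per the RLP note: how much a sampled
    # thought improves the probability of the next pretraining token.
    return math.log(p_with_thought) - math.log(p_without_thought)

# a thought that doubles the token's probability earns log(2) reward;
# a useless thought earns 0
r = rlp_style_reward(0.4, 0.2)
```

Because the reward is a difference of log probs computed on ordinary text, any pretraining corpus supplies the "problems" for free.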