- Surveys
- Random notes
- RLOO = REINFORCE leave-one-out = like GRPO, but each sample's baseline is the mean reward of the other samples in the group, leaving the current one out (sketched after this list)
- RL challenges
- Sparse rewards: common to get a reward only at the end of the level or game, many steps away from the actions that earned it
- Model builders commonly do reward shaping: adding small intermediate rewards for things known to be good along the way, but these are hand-designed heuristics and require domain-specific expertise (toy example after this list)
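
A minimal sketch of the baseline difference between GRPO and RLOO for one group of sampled responses. The function names and the binary rewards are illustrative, and GRPO's std-normalization is one common variant rather than the only formulation:

```python
import numpy as np

def grpo_advantages(rewards):
    # GRPO-style: baseline is the mean over the whole group (including the current
    # sample), commonly normalized by the group's standard deviation
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def rloo_advantages(rewards):
    # RLOO: each sample's baseline is the mean reward of the *other* samples,
    # i.e., the current sample is left out of its own baseline
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    loo_mean = (r.sum() - r) / (k - 1)
    return r - loo_mean

rewards = [1.0, 0.0, 0.0, 1.0]  # e.g., binary correctness of 4 sampled answers
print(grpo_advantages(rewards))  # mean-centered, std-normalized
print(rloo_advantages(rewards))  # leave-one-out centered
```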
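And a toy illustration of reward shaping for the sparse-reward case, assuming a hypothetical game where distance to the goal can be measured; the progress bonus is exactly the kind of hand-designed heuristic the note refers to:

```python
def shaped_reward(env_reward, prev_dist_to_goal, dist_to_goal, coef=0.1):
    # env_reward is the sparse reward (usually 0 until the end of the level/game);
    # the bonus pays out a little reward for getting closer to the goal each step
    progress_bonus = coef * (prev_dist_to_goal - dist_to_goal)
    return env_reward + progress_bonus
```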

Papers
Test time RL
concretely explain how the core technique works
I'll explain the core technique of TTRL (Test-Time Reinforcement Learning) step by step.
The core insight of TTRL is that it performs reinforcement learning on unlabeled test data by using majority voting to estimate rewards. Here's how it works:
- The Problem: During inference on test data, we don't have ground-truth labels for training. Traditional RL requires rewards (typically derived from correct answers), but we can't compute rewards without the true answers.
- The TTRL Solution (sketched below):
    - Generate multiple responses (e.g., 64) for each test question using the model
    - Extract answers from all of these responses
    - Use majority voting to determine the most likely correct answer
    - Use this majority-voted answer as a pseudo-label to compute rewards
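
A minimal sketch of the pseudo-labeling step, assuming the final answers have already been extracted from each sampled response as strings (the answer extraction itself is not shown):

```python
from collections import Counter

def majority_vote(answers):
    # Most common extracted answer across the sampled responses becomes the pseudo-label;
    # responses with no parseable answer (None) are ignored
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0] if counts else None

# e.g., answers extracted from 6 sampled responses to one test question
answers = ["42", "42", "41", "42", None, "42"]
pseudo_label = majority_vote(answers)  # "42"
```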
- The Reward Calculation: R(predicted_answer, majority_answer) = 1 if predicted_answer == majority_answer, 0 otherwise
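
The same rule written as code, assuming exact string match between extracted answers (the paper's actual matching rule is not specified here):

```python
def ttrl_reward(predicted_answer, majority_answer):
    # Binary reward against the majority-voted pseudo-label
    return 1.0 if predicted_answer is not None and predicted_answer == majority_answer else 0.0
```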
- The RL Training Process (one step sketched below):
    - Sample multiple responses for each test question
    - Calculate rewards via majority voting against the pseudo-label
    - Train the model with standard RL algorithms (e.g., GRPO or PPO) using these estimated rewards
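
Putting the pieces together, a minimal sketch of one TTRL-style update on a single unlabeled test question, using GRPO-style group-normalized advantages. The callables `sample_fn`, `extract_answer_fn`, and `update_fn` are caller-supplied placeholders (not a specific library API), and the majority-vote and reward logic from the sketches above is repeated so the function is self-contained:

```python
import numpy as np
from collections import Counter

def ttrl_step(sample_fn, extract_answer_fn, update_fn, question, num_samples=64):
    # 1) Sample a group of responses for the unlabeled test question
    responses = [sample_fn(question) for _ in range(num_samples)]
    answers = [extract_answer_fn(r) for r in responses]

    # 2) Majority vote over extracted answers -> pseudo-label
    counts = Counter(a for a in answers if a is not None)
    pseudo_label = counts.most_common(1)[0][0] if counts else None

    # 3) Binary rewards against the pseudo-label, then GRPO-style normalized advantages
    rewards = np.array([1.0 if a is not None and a == pseudo_label else 0.0 for a in answers])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 4) Policy-gradient update on the sampled responses with the estimated advantages
    update_fn(question, responses, advantages)
```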