- Surveys
- Random notes
- RLOO = REINFORCE leave-one-out = like GRPO, but each sample's baseline is the mean reward of the other samples in the group, leaving the current one out (sketched after this list)
- RL challenges
- Sparse rewards: common to get a reward only at the end of the level or game, many steps away from the actions that earned it
- Model builders commonly do reward shaping: adding small intermediate rewards for things known to be good along the way, but these are hand-designed heuristics and require domain-specific expertise (toy example after this list)
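
A minimal sketch of the baseline difference between GRPO and RLOO for one group of sampled responses. The function names and the binary rewards are illustrative, and GRPO's std-normalization is one common variant rather than the only formulation:

```python
import numpy as np

def grpo_advantages(rewards):
    # GRPO-style: baseline is the mean over the whole group (including the current
    # sample), commonly normalized by the group's standard deviation
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def rloo_advantages(rewards):
    # RLOO: each sample's baseline is the mean reward of the *other* samples,
    # i.e., the current sample is left out of its own baseline
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    loo_mean = (r.sum() - r) / (k - 1)
    return r - loo_mean

rewards = [1.0, 0.0, 0.0, 1.0]  # e.g., binary correctness of 4 sampled answers
print(grpo_advantages(rewards))  # mean-centered, std-normalized
print(rloo_advantages(rewards))  # leave-one-out centered
```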
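And a toy illustration of reward shaping for the sparse-reward case, assuming a hypothetical game where distance to the goal can be measured; the progress bonus is exactly the kind of hand-designed heuristic the note refers to:

```python
def shaped_reward(env_reward, prev_dist_to_goal, dist_to_goal, coef=0.1):
    # env_reward is the sparse reward (usually 0 until the end of the level/game);
    # the bonus pays out a little reward for getting closer to the goal each step
    progress_bonus = coef * (prev_dist_to_goal - dist_to_goal)
    return env_reward + progress_bonus
```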

Papers
Test time RL
concretely explain how the core technique works
I'll explain the core technique of TTRL (Test-Time Reinforcement Learning) step by step.
The core insight of TTRL is that it performs reinforcement learning on unlabeled test data by using majority voting to estimate rewards. Here's how it works:
- The Problem: During inference on test data, we don't have ground-truth labels for training. Traditional RL requires rewards (typically derived from correct answers), but we can't compute rewards without the true answers.
- The TTRL Solution (sketched below):
    - Generate multiple responses (e.g., 64) for each test question using the model
    - Extract answers from all of these responses
    - Use majority voting to determine the most likely correct answer
    - Use this majority-voted answer as a pseudo-label to compute rewards
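
A minimal sketch of the pseudo-labeling step, assuming the final answers have already been extracted from each sampled response as strings (the answer extraction itself is not shown):

```python
from collections import Counter

def majority_vote(answers):
    # Most common extracted answer across the sampled responses becomes the pseudo-label;
    # responses with no parseable answer (None) are ignored
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0] if counts else None

# e.g., answers extracted from 6 sampled responses to one test question
answers = ["42", "42", "41", "42", None, "42"]
pseudo_label = majority_vote(answers)  # "42"
```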
- The Reward Calculation: R(predicted_answer, majority_answer) = 1 if predicted_answer == majority_answer, 0 otherwise
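
The same rule written as code, assuming exact string match between extracted answers (the paper's actual matching rule is not specified here):

```python
def ttrl_reward(predicted_answer, majority_answer):
    # Binary reward against the majority-voted pseudo-label
    return 1.0 if predicted_answer is not None and predicted_answer == majority_answer else 0.0
```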
- The RL Training Process (one step sketched below):
    - Sample multiple responses for each test question
    - Calculate rewards via majority voting against the pseudo-label
    - Train the model with standard RL algorithms (e.g., GRPO or PPO) using these estimated rewards
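
Putting the pieces together, a minimal sketch of one TTRL-style update on a single unlabeled test question, using GRPO-style group-normalized advantages. The callables `sample_fn`, `extract_answer_fn`, and `update_fn` are caller-supplied placeholders (not a specific library API), and the majority-vote and reward logic from the sketches above is repeated so the function is self-contained:

```python
import numpy as np
from collections import Counter

def ttrl_step(sample_fn, extract_answer_fn, update_fn, question, num_samples=64):
    # 1) Sample a group of responses for the unlabeled test question
    responses = [sample_fn(question) for _ in range(num_samples)]
    answers = [extract_answer_fn(r) for r in responses]

    # 2) Majority vote over extracted answers -> pseudo-label
    counts = Counter(a for a in answers if a is not None)
    pseudo_label = counts.most_common(1)[0][0] if counts else None

    # 3) Binary rewards against the pseudo-label, then GRPO-style normalized advantages
    rewards = np.array([1.0 if a is not None and a == pseudo_label else 0.0 for a in answers])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 4) Policy-gradient update on the sampled responses with the estimated advantages
    update_fn(question, responses, advantages)
```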