concretely explain how the core technique works
I'll explain the core technique of TTRL (Test-Time Reinforcement Learning) step by step.
The core insight of TTRL is that it performs reinforcement learning on unlabeled test data by using majority voting to estimate rewards. Here's how it works:
The Problem: During inference on test data, we don't have ground-truth labels for training. Traditional RL requires rewards (typically derived from correct answers), but we can't compute rewards without the true answers.
The TTRL Solution: For each unlabeled test question, sample many candidate answers from the model, take the majority-voted answer as a pseudo-label, reward each sample for agreeing with that pseudo-label, and update the model's weights with standard RL on those rewards, all at test time.
The Reward Calculation: Each sampled answer is scored against the majority-voted answer:
R(predicted_answer, majority_answer) = {
    1  if predicted_answer = majority_answer
    0  otherwise
}
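As a minimal sketch of this reward (the function name and answer strings below are illustrative, not the paper's code), majority voting and reward assignment for one test question could look like:

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """For one unlabeled test question: take the most common of the model's
    sampled answers as a pseudo-label, then reward each sample 1.0 if it
    matches that pseudo-label and 0.0 otherwise."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1.0 if ans == majority_answer else 0.0 for ans in sampled_answers]
    return majority_answer, rewards

# 8 samples for one question; 5 of them agree on "42"
answers = ["42", "42", "17", "42", "42", "36", "42", "17"]
pseudo_label, rewards = majority_vote_reward(answers)
print(pseudo_label)  # "42"
print(rewards)       # [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```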
The RL Training Process: The majority-vote rewards are plugged into a standard policy-optimization algorithm (such as GRPO or PPO), so samples that agree with the pseudo-label are reinforced and samples that disagree are discouraged; the model's weights are updated directly on the test prompts, with no human labels involved.
The key insight is that majority voting often produces accurate labels even when individual predictions are noisy. This allows the model to self-improve on test data without access to ground truth, enabling continuous learning on new tasks and domains.
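To make the update step concrete, here is a simplified, hypothetical sketch of how those rewards could drive a policy-gradient update (group-normalized advantages in the spirit of GRPO; the paper's exact objective, clipping, and KL terms are not reproduced here, and PyTorch is assumed):

```python
import torch

def ttrl_policy_loss(logprobs, rewards):
    """Policy-gradient loss for one question's group of sampled answers.

    logprobs: tensor of shape [N], the log-probability of each sampled answer
              under the current policy.
    rewards:  tensor of shape [N], 1.0 where the answer matched the majority
              vote and 0.0 elsewhere (the TTRL pseudo-reward from above).
    Advantages are normalized within the group; clipping, KL penalties, etc.
    are omitted from this sketch.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # REINFORCE-style objective: push up the likelihood of answers that agree
    # with the majority, push down the likelihood of answers that disagree.
    return -(advantages * logprobs).mean()

# Toy usage with the rewards from the majority-voting example above.
logprobs = torch.randn(8, requires_grad=True)  # stand-in for real policy log-probs
rewards = torch.tensor([1., 1., 0., 1., 1., 0., 1., 0.])
loss = ttrl_policy_loss(logprobs, rewards)
loss.backward()  # these gradients would then drive an optimizer step on the policy
```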
The paper reports that this approach can lead to substantial improvements: for example, a roughly 159% relative gain on AIME 2024 using only unlabeled test data.
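For context on what a 159% relative gain means: going from, say, 10% to 25.9% accuracy would be a (25.9 - 10) / 10 ≈ 159% improvement, i.e. roughly 2.6x the starting score (these start and end numbers are only an illustration, not figures from the paper).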
i don't get it, how is this test time RL? you mean we're just training the model on its own majority answer, if so why is that desirable? can you give a concrete example?