This paper introduces a fascinating perspective on how Reinforcement Learning with Verifiable Rewards (RLVR) actually works. Let me break down the core technique and insight:
The Core Problem
The researchers address a paradox: when language models are trained with RLVR (as in OpenAI's o1 or DeepSeek-R1), they often score worse on Pass@K (the probability of producing at least one correct answer in K attempts) than their base models, especially at large K. This led many to conclude that RLVR doesn't actually improve reasoning; it just reshuffles capabilities the base model already had.
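For concreteness, Pass@K is usually estimated with the unbiased combinatorial formula from the code-generation literature: sample n completions, count the c whose final answer is correct, and compute the chance that a random subset of K contains at least one of them. A minimal sketch (the function name and example numbers are mine, not the paper's):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: probability that at least one of k
    completions drawn from n samples (c of which have a correct final
    answer) is correct. Note that it only checks the final answer."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=30, k=8))  # ~0.95
```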
The Key Insight: It's About Reasoning Quality, Not Just Answers
The authors argue that Pass@K is a flawed metric because it only checks if the final answer is correct, not whether the reasoning is sound. They discovered that base LLMs often get correct answers through:
- Lucky guesses
- Flawed reasoning that coincidentally works
- Incomplete logic chains
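To see why answer-only checking inflates with K, consider pure guessing: if each sampled answer happens to be correct with probability p even when the reasoning is invalid, the chance of at least one hit in K attempts is 1 - (1 - p)^K, which approaches 1 as K grows. A small illustration (p = 0.25 is an arbitrary choice, not a number from the paper):

```python
# Probability that at least one of K independent attempts lands on the
# right answer when each attempt "guesses" correctly with probability p,
# regardless of whether its reasoning is valid.
p = 0.25
for k in (1, 8, 64, 256):
    print(k, 1 - (1 - p) ** k)
# K=1 -> 0.25, K=8 -> ~0.90, K=64 and K=256 -> ~1.0
```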
The Core Technique: CoT-Pass@K
They introduce a new evaluation metric called CoT-Pass@K that requires BOTH:
- The chain-of-thought (reasoning process) to be correct
- The final answer to be correct
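The paper gives a formal definition; as a rough sketch, the same combinatorial estimator can be reused, but a sample only counts as a success if both its reasoning and its final answer pass verification (the `Sample` dataclass and helper below are hypothetical, not the authors' code):

```python
from dataclasses import dataclass
from math import comb

@dataclass
class Sample:
    answer_correct: bool  # final answer matches the ground truth
    cot_correct: bool     # chain of thought judged logically sound and complete

def cot_pass_at_k(samples: list[Sample], k: int) -> float:
    """CoT-Pass@K sketch: success requires BOTH a correct chain of thought
    and a correct final answer; otherwise the formula mirrors Pass@K."""
    n = len(samples)
    c = sum(s.answer_correct and s.cot_correct for s in samples)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```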
How They Implement It:
- Automated Verification: They use a powerful LLM (DeepSeek-R1) as a judge to verify if reasoning chains are logically sound and complete
- Multiple Verification Strategies: To handle potential errors from the LLM judge, they verify each chain of thought multiple times and combine the verdicts with one of three strategies (see the sketch after this list):
  - Any-correct: accept if at least one verification pass says the reasoning is correct
  - All-correct: accept only if every verification pass agrees it is correct
  - Majority-correct: accept based on a majority vote across passes
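A minimal sketch of the three aggregation rules, assuming each chain of thought receives several independent boolean verdicts from the judge (the function name and the tie behavior of the majority rule are my assumptions):

```python
def aggregate_verdicts(verdicts: list[bool], strategy: str = "majority") -> bool:
    """Combine repeated LLM-judge verdicts for a single chain of thought."""
    if strategy == "any":       # any-correct: one positive verdict suffices
        return any(verdicts)
    if strategy == "all":       # all-correct: every verdict must be positive
        return all(verdicts)
    if strategy == "majority":  # majority-correct: strict majority of verdicts
        return sum(verdicts) > len(verdicts) / 2
    raise ValueError(f"unknown strategy: {strategy}")

print(aggregate_verdicts([True, False, True], "majority"))  # True
```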
The Mathematical Foundation
The paper provides a theorem showing that RLVR algorithms (specifically GRPO) implicitly incentivize correct reasoning because:
- Correct reasoning chains produce correct answers with a higher probability (α) than incorrect chains do (β), i.e. α > β
- The GRPO advantage function therefore assigns, in expectation, a positive advantage to responses with correct reasoning
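To make the second point concrete: GRPO normalizes each response's reward against the other responses sampled for the same prompt, so with a binary verifiable reward the correct-answer responses sit above the group mean and receive positive advantage. Because correct reasoning yields a correct answer with probability α > β, correct-reasoning responses are, in expectation, the ones being pushed up. A toy sketch of the group-relative advantage (the numbers are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage used by GRPO: normalize each response's
    reward by the mean and standard deviation of its sampling group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Six responses to one prompt; reward 1 = verified-correct final answer.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # positive for the two rewarded responses
```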