This paper introduces a fascinating perspective on how Reinforcement Learning with Verifiable Rewards (RLVR) actually works. Let me break down the core technique and insight:
The Core Problem
The researchers address a paradox: when language models are trained with RLVR (as in OpenAI's o1 or DeepSeek-R1), they often score worse on Pass@K (the probability of producing at least one correct answer in K attempts) than their base models, especially at large K. This led many to conclude that RLVR doesn't actually improve reasoning; it just reshuffles capabilities the base model already had.
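For concreteness, Pass@K is usually estimated with the unbiased combinatorial formula from the code-generation literature: sample n completions, count the c whose final answer is correct, and compute the chance that a random subset of K contains at least one of them. A minimal sketch (the function name and example numbers are mine, not the paper's):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: probability that at least one of k
    completions drawn from n samples (c of which have a correct final
    answer) is correct. Note that it only checks the final answer."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=30, k=8))  # ~0.95
```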
The Key Insight: It's About Reasoning Quality, Not Just Answers
The authors argue that Pass@K is a flawed metric because it only checks if the final answer is correct, not whether the reasoning is sound. They discovered that base LLMs often get correct answers through:
- Lucky guesses
- Flawed reasoning that coincidentally works
- Incomplete logic chains
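To see why answer-only checking inflates with K, consider pure guessing: if each sampled answer happens to be correct with probability p even when the reasoning is invalid, the chance of at least one hit in K attempts is 1 - (1 - p)^K, which approaches 1 as K grows. A small illustration (p = 0.25 is an arbitrary choice, not a number from the paper):

```python
# Probability that at least one of K independent attempts lands on the
# right answer when each attempt "guesses" correctly with probability p,
# regardless of whether its reasoning is valid.
p = 0.25
for k in (1, 8, 64, 256):
    print(k, 1 - (1 - p) ** k)
# K=1 -> 0.25, K=8 -> ~0.90, K=64 and K=256 -> ~1.0
```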
The Core Technique: CoT-Pass@K
They introduce a new evaluation metric called CoT-Pass@K that requires BOTH:
- The chain-of-thought (reasoning process) to be correct
- The final answer to be correct
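The paper gives a formal definition; as a rough sketch, the same combinatorial estimator can be reused, but a sample only counts as a success if both its reasoning and its final answer pass verification (the `Sample` dataclass and helper below are hypothetical, not the authors' code):

```python
from dataclasses import dataclass
from math import comb

@dataclass
class Sample:
    answer_correct: bool  # final answer matches the ground truth
    cot_correct: bool     # chain of thought judged logically sound and complete

def cot_pass_at_k(samples: list[Sample], k: int) -> float:
    """CoT-Pass@K sketch: success requires BOTH a correct chain of thought
    and a correct final answer; otherwise the formula mirrors Pass@K."""
    n = len(samples)
    c = sum(s.answer_correct and s.cot_correct for s in samples)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```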
How They Implement It:
- Automated Verification: They use a powerful LLM (DeepSeek-R1) as a judge to verify if reasoning chains are logically sound and complete
- Multiple Verification Strategies: To handle potential errors from the LLM judge, they verify each chain of thought multiple times and combine the verdicts with one of three strategies (see the sketch after this list):
  - Any-correct: accept if at least one verification pass says the reasoning is correct
  - All-correct: accept only if every verification pass agrees it is correct
  - Majority-correct: accept based on a majority vote across passes
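A minimal sketch of the three aggregation rules, assuming each chain of thought receives several independent boolean verdicts from the judge (the function name and the tie behavior of the majority rule are my assumptions):

```python
def aggregate_verdicts(verdicts: list[bool], strategy: str = "majority") -> bool:
    """Combine repeated LLM-judge verdicts for a single chain of thought."""
    if strategy == "any":       # any-correct: one positive verdict suffices
        return any(verdicts)
    if strategy == "all":       # all-correct: every verdict must be positive
        return all(verdicts)
    if strategy == "majority":  # majority-correct: strict majority of verdicts
        return sum(verdicts) > len(verdicts) / 2
    raise ValueError(f"unknown strategy: {strategy}")

print(aggregate_verdicts([True, False, True], "majority"))  # True
```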
The Mathematical Foundation
The paper provides a theorem showing that RLVR algorithms (specifically GRPO) implicitly incentivize correct reasoning because:
- Correct reasoning chains produce correct answers with a higher probability (α) than incorrect chains do (β), i.e. α > β
- The GRPO advantage function therefore assigns, in expectation, a positive advantage to responses with correct reasoning
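To make the second point concrete: GRPO normalizes each response's reward against the other responses sampled for the same prompt, so with a binary verifiable reward the correct-answer responses sit above the group mean and receive positive advantage. Because correct reasoning yields a correct answer with probability α > β, correct-reasoning responses are, in expectation, the ones being pushed up. A toy sketch of the group-relative advantage (the numbers are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage used by GRPO: normalize each response's
    reward by the mean and standard deviation of its sampling group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Six responses to one prompt; reward 1 = verified-correct final answer.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # positive for the two rewarded responses
```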