Chat

This paper introduces a fascinating perspective on how Reinforcement Learning with Verifiable Rewards (RLVR) actually works. Let me break down the core technique and insight:

The Core Problem

The researchers address a paradox: when language models are trained with RLVR (as in OpenAI's o1 or DeepSeek-R1), they often score worse on Pass@K (the probability of producing at least one correct answer in K attempts), particularly at large K, than their base models. This led many to conclude that RLVR doesn't actually improve reasoning; it merely reshuffles capabilities the base model already had.
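
To make the metric concrete, here is the standard unbiased Pass@K estimator commonly used in code- and math-evaluation work: given n sampled attempts of which c end in a correct final answer, it estimates the probability that a random subset of K attempts contains at least one correct one. This is a general sketch of the metric itself, not code from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples, c of which have a correct final answer.

    Pass@K = 1 - C(n - c, k) / C(n, k): the probability that at least one of
    k randomly chosen samples is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples, 4 with correct answers, evaluated at K = 8
print(pass_at_k(n=16, c=4, k=8))  # ≈ 0.96
```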

The Key Insight: It's About Reasoning Quality, Not Just Answers

The authors argue that Pass@K is a flawed metric because it only checks whether the final answer is correct, not whether the reasoning is sound. They found that base LLMs often reach correct answers through lucky guesses or logically flawed chains of thought, so a correct final answer by itself does not demonstrate correct reasoning.

The Core Technique: CoT-Pass@K

They introduce a new evaluation metric called CoT-Pass@K that requires BOTH:

  1. The chain-of-thought (reasoning process) to be correct
  2. The final answer to be correct
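
Assuming per-sample correctness labels for both the final answer and the chain of thought (the `Sample` fields below are illustrative names, not the paper's code), the Pass@K estimator can be adapted so that only samples satisfying both conditions count as successes:

```python
from dataclasses import dataclass
from math import comb

@dataclass
class Sample:
    answer_correct: bool  # final answer matches the ground truth
    cot_correct: bool     # chain of thought judged logically sound and complete

def cot_pass_at_k(samples: list[Sample], k: int) -> float:
    """CoT-Pass@K estimate: a sample counts only if BOTH its reasoning
    and its final answer are correct."""
    n = len(samples)
    c = sum(s.answer_correct and s.cot_correct for s in samples)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this stricter counting, correct answers reached through unsound reasoning no longer inflate the score.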

How They Implement It:

  1. Automated Verification: They use a powerful LLM (DeepSeek-R1) as a judge to verify if reasoning chains are logically sound and complete
  2. Multiple Verification Strategies: To handle potential errors from the LLM judge, they verify each chain of thought several times and combine the repeated verdicts using three aggregation strategies
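
As a rough sketch of the verification loop, assume a generic `query_judge` callable that sends a prompt to the judge model (e.g., DeepSeek-R1) and returns its text reply. The prompt wording, the YES/NO protocol, and the majority-vote aggregation shown here are illustrative assumptions rather than the paper's exact strategies.

```python
from typing import Callable

def verify_cot(
    problem: str,
    chain_of_thought: str,
    query_judge: Callable[[str], str],  # stands in for a call to the judge LLM
    num_checks: int = 3,
) -> bool:
    """Ask the judge model several times whether the reasoning is sound,
    then aggregate the verdicts (majority vote shown as one possible rule)."""
    prompt = (
        "You are grading a solution's reasoning, not just its final answer.\n"
        f"Problem:\n{problem}\n\n"
        f"Reasoning:\n{chain_of_thought}\n\n"
        "Is every step logically sound and is the argument complete? "
        "Reply with exactly YES or NO."
    )
    verdicts = []
    for _ in range(num_checks):
        reply = query_judge(prompt).strip().upper()
        verdicts.append(reply.startswith("YES"))
    return sum(verdicts) > num_checks / 2  # majority of repeated checks must agree
```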

The Mathematical Foundation

The paper provides a theorem showing that RLVR algorithms (specifically GRPO) implicitly incentivize correct reasoning because: