Paper

Claude

The paper is built as a series of controlled experiments that first demonstrate why every standard approach fails, then show POPE working, then ablate to understand why it works.

Setup

All experiments start from Qwen3-4B-Instruct as the base model. The authors curated a set of hard math reasoning problems from the DAPO, OmniMath (levels 5-8), and AceReason datasets on which the base model fails to produce a single successful rollout, even with large parallel sampling (k=128) and a large token budget (32k). This definition of "hard" matters: these are problems the starting model demonstrably cannot solve under a generous sampling budget, not merely low-accuracy ones.
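The curation rule above can be sketched as follows. This is a minimal illustration, not the paper's code: `sample_rollout` and `is_correct` are hypothetical stand-ins for the actual sampler (with the 32k token budget) and answer checker.

```python
from typing import Callable, List

def is_unsolved_at_k(problem: str,
                     sample_rollout: Callable[[str], str],
                     is_correct: Callable[[str, str], bool],
                     k: int = 128) -> bool:
    """True if none of k independent rollouts solves the problem (pass@k == 0)."""
    return not any(is_correct(problem, sample_rollout(problem)) for _ in range(k))

def curate_hard_set(problems: List[str],
                    sample_rollout: Callable[[str], str],
                    is_correct: Callable[[str, str], bool],
                    k: int = 128) -> List[str]:
    """Keep only problems the base model never solves under the k-sample budget."""
    return [p for p in problems
            if is_unsolved_at_k(p, sample_rollout, is_correct, k)]
```

In practice one would cache rollouts rather than resample per check, but the filtering criterion is the same.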


Experiment 1: Classical exploration bonuses fail

They ran on-policy RL on the hard set with two standard exploration add-ons:

  1. a token-level entropy bonus
  2. following DAPO, a more generous upper clipping threshold (eps_high) on the importance ratio in the PPO-style policy-gradient update, which lets the model update more aggressively on rare, off-policy rollouts
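The two add-ons can be sketched per token as below. This is a toy illustration, not the paper's training code: `ratio` is the importance ratio pi_theta / pi_old for a token, `adv` its advantage estimate, and the asymmetric clip range (eps_high > eps_low) is the DAPO-style "clip-higher" trick; all constants are illustrative.

```python
import math

def clipped_pg_term(ratio: float, adv: float,
                    eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style per-token objective with a wider upper clip (eps_high > eps_low),
    letting low-probability tokens with positive advantage be pushed up harder.
    This term is maximized; negate it for a loss."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv, clipped_ratio * adv)

def entropy_bonus(token_probs: list, beta: float = 0.001) -> float:
    """Token-level entropy bonus added to the objective to keep sampling diffuse."""
    entropy = -sum(p * math.log(p) for p in token_probs if p > 0.0)
    return beta * entropy
```

With a symmetric clip (eps_high = eps_low = 0.2), a token at ratio 1.5 with positive advantage contributes at most 1.2 * adv; raising eps_high to 0.28 raises that cap to 1.28 * adv, which is the sense in which the update is "more generous" to rare rollouts.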

Both the entropy bonus and the larger clip ratio eps_high drive the trained model's average token-level entropy to substantially higher values, yet all of these runs end up solving a similar number of problems, with no clear sign that the harder problems become more solvable. Entropy explodes and training stability collapses, but the model never actually reaches new correct solutions.


Experiment 2: Curriculum / easy-to-hard transfer fails (ray interference)

They mixed easy problems into the hard training set, hoping the learned behaviors would transfer. With easy problems mixed in, the model's pass@k (solvability) on the hard training set rises faster, but plateaus at a lower asymptote than training on the hard set alone: adding easy data actually lowers the final ceiling. A similar trend appears when slightly easier problems are mixed in instead ("hard + easier"); in fact, this run solves even fewer problems than "hard + easy" during training.

The diagnosis is ray interference: gradient updates from easy problems (where reward is plentiful) dominate and actively push the policy away from directions needed for hard problems.
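A toy numeric picture of this diagnosis, with invented vectors rather than measured gradients: the easy-problem gradient is large (reward is plentiful there) while the hard-problem gradient is tiny, and if the two conflict, the mixed-batch update can have negative projection onto the hard-problem direction, i.e. it moves the policy away from solving the hard set even as easy-set reward climbs.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

g_easy = (1.0, -1.0)    # dominant gradient from reward-rich easy problems
g_hard = (0.05, 0.10)   # weak gradient from the rarely rewarded hard problems

# One mixed-batch update sums the two contributions.
g_mixed = tuple(e + h for e, h in zip(g_easy, g_hard))

progress_hard_only = dot(g_hard, g_hard)   # always positive
progress_mixed = dot(g_mixed, g_hard)      # negative here: interference
```

Training on the hard set alone always makes (slow) progress along g_hard; in this toy configuration the mixed update has negative projection onto it, matching the observed lower asymptote.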


Experiment 3: SFT warmstart also fails

They prompted the base model with a Gemini-generated partial solution, sampled many responses conditioned on this guidance, filtered them to retain only correct traces, performed SFT on this filtered data, and used the resulting model as initialization for RL.
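The pipeline's data-collection step can be sketched as below. This is a minimal sketch, not the paper's code: `generate` and `is_correct` stand in for the actual sampler and verifier, and pairing each correct trace with the *unhinted* problem (so the warmstarted model learns to solve without the hint) is an assumption about how the SFT pairs were formed.

```python
from typing import Callable, List, Tuple

def build_sft_data(problem: str,
                   partial_solution: str,          # e.g. a Gemini-generated hint
                   generate: Callable[[str], str],  # sampler conditioned on a prompt
                   is_correct: Callable[[str, str], bool],
                   n_samples: int = 64) -> List[Tuple[str, str]]:
    """Sample n_samples responses conditioned on the hint; keep only correct traces,
    each paired with the original problem as an SFT (input, target) example."""
    prompt = f"{problem}\n\nPartial solution:\n{partial_solution}"
    traces = (generate(prompt) for _ in range(n_samples))
    return [(problem, t) for t in traces if is_correct(problem, t)]
```

The resulting pairs are used for supervised fine-tuning, and the fine-tuned model then initializes the RL run.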