The paper is built as a series of controlled experiments that first demonstrate why every standard approach fails, then show POPE working, then ablate to understand why it works.
All experiments start from Qwen3-4B-Instruct as the base model. They curated a set of hard math reasoning problems from the DAPO, OmniMath (levels 5-8), and AceReason datasets on which the base model fails to produce a single successful rollout even with large parallel sampling (k=128) and a generous token budget (32k). This definition of "hard" matters: these are problems the starting model demonstrably cannot solve under a substantial sampling budget, not merely low-accuracy ones.
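The curation criterion is just "empirical pass@k equals zero." A minimal sketch of that filter, with the sampling and answer-checking functions injected as callables (their names and signatures are my assumptions, not the paper's code):

```python
def is_hard(prompt, sample_fn, check_fn, k=128):
    """Label a problem 'hard' if none of k independent rollouts from the
    base model is correct, i.e. empirical pass@k is exactly zero.
    sample_fn and check_fn are assumed helpers: sample_fn(prompt) draws one
    rollout; check_fn(trace) verifies the final answer."""
    return not any(check_fn(sample_fn(prompt)) for _ in range(k))


def curate_hard_set(problems, sample_fn, check_fn, k=128):
    """Keep only the problems the base model never solves in k tries."""
    return [p for p in problems if is_hard(p, sample_fn, check_fn, k=k)]
```

In practice the k rollouts would be batched in parallel rather than looped, but the acceptance rule is the same.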
They ran on-policy RL on the hard set with two standard exploration add-ons: an entropy bonus, and a larger upper clip ratio eps_high. Both raise the trained model's average token-level entropy to substantially larger values, yet both end up solving a similar (small) number of problems, with no sign that the harder problems become any more solvable. Entropy explodes and training stability collapses, but the model never actually reaches new correct solutions.
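For concreteness, here is a minimal NumPy sketch of where those two knobs live in a clipped policy-gradient surrogate: an asymmetric clip range ("clip-higher", eps_high > eps_low) and an entropy bonus weighted by ent_coef. The function and parameter names are mine, and this is the generic PPO-style objective, not the paper's exact implementation:

```python
import numpy as np

def clipped_pg_loss(ratio, adv, eps_low=0.2, eps_high=0.28,
                    ent_coef=0.0, entropy=None):
    """Per-token clipped surrogate loss (to be minimized).

    ratio:   pi_new(a|s) / pi_old(a|s) per token
    adv:     advantage estimates per token
    eps_high > eps_low widens the upward clip, giving low-probability
    tokens more room to be up-weighted (one exploration add-on);
    ent_coef * entropy is the entropy bonus (the other add-on).
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    loss = -surrogate.mean()
    if entropy is not None:
        loss -= ent_coef * entropy.mean()  # bonus: higher entropy -> lower loss
    return loss
```

Both knobs raise entropy; neither creates a learning signal on problems where every rollout has zero reward, which is the point the experiments make.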
The diagnosis is ray interference: gradient updates from easy problems (where reward is plentiful) dominate and actively push the policy away from directions needed for hard problems.
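One way to make this diagnosis quantitative (my illustration, not a measurement from the paper) is to compare the update direction contributed by easy problems against the direction that would help on hard problems; a negative cosine similarity means the easy-problem gradient, which dominates the summed update because easy problems supply almost all the reward, actively moves the policy away from hard-problem solutions:

```python
import numpy as np

def grad_conflict(g_easy, g_hard):
    """Cosine similarity between the (dominant) easy-problem gradient and
    the hard-problem gradient; values below zero indicate interference."""
    denom = np.linalg.norm(g_easy) * np.linalg.norm(g_hard)
    return float(g_easy @ g_hard / denom)
```

When this quantity is negative, more RL steps on the mixed distribution make the hard problems strictly worse, which matches the observed stagnation.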
They prompted the base model with a Gemini-generated partial solution, sampled many responses conditioned on this guidance, filtered them to retain only correct traces, performed SFT on this filtered data, and used the resulting model as initialization for RL.
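The warm-start recipe can be sketched as a short pipeline. All helper names here are assumptions, and whether the Gemini hint is kept in or stripped from the SFT prompt is a detail the summary above does not pin down; this sketch strips it, so the hint is distilled away rather than required at test time:

```python
def build_sft_init_data(problems, sample_fn, hint_for, check_fn, k=64):
    """Hint-conditioned rejection sampling for SFT warm-start.

    problems: iterable of (prompt, answer) pairs
    hint_for: assumed helper returning a Gemini-generated partial solution
    sample_fn / check_fn: assumed rollout and answer-verification helpers
    """
    sft_data = []
    for prompt, answer in problems:
        hinted = prompt + "\n\nPartial solution:\n" + hint_for(prompt)
        for _ in range(k):
            trace = sample_fn(hinted)          # sample conditioned on the hint
            if check_fn(trace, answer):        # keep only verified-correct traces
                # ASSUMPTION: pair the trace with the un-hinted prompt.
                sft_data.append((prompt, trace))
    return sft_data
```

The model fine-tuned on this filtered data then serves as the initialization for the subsequent RL run.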