- Sequential revisions (best for easier math problems, where the model refines and learns from each prior attempt; can beat parallel sampling) vs. parallel revisions (better for hard problems, which need fundamentally different approaches)
- Different search algorithms using the verifier:
  - Best-of-N: generate N complete solutions, pick the best one
  - Beam search: build solutions step-by-step, keeping only the top-scoring candidates at each step
  - Lookahead search: like beam search, but scores each candidate by rolling out k steps ahead
- The compute-optimal strategy applies to BOTH mechanisms:
  - For revisions: choose the ratio of sequential to parallel sampling based on difficulty
  - For PRM search: choose between best-of-N and beam search based on difficulty
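The contrast between best-of-N and beam search above can be sketched with a toy generator and a stand-in scoring function. Everything here is illustrative: `propose_step` and `verifier_score` are hypothetical placeholders, not the paper's sampler or PRM.

```python
import random

random.seed(0)

# Toy setup: a "step" is just a number, and a solution is a list of steps.
# verifier_score is a hypothetical process-reward stand-in that prefers
# steps close to a hidden target value.
TARGET = 7

def propose_step():
    return random.randint(0, 10)

def verifier_score(partial_solution):
    # Higher is better: negative distance of the latest step from the target.
    return -abs(partial_solution[-1] - TARGET)

def best_of_n(n, steps=3):
    """Sample n complete solutions independently, keep the best-scoring one."""
    candidates = [[propose_step() for _ in range(steps)] for _ in range(n)]
    return max(candidates, key=verifier_score)

def beam_search(beam_width, expansions, steps=3):
    """Grow solutions step by step, pruning to the top beam_width each step."""
    beams = [[]]
    for _ in range(steps):
        extended = [b + [propose_step()] for b in beams for _ in range(expansions)]
        extended.sort(key=verifier_score, reverse=True)
        beams = extended[:beam_width]
    return beams[0]

print(best_of_n(8))
print(beam_search(beam_width=4, expansions=4))
```

The key structural difference: best-of-N only consults the verifier on finished solutions, while beam search uses the step-level scores to prune mid-generation.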

Core Technique: Compute-Optimal Test-Time Scaling
The key insight is that different problems benefit from different test-time computation strategies, and by adaptively choosing the right strategy based on problem difficulty, we can outperform a fixed strategy like best-of-N while using substantially less compute.
The Two Main Mechanisms:
- Improving the Proposal Distribution (Revisions)
  - Train the model to iteratively revise its own answers
  - Each revision is conditioned on previous attempts
  - Like having the model "think again" about its answer
- Optimizing with Verifiers (Search)
  - Use Process Reward Models (PRMs) that score each step of a solution
  - Apply search algorithms (best-of-N, beam search, etc.) to find better solutions
  - Like exploring different solution paths and picking the best one
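The revision mechanism above is essentially a loop in which each new attempt sees all prior attempts. A minimal sketch, where `revise_answer` is a hypothetical stand-in for the fine-tuned revision model (here it just nudges a numeric guess toward a known answer to show the loop's shape):

```python
def revise_answer(question, previous_attempts):
    # Hypothetical stand-in for the revision model: conditioned on all
    # prior attempts, it moves halfway from the last guess to the answer.
    guess = previous_attempts[-1] if previous_attempts else 0.0
    return guess + 0.5 * (question["answer"] - guess)

def sequential_revisions(question, num_revisions):
    """Each new attempt is conditioned on the full history of attempts."""
    attempts = []
    for _ in range(num_revisions):
        attempts.append(revise_answer(question, attempts))
    return attempts

q = {"prompt": "toy problem", "answer": 10.0}
print(sequential_revisions(q, 4))  # → [5.0, 7.5, 8.75, 9.375]
```

In the real setup the history is passed in-context to the model rather than as a list of numbers, but the conditioning structure is the same.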
The Adaptive Strategy:
The paper's key contribution is showing that:
- Easy problems → Sequential revisions work best (the model just needs to refine its initial approach)
- Medium problems → A mix of revisions and parallel sampling
- Hard problems → Parallel search with verifiers (need to explore fundamentally different approaches)
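One way to picture the adaptive strategy above is a small dispatcher that splits a fixed sample budget between sequential revisions and parallel chains based on an estimated difficulty. The difficulty bins and ratios below are illustrative placeholders, not the paper's fitted values:

```python
def allocate_budget(total_budget, difficulty):
    """Split a fixed sample budget between parallel chains and sequential
    revisions per chain, given a difficulty estimate in [0, 1].
    Ratios are illustrative: easy -> mostly sequential, hard -> mostly parallel."""
    if difficulty < 0.33:        # easy: refine a single chain
        sequential_ratio = 1.0
    elif difficulty < 0.66:      # medium: mix of both
        sequential_ratio = 0.5
    else:                        # hard: explore many distinct chains
        sequential_ratio = 0.25
    num_parallel = max(1, round(total_budget * (1 - sequential_ratio)))
    revisions_per_chain = max(1, total_budget // num_parallel)
    return num_parallel, revisions_per_chain

print(allocate_budget(16, 0.2))  # → (1, 16): one chain, many revisions
print(allocate_budget(16, 0.5))  # → (8, 2)
print(allocate_budget(16, 0.9))  # → (12, 1): many chains, no refinement
```

The total compute stays roughly constant; only its shape (depth of refinement vs. breadth of exploration) changes with difficulty.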
Concrete Example: Math Problem Solving