- Sequential revisions (best for easier math problems, where the model refines and learns from each prior attempt; can beat parallel sampling) vs. parallel revisions (better for hard problems, which need fundamentally different approaches)
- Different search algorithms using the verifier:
  - Best-of-N: generate N complete solutions, pick the best one
  - Beam search: build solutions step-by-step, keeping only the top-scoring candidates at each step
  - Lookahead search: like beam search, but scores each candidate by rolling out k steps ahead
- The compute-optimal strategy applies to BOTH mechanisms:
  - For revisions: choose the ratio of sequential to parallel sampling based on difficulty
  - For PRM search: choose between best-of-N and beam search based on difficulty
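The contrast between best-of-N and beam search above can be sketched with a toy generator and a stand-in scoring function. Everything here is illustrative: `propose_step` and `verifier_score` are hypothetical placeholders, not the paper's sampler or PRM.

```python
import random

random.seed(0)

# Toy setup: a "step" is just a number, and a solution is a list of steps.
# verifier_score is a hypothetical process-reward stand-in that prefers
# steps close to a hidden target value.
TARGET = 7

def propose_step():
    return random.randint(0, 10)

def verifier_score(partial_solution):
    # Higher is better: negative distance of the latest step from the target.
    return -abs(partial_solution[-1] - TARGET)

def best_of_n(n, steps=3):
    """Sample n complete solutions independently, keep the best-scoring one."""
    candidates = [[propose_step() for _ in range(steps)] for _ in range(n)]
    return max(candidates, key=verifier_score)

def beam_search(beam_width, expansions, steps=3):
    """Grow solutions step by step, pruning to the top beam_width each step."""
    beams = [[]]
    for _ in range(steps):
        extended = [b + [propose_step()] for b in beams for _ in range(expansions)]
        extended.sort(key=verifier_score, reverse=True)
        beams = extended[:beam_width]
    return beams[0]

print(best_of_n(8))
print(beam_search(beam_width=4, expansions=4))
```

The key structural difference: best-of-N only consults the verifier on finished solutions, while beam search uses the step-level scores to prune mid-generation.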

Core Technique: Compute-Optimal Test-Time Scaling
The key insight is that different problems benefit from different test-time computation strategies, and by adaptively choosing the right strategy based on problem difficulty, we can outperform a fixed strategy like best-of-N while using substantially less compute.
The Two Main Mechanisms:
- Improving the Proposal Distribution (Revisions)
  - Train the model to iteratively revise its own answers
  - Each revision is conditioned on previous attempts
  - Like having the model "think again" about its answer
- Optimizing with Verifiers (Search)
  - Use Process Reward Models (PRMs) that score each step of a solution
  - Apply search algorithms (best-of-N, beam search, etc.) to find better solutions
  - Like exploring different solution paths and picking the best one
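The revision mechanism above is essentially a loop in which each new attempt sees all prior attempts. A minimal sketch, where `revise_answer` is a hypothetical stand-in for the fine-tuned revision model (here it just nudges a numeric guess toward a known answer to show the loop's shape):

```python
def revise_answer(question, previous_attempts):
    # Hypothetical stand-in for the revision model: conditioned on all
    # prior attempts, it moves halfway from the last guess to the answer.
    guess = previous_attempts[-1] if previous_attempts else 0.0
    return guess + 0.5 * (question["answer"] - guess)

def sequential_revisions(question, num_revisions):
    """Each new attempt is conditioned on the full history of attempts."""
    attempts = []
    for _ in range(num_revisions):
        attempts.append(revise_answer(question, attempts))
    return attempts

q = {"prompt": "toy problem", "answer": 10.0}
print(sequential_revisions(q, 4))  # → [5.0, 7.5, 8.75, 9.375]
```

In the real setup the history is passed in-context to the model rather than as a list of numbers, but the conditioning structure is the same.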
The Adaptive Strategy:
The paper's key contribution is showing that:
- Easy problems → Sequential revisions work best (the model just needs to refine its initial approach)
- Medium problems → A mix of revisions and parallel sampling
- Hard problems → Parallel search with verifiers (need to explore fundamentally different approaches)
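One way to picture the adaptive strategy above is a small dispatcher that splits a fixed sample budget between sequential revisions and parallel chains based on an estimated difficulty. The difficulty bins and ratios below are illustrative placeholders, not the paper's fitted values:

```python
def allocate_budget(total_budget, difficulty):
    """Split a fixed sample budget between parallel chains and sequential
    revisions per chain, given a difficulty estimate in [0, 1].
    Ratios are illustrative: easy -> mostly sequential, hard -> mostly parallel."""
    if difficulty < 0.33:        # easy: refine a single chain
        sequential_ratio = 1.0
    elif difficulty < 0.66:      # medium: mix of both
        sequential_ratio = 0.5
    else:                        # hard: explore many distinct chains
        sequential_ratio = 0.25
    num_parallel = max(1, round(total_budget * (1 - sequential_ratio)))
    revisions_per_chain = max(1, total_budget // num_parallel)
    return num_parallel, revisions_per_chain

print(allocate_budget(16, 0.2))  # → (1, 16): one chain, many revisions
print(allocate_budget(16, 0.5))  # → (8, 2)
print(allocate_budget(16, 0.9))  # → (12, 1): many chains, no refinement
```

The total compute stays roughly constant; only its shape (depth of refinement vs. breadth of exploration) changes with difficulty.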
Concrete Example: Math Problem Solving