Core Technique: Power Sampling via MCMC

This paper proposes a training-free method to improve reasoning in base language models by sampling from a "sharpened" version of the model's distribution using MCMC.

The Key Insight

The authors observe that RL-finetuned reasoning models (like those trained with GRPO) don't actually learn new capabilities—they just learn to sample from the high-likelihood regions of the base model more consistently. The paper asks: can we achieve the same effect without any training, just through smarter sampling?

The Target: Power Distributions

Instead of sampling from the base model's distribution p(x), they sample from the distribution proportional to p(x)^α with α > 1. This "power distribution" amplifies the probability of already-likely sequences and suppresses unlikely ones.
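
As a quick numeric illustration (a toy sketch, not from the paper), raising a distribution to a power α > 1 and renormalizing pushes probability mass toward the already-likely outcomes:

```python
import numpy as np

# Toy sketch: sharpen a distribution by raising it to a power alpha > 1.
# The three values stand in for the probabilities of three full sequences.
p = np.array([0.6, 0.3, 0.1])
alpha = 2.0

p_alpha = p ** alpha
p_alpha /= p_alpha.sum()   # renormalize so the powered values form a distribution

print(p)                   # [0.6 0.3 0.1]
print(p_alpha.round(2))    # [0.78 0.2  0.02] -> mass shifts toward the likeliest outcome
```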

Why power distributions beat low-temperature sampling:

The critical difference is what gets sharpened: low-temperature sampling sharpens each next-token conditional independently, while power sampling sharpens the joint distribution over complete sequences.

This matters because of "pivotal tokens." Consider this toy example from the paper:

Sequence    p(sequence)
aa          0.00
ab          0.40
ba          0.25
bb          0.25
The marginal probabilities are p(a)=0.40 and p(b)=0.50, so low-temperature sampling prefers starting with "b." But the best full sequence is "ab" with probability 0.40!

Power sampling correctly recognizes this: it prefers tokens that lead to high-likelihood complete sequences, even if those tokens have lower individual probability. This is essentially implicit lookahead planning.
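
A short sketch (illustrative code, not the paper's implementation) makes the toy example concrete: sharpening the first-token marginals, as low temperature does, still prefers "b", while sharpening the joint distribution over complete sequences concentrates mass on "ab":

```python
import numpy as np

# Toy example from above, in code (illustrative sketch only).
seqs = ["aa", "ab", "ba", "bb"]
p = np.array([0.00, 0.40, 0.25, 0.25])            # p(sequence)

# Low-temperature sampling sharpens the first-token marginal p(x1).
marginal = {"a": p[0] + p[1], "b": p[2] + p[3]}    # p(a)=0.40, p(b)=0.50
beta = 4.0                                         # inverse temperature
sharpened = {t: m ** beta for t, m in marginal.items()}
Z = sum(sharpened.values())
print({t: round(v / Z, 2) for t, v in sharpened.items()})   # prefers "b" (~0.71 vs ~0.29)

# Power sampling sharpens the joint distribution over complete sequences.
alpha = 4.0
p_alpha = p ** alpha
p_alpha /= p_alpha.sum()
print(dict(zip(seqs, p_alpha.round(2))))           # mass concentrates on "ab" (~0.77)
```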

The Algorithm: Autoregressive MCMC

Since directly sampling from p^α is intractable, they use Metropolis-Hastings MCMC:
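
Concretely, if candidates are proposed by sampling whole sequences from the base model (an independence proposal, which is what step 1 of the loop below suggests), the standard Metropolis-Hastings acceptance probability for the target p(x)^α reduces to min(1, (p(x')/p(x))^(α-1)). Here is a minimal sketch under that assumption; the paper's actual proposal and implementation details may differ, and sample_from_base_model is a hypothetical helper:

```python
import math
import random

def mh_accept(logp_current: float, logp_candidate: float, alpha: float) -> bool:
    """Accept/reject for a Metropolis-Hastings chain targeting p(x)^alpha.

    Assumes an independence proposal q(x') = p(x'), i.e. candidates are full
    sequences sampled autoregressively from the base model, so the acceptance
    ratio reduces to (p(x') / p(x)) ** (alpha - 1), computed here in log space.
    """
    log_ratio = (alpha - 1.0) * (logp_candidate - logp_current)
    if log_ratio >= 0.0:
        return True
    return random.random() < math.exp(log_ratio)

# Sketch of how the acceptance rule would be used (sample_from_base_model is
# a hypothetical helper returning a sequence and its base-model log-likelihood):
# current, logp_current = sample_from_base_model(prompt)
# for _ in range(num_steps):
#     candidate, logp_candidate = sample_from_base_model(prompt)
#     if mh_accept(logp_current, logp_candidate, alpha=4.0):
#         current, logp_current = candidate, logp_candidate
```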

The loop:

  1. Generate a candidate sequence by autoregressively sampling from the base model