This paper proposes a training-free method to improve reasoning in base language models by sampling from a "sharpened" version of the model's distribution using MCMC.
The authors observe that RL-finetuned reasoning models (like those trained with GRPO) don't actually learn new capabilities; they mostly learn to sample more consistently from the high-likelihood regions of the base model's own distribution. The paper asks: can we achieve the same effect without any training, just through smarter sampling?
Instead of sampling from the base model's distribution p(x), they sample from the distribution proportional to p(x)^α with α > 1. This "power distribution" amplifies the probability of already-likely sequences and suppresses unlikely ones.
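As a quick illustration (the numbers below are made up, not from the paper), raising a distribution to a power α > 1 and renormalizing pushes probability mass toward the already-likely outcomes much more aggressively than the base distribution:

```python
import numpy as np

def power_sharpen(p, alpha):
    """Return the renormalized power distribution p(x)^alpha / sum_y p(y)^alpha."""
    q = np.asarray(p, dtype=float) ** alpha
    return q / q.sum()

# Hypothetical probabilities of four candidate sequences under the base model.
p = np.array([0.50, 0.30, 0.15, 0.05])

print(power_sharpen(p, 1.0))  # unchanged: the base distribution
print(power_sharpen(p, 4.0))  # ~[0.88, 0.11, 0.007, 0.0001]: mass concentrates on the top sequence
```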
Why power distributions beat low-temperature sampling:
The critical difference is what gets sharpened: low-temperature sampling sharpens each per-token conditional p(x_t | x_<t) independently, while the power distribution sharpens the probability of the entire sequence.
This matters because of "pivotal tokens." Consider this toy example from the paper:
| Sequence | p(sequence) |
|---|---|
| aa | 0.00 |
| ab | 0.40 |
| ba | 0.25 |
| bb | 0.25 |
The first-token marginal probabilities are p(a) = 0.40 and p(b) = 0.50, so low-temperature sampling prefers starting with "b." But the highest-probability complete sequence is "ab," with probability 0.40, and it starts with "a."
Power sampling correctly recognizes this: it prefers tokens that lead to high-likelihood complete sequences, even if those tokens have lower individual probability. This is essentially implicit lookahead planning.
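A quick numerical check of the toy example, comparing the two kinds of sharpening (a minimal sketch; the sequence probabilities are the ones in the table above, everything else is illustrative):

```python
import numpy as np

seqs = ["aa", "ab", "ba", "bb"]
p = np.array([0.00, 0.40, 0.25, 0.25])  # sequence probabilities from the toy example

# Per-token view (what low temperature sharpens): marginals of the first token.
p_first = {"a": p[0] + p[1], "b": p[2] + p[3]}  # {'a': 0.40, 'b': 0.50}
# Sharpening these marginals only widens the 0.50 vs 0.40 gap, so "b" stays preferred.

# Sequence-level view (what the power distribution sharpens).
alpha = 4.0
p_alpha = p ** alpha
p_alpha /= p_alpha.sum()
print(seqs[int(np.argmax(p_alpha))])  # "ab": the pivotal token "a" wins at the sequence level
print(p_alpha.round(3))               # ~[0.0, 0.766, 0.117, 0.117]
```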
Since directly sampling from p^α is intractable (the normalizing constant would require summing over all possible sequences, and p^α does not factorize into simple per-token conditionals), they use Metropolis-Hastings MCMC:
The loop: start from a completion sampled from the base model, repeatedly propose a modified sequence, and accept or reject each proposal using the Metropolis-Hastings ratio, which only requires unnormalized values of p^α (base-model likelihoods raised to the power α).
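A minimal sketch of what such a loop can look like, assuming a proposal that keeps a random prefix of the current completion and resamples the rest from the base model; the base_model helpers (sample, seq_logprob) are hypothetical placeholders, and the paper's actual proposal and schedule may differ:

```python
import math
import random

def mh_power_sampling(base_model, prompt, alpha=4.0, steps=100):
    """Metropolis-Hastings chain targeting p(x)^alpha over full completions.

    Hypothetical helpers assumed on base_model:
      - sample(prefix)           -> a completion drawn from the base model
      - seq_logprob(prefix, txt) -> log p(txt | prefix) under the base model
    Completions are treated as plain strings here for simplicity.
    """
    x = base_model.sample(prompt)                 # initial state: one base-model sample
    logp_x = base_model.seq_logprob(prompt, x)

    for _ in range(steps):
        # Propose: keep a random prefix of x, resample the suffix from the base model.
        cut = random.randrange(len(x))
        suffix_new = base_model.sample(prompt + x[:cut])
        x_new = x[:cut] + suffix_new
        logp_new = base_model.seq_logprob(prompt, x_new)

        # MH acceptance for target p^alpha with this proposal:
        #   alpha * (log p(x') - log p(x)) + log q(x | x') - log q(x' | x),
        # where q is the base-model probability of the resampled suffix.
        # (This ignores the small correction from the cut-position choice
        #  when the old and new completions differ in length.)
        log_q_fwd = base_model.seq_logprob(prompt + x[:cut], suffix_new)
        log_q_bwd = base_model.seq_logprob(prompt + x[:cut], x[cut:])
        log_accept = alpha * (logp_new - logp_x) + (log_q_bwd - log_q_fwd)

        if random.random() < math.exp(min(0.0, log_accept)):
            x, logp_x = x_new, logp_new           # accept; otherwise keep the current x

    return x
```

With large α the chain spends most of its time on high-likelihood completions, which is the "sharpened" behavior the paper attributes to RL-finetuned models, obtained here purely at inference time.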