Paper

Sampling from $p(x)^\alpha$ with $\alpha > 1$ (just sharpening the model's distribution, lowering entropy) performs nearly as well as RL, which explicitly tilts the distribution toward high-reward answers. The implication is that a lot of what RL "does" might just be entropy reduction — teaching the model to commit to its own best guesses — rather than actually teaching it new correct behaviors.

The factorization problem

Under the base model, $p(y) = \prod_t p(y_t \mid y_{<t})$. So:

$$p(y)^\alpha = \prod_t p(y_t \mid y_{<t})^\alpha$$

This looks like you could just sample each token at temperature $1/\alpha$. But you can't — because renormalizing each conditional independently gives a different distribution than renormalizing the whole product. The correct token-level conditional under $p^\alpha$ is:

$$q(y_t \mid y_{<t}) \propto p(y_t \mid y_{<t})^\alpha \cdot \underbrace{\sum_{\text{all futures } y_{>t}} p(y_{>t} \mid y_{\le t})^\alpha}_{Z(y_{\le t}) \text{ — intractable}}$$

That $Z$ term is a sum over every possible continuation. You can't compute it. So there are two strategies:
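A tiny numerical check makes the mismatch concrete. The toy base model below is invented (hand-picked conditionals over 2-token sequences, vocab $\{0, 1\}$); it just compares renormalizing the whole product $p(y)^\alpha$ against tempering each conditional independently at temperature $1/\alpha$:

```python
import itertools
import numpy as np

# Hypothetical toy base model: p(y1) and p(y2 | y1), hand-picked.
p_first = np.array([0.9, 0.1])
p_second = {0: np.array([0.5, 0.5]),
            1: np.array([0.99, 0.01])}

alpha = 4.0  # sharpening exponent

def p_seq(y):
    """Base-model probability of a full 2-token sequence."""
    return p_first[y[0]] * p_second[y[0]][y[1]]

seqs = list(itertools.product([0, 1], repeat=2))

# Correct target: renormalize the whole product p(y)^alpha over sequences.
target = np.array([p_seq(y) ** alpha for y in seqs])
target /= target.sum()

def temper(dist):
    """Renormalize a single conditional raised to alpha (temperature 1/alpha)."""
    d = dist ** alpha
    return d / d.sum()

# Naive sampler: temper each conditional independently.
naive = np.array([temper(p_first)[y[0]] * temper(p_second[y[0]])[y[1]]
                  for y in seqs])

for y, t, n in zip(seqs, target, naive):
    print(y, round(t, 4), round(n, 4))

# The two distributions disagree: the naive sampler ignores how sharp
# each prefix's continuation distribution is (the Z(y_<=t) factor).
print(np.allclose(target, naive))  # -> False
```

The gap shows up at prefix $y_1 = 1$: its continuations are already near-deterministic, so the true $p^\alpha$ gives that prefix more relative mass than per-token tempering does.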


Strategy 1: MCMC (prior work — Karan & Du 2025)

Don't compute $Z$ at all. Just run Metropolis-Hastings on full sequences:

  1. Sample an initial sequence $y$ from the base model.
  2. Propose a perturbation — typically pick a random position $t$ and resample the entire suffix from the base model.
  3. Compute the MH acceptance ratio. Since the proposal resamples from the base model $p$, the intractable normalizing constant cancels and the ratio reduces to $\left(p(y') / p(y)\right)^{\alpha - 1}$ — every factor is computable token by token.
  4. Accept or reject. Repeat.
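To spell out the cancellation in step 3 (a generic MH computation, conditioning on the chosen position $t$, so the proposal keeps the prefix and $y'_{<t} = y_{<t}$):

$$\frac{p(y')^\alpha \, q(y \mid y')}{p(y)^\alpha \, q(y' \mid y)} = \frac{p(y_{<t})^\alpha \, p(y'_{\ge t} \mid y_{<t})^\alpha \cdot p(y_{\ge t} \mid y_{<t})}{p(y_{<t})^\alpha \, p(y_{\ge t} \mid y_{<t})^\alpha \cdot p(y'_{\ge t} \mid y_{<t})} = \left(\frac{p(y'_{\ge t} \mid y_{<t})}{p(y_{\ge t} \mid y_{<t})}\right)^{\alpha - 1}$$

Since the prefix is shared, this equals $\left(p(y')/p(y)\right)^{\alpha - 1}$: only base-model likelihoods of the two sequences, no sum over futures.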

After enough steps, you're sampling from $p^\alpha$. It's provably correct but slow — each step regenerates a suffix, and you need many steps to mix. This is what the paper is trying to beat on speed.
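The loop above can be sketched end-to-end on a toy model where the exact target is tractable. Everything here is invented for illustration (a random table of next-token distributions standing in for an LLM); it is a sketch of the sampler, not the paper's implementation:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "base model": fixed-length sequences over a tiny vocab,
# with next-token conditionals drawn once at random and then frozen.
VOCAB, LENGTH, ALPHA = 3, 4, 4.0
_tables = {}

def cond(prefix):
    """Next-token distribution p(. | prefix) for the toy model."""
    if prefix not in _tables:
        w = rng.random(VOCAB) + 0.1
        _tables[prefix] = w / w.sum()
    return _tables[prefix]

def logp(seq):
    """log p(seq) under the base model."""
    return sum(np.log(cond(seq[:i])[seq[i]]) for i in range(len(seq)))

def resample_suffix(prefix):
    """Draw the remaining tokens autoregressively from the base model."""
    seq = list(prefix)
    while len(seq) < LENGTH:
        seq.append(rng.choice(VOCAB, p=cond(tuple(seq))))
    return tuple(seq)

# Metropolis-Hastings: propose by resampling a random suffix from p.
# The proposal terms cancel, so the log acceptance ratio is
# (alpha - 1) * (log p(y') - log p(y)).
y = resample_suffix(())
counts = {}
for step in range(20000):
    t = int(rng.integers(LENGTH))        # resample positions t onward
    y_new = resample_suffix(y[:t])
    if np.log(rng.random()) < (ALPHA - 1) * (logp(y_new) - logp(y)):
        y = y_new
    if step >= 1000:                     # crude burn-in
        counts[y] = counts.get(y, 0) + 1

# Exact target p^alpha for comparison (tractable only in a toy like this).
seqs = list(itertools.product(range(VOCAB), repeat=LENGTH))
target = np.array([np.exp(ALPHA * logp(s)) for s in seqs])
target /= target.sum()

mode = seqs[int(np.argmax(target))]
print("exact mode:", mode, "empirical share:",
      round(counts.get(mode, 0) / sum(counts.values()), 3))
```

Note the cost structure the paper objects to: every MH step regenerates a suffix (a full rollout in the LLM setting), and the chain needs many such steps before the empirical distribution matches $p^\alpha$.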


Strategy 2: Rollout estimation (this paper — Ji et al. 2026)