Sampling from the tempered distribution $p(y)^\alpha$ with $\alpha > 1$ (just sharpening the model's distribution, lowering its entropy) performs nearly as well as RL, which explicitly tilts the distribution toward high-reward answers. The implication is that much of what RL "does" may just be entropy reduction (teaching the model to commit to its own best guesses) rather than teaching it new correct behaviors.
The factorization problem
Under the base model, $p(y) = \prod_t p(y_t \mid y_{<t})$. So:
$$p(y)^\alpha = \prod_t p(y_t \mid y_{<t})^\alpha$$
This looks like you could just sample each token at temperature $1/\alpha$. But you can't — because renormalizing each conditional independently gives a different distribution than renormalizing the whole product. The correct token-level conditional under $p^\alpha$ is:
$$q(y_t \mid y_{<t}) \propto p(y_t \mid y_{<t})^\alpha \cdot \underbrace{\sum_{\text{all futures } y_{>t}} p(y_{>t} \mid y_{\le t})^\alpha}_{Z(y_{\le t}) \text{ (intractable)}}$$
That $Z$ term is a sum over every possible continuation. You can't compute it. So there are two strategies:
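To make the mismatch concrete, here is a toy two-token model (all probability values invented for illustration) comparing naive per-token tempering against exact sequence-level tempering:

```python
import itertools

# Toy two-token base model over {0, 1}: p(y1) and p(y2 | y1).
# The numbers are made up; the only point is the contrast below.
p1 = {0: 0.9, 1: 0.1}
p2 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.99, 1: 0.01}}

alpha = 2.0
seqs = list(itertools.product([0, 1], repeat=2))

# Exact tempering: raise the WHOLE sequence probability to alpha,
# then renormalize once over all sequences.
w = {(a, b): (p1[a] * p2[a][b]) ** alpha for a, b in seqs}
Z = sum(w.values())
true_q = {s: w[s] / Z for s in seqs}

# Naive per-token tempering: renormalize each conditional independently
# (i.e., sample each token at temperature 1/alpha).
def temper(dist):
    pw = {k: v ** alpha for k, v in dist.items()}
    z = sum(pw.values())
    return {k: v / z for k, v in pw.items()}

q1 = temper(p1)
q2 = {a: temper(p2[a]) for a in (0, 1)}
naive_q = {(a, b): q1[a] * q2[a][b] for a, b in seqs}
```

In this toy, the exact tempered distribution puts roughly twice as much mass on the prefix $y_1 = 1$ as the naive scheme does, because the $Z(y_{\le t})$ factor rewards prefixes whose continuations are confident.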
Strategy 1: MCMC (prior work — Karan & Du 2025)
Don't compute $Z$ at all; instead, run Metropolis-Hastings directly on full sequences.
After enough steps, you're sampling from $p^\alpha$. It's provably correct but slow — each step regenerates a suffix, and you need many steps to mix. This is what the paper is trying to beat on speed.
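A minimal sketch of this strategy on a toy autoregressive model (the model and its conditionals are invented for illustration; the actual method runs on an LLM): propose by regenerating a suffix from the base model, and accept with probability $\min\big(1,\ (p(\text{new suffix} \mid \text{prefix}) / p(\text{old suffix} \mid \text{prefix}))^{\alpha - 1}\big)$, since the proposal terms cancel against the target $p^\alpha$.

```python
import itertools
import math
import random

# Toy autoregressive base model over binary sequences of fixed length.
# cond(prefix) returns P(next token = 1 | prefix); values are invented.
def cond(prefix):
    return 0.3 + 0.4 * (sum(prefix) % 2)

def logp(y):
    """Log-probability of a full sequence under the base model."""
    lp = 0.0
    for t, tok in enumerate(y):
        p1 = cond(y[:t])
        lp += math.log(p1 if tok == 1 else 1.0 - p1)
    return lp

def sample_suffix(prefix, length):
    """Extend prefix to full length by sampling from the base model."""
    y = list(prefix)
    while len(y) < length:
        y.append(1 if random.random() < cond(y) else 0)
    return tuple(y)

def exact_tempered(alpha, length):
    """Ground truth by enumeration: renormalize p(y)^alpha over all sequences."""
    seqs = list(itertools.product([0, 1], repeat=length))
    w = {s: math.exp(alpha * logp(s)) for s in seqs}
    Z = sum(w.values())
    return {s: v / Z for s, v in w.items()}

def mh_tempered(alpha, length=3, steps=50000, seed=0):
    """Metropolis-Hastings targeting p^alpha via suffix regeneration."""
    random.seed(seed)
    y = sample_suffix((), length)
    counts = {}
    for _ in range(steps):
        t = random.randrange(length)           # position to resample from
        y_new = sample_suffix(y[:t], length)   # regenerate suffix from p
        # The prefix and proposal terms cancel, leaving an acceptance
        # ratio of (p(new suffix|prefix) / p(old suffix|prefix))^(alpha-1);
        # the shared-prefix part of logp cancels in the difference below.
        log_accept = (alpha - 1.0) * (logp(y_new) - logp(y))
        if math.log(random.random()) < log_accept:
            y = y_new
        counts[y] = counts.get(y, 0) + 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}
```

On this tiny state space the empirical distribution after a few tens of thousands of steps closely matches the enumerated $p^\alpha$; on real models each proposal costs a fresh suffix rollout, which is the slowness the paper targets.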
Strategy 2: Rollout estimation (this paper — Ji et al. 2026)