https://arxiv.org/abs/2601.16175

MCTS-esque search for maximum reward on a single task, with no attention to generalization. However, a policy-gradient (PG) algorithm updates the model's weights as the search proceeds.

The PG algorithm is a variant of standard policy gradient, except it maximizes E[f(r)] where f(r) is defined dynamically. f(r) is designed to be max-seeking (it weights higher rewards much more heavily than lower ones), but within a certain KL budget; the overall policy is additionally KL-regularized toward the reference policy.

We use PUCT (the standard MCTS exploration/exploitation rule) to guide search: all past states are archived, and the archive determines where future rollouts start from.

Achieves SOTA on a bunch of different tasks.

Core Idea

Instead of just prompting a frozen LLM many times to solve a hard problem (like AlphaEvolve does), fine-tune the LLM's weights in real time on that single problem using reinforcement learning. The model learns from its own failed/partial attempts, getting better at this specific problem as it goes.

Why This Is Different from Normal RL

| Standard RL | TTT-Discover |
|---|---|
| Goal: maximize average reward | Goal: find one great solution |
| Policy must generalize to new problems | Only needs to solve this problem |
| The trained policy is the output | The best solution found is the output |

The Method (two key pieces)

1. Entropic Objective — Instead of maximizing expected reward, maximize $\frac{1}{\beta} \log \mathbb{E}[e^{\beta R}]$. As $\beta \to \infty$ this approaches $\max(R)$; as $\beta \to 0$ it recovers the expected reward. They set $\beta$ adaptively per state (via a KL constraint) so it's stable early and aggressive late.
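The adaptive-$\beta$ idea can be sketched as follows: given a batch of rollout rewards, pick the $\beta$ whose induced softmax weights sit exactly at a KL budget from the uniform distribution, then use those weights in place of the usual uniform average. This is a minimal sketch under my own assumptions (bisection on $\beta$, KL measured against uniform); the paper's exact scheme may differ.

```python
import numpy as np

def entropic_weights(rewards, kl_budget=1.0, beta_hi=1e4, iters=100):
    """Weights w_i ∝ exp(beta * r_i), with beta chosen by bisection so that
    KL(w || uniform) ≈ kl_budget. beta = 0 recovers uniform averaging
    (standard expected-reward PG); beta → ∞ concentrates on max(r)."""
    r = np.asarray(rewards, dtype=float)
    r = r - r.max()  # shift for numerical stability; softmax is unchanged
    n = len(r)

    def kl_to_uniform(beta):
        w = np.exp(beta * r)
        w /= w.sum()
        # KL(w || uniform) = sum_i w_i * log(n * w_i); increasing in beta
        return float(np.sum(w * np.log(np.clip(w * n, 1e-300, None))))

    lo, hi = 0.0, beta_hi
    if kl_to_uniform(hi) <= kl_budget:
        beta = hi  # even a very peaked distribution stays within budget
    else:
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if kl_to_uniform(mid) < kl_budget:
                lo = mid
            else:
                hi = mid
        beta = lo

    w = np.exp(beta * r)
    w /= w.sum()
    return beta, w
```

A tight budget keeps the weights near uniform (stable early); a loose budget lets them concentrate on the best rollouts (aggressive late).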

2. PUCT-based Reuse — Maintain a buffer of past solutions. Pick which one to continue from using a PUCT-style score (like AlphaZero's tree search), but use the max child reward instead of the mean — because they only care about the best outcome from a state, not the average.

Training Loop

For 50 steps:
  1. Sample 512 rollouts (8 groups × 64), each conditioned on a
     state chosen via PUCT from the buffer
  2. Compute rewards → adaptive entropic advantages
  3. One LoRA gradient step on gpt-oss-120b
  4. Add new solutions to buffer
Return the best solution ever found
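The control flow above can be sketched as a toy skeleton. The real components (the LLM, PUCT selection, entropic advantages, the LoRA step) are replaced with stand-ins; only the loop structure from the notes is kept.

```python
import random

def ttt_discover_toy(steps=50, groups=8, rollouts_per_group=64, seed=0):
    """Toy skeleton of the training loop: states are just scores, and a
    'rollout' perturbs the best archived score. Stand-ins, not the method."""
    rng = random.Random(seed)
    buffer = [0.0]            # archive of past solutions (here: scores)
    best = float("-inf")
    for _ in range(steps):
        # 1. choose starting states (stand-in for PUCT over the buffer)
        starts = [max(buffer) for _ in range(groups)]
        new = []
        for s in starts:
            for _ in range(rollouts_per_group):
                # 2. stand-in rollout: perturb the starting state's score
                new.append(s + rng.gauss(0.0, 0.1))
        # 3. the gradient step on the policy would happen here
        # 4. archive the new solutions
        buffer.extend(new)
        best = max(best, max(new))
    return best  # best solution ever found, not the policy
```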

Results (all with open-weight gpt-oss-120b, ~$500/problem)

| Domain | Problem | Prior SOTA | TTT-Discover |
|---|---|---|---|
| Math | Erdős min overlap (↓) | 0.380924 (AlphaEvolve) | 0.380876 |
| Math | Autocorr inequality C₁ (↓) | 1.50314 | 1.50287 |
| Kernels | TriMul H100 runtime (↓) | 1371 µs (best human) | 1161 µs |
| Kernels | TriMul A100 (↓) | 4531 µs | 2198 µs (2× faster) |
| Algorithms | AtCoder AHC039 (↑) | 558,026 (AI) / 566,997 (human) | 567,062 (1st place) |
| Biology | Single-cell denoising (↑) | 0.64 | 0.71 |

Key Takeaway

Search alone (Best-of-N, evolutionary methods) keeps the LLM frozen — it never internalizes what it learned. TTT-Discover lets the model actually learn during the search, using its own attempts as the most relevant possible training data for an out-of-distribution problem.