https://arxiv.org/abs/2601.16175
MCTS-esque search for maximum reward on a single task, with no attention to generalization. However, a policy-gradient (PG) algorithm updates the model weights as the search goes.
The PG algorithm is a variant of standard policy gradient, except we maximize E[f(r)] where f(r) is defined dynamically. f(r) is designed to be max-seeking (it weights higher rewards much more heavily than lower ones), but it is kept within a certain KL budget, and the overall policy is additionally KL-regularized toward the reference policy.
We use PUCT (the standard MCTS exploration/exploitation rule) to guide search, archiving all past states and using that archive to decide which states to roll out from next.
Achieves SOTA across several domains (math, GPU kernels, algorithm contests, biology).
Instead of just prompting a frozen LLM many times to solve a hard problem (like AlphaEvolve does), fine-tune the LLM's weights in real time on that single problem using reinforcement learning. The model learns from its own failed/partial attempts, getting better at this specific problem as it goes.
| Standard RL | TTT-Discover |
|---|---|
| Goal: maximize average reward | Goal: find one great solution |
| Policy must generalize to new problems | Only needs to solve this problem |
| The trained policy is the output | The best solution found is the output |
1. Entropic Objective — Instead of maximizing expected reward, maximize $\log \mathbb{E}[e^{\beta \cdot R}]$. As $\beta \to \infty$ this approaches $\max(R)$. They set $\beta$ adaptively per state (via a KL constraint) so it's stable early and aggressive late.
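A minimal sketch of how the entropic objective can translate into per-rollout weights. Maximizing $\log \mathbb{E}[e^{\beta R}]$ implies reweighting rollouts by a softmax over $\beta R$; `adaptive_beta` below is an assumed illustration of picking $\beta$ against a KL budget (measured here against the uniform weighting), not the paper's exact procedure.

```python
import numpy as np

def entropic_weights(rewards, beta):
    # Softmax over beta * R: the reweighting implied by the gradient
    # of log E[exp(beta * R)]. Shift by the max for numerical stability.
    z = beta * (rewards - rewards.max())
    w = np.exp(z)
    return w / w.sum()

def adaptive_beta(rewards, kl_budget, betas=np.logspace(-3, 3, 200)):
    # Hypothetical sketch: take the largest beta whose reweighting stays
    # within a KL budget of uniform, so the objective is mild early
    # (flat rewards) and aggressive late (spread-out rewards).
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    best = betas[0]
    for b in betas:
        w = entropic_weights(rewards, b)
        kl = np.sum(w * np.log(np.clip(w * n, 1e-12, None)))  # KL(w || uniform)
        if kl <= kl_budget:
            best = b
    return best
```

As $\beta \to 0$ the weights become uniform (standard mean-reward PG); as $\beta \to \infty$ all weight concentrates on the single best rollout.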
2. PUCT-based Reuse — Maintain a buffer of past solutions. Pick which one to continue from using a PUCT-style score (like AlphaZero's tree search), but use the max child reward instead of the mean — because they only care about the best outcome from a state, not the average.
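A toy sketch of that selection rule (field names and the constant `c_puct` are illustrative assumptions): standard PUCT, but the exploitation term is the max reward observed from a buffered state instead of the mean.

```python
import math

def puct_select(nodes, c_puct=1.5):
    # Each node is a dict: {"max_reward": float, "visits": int, "prior": float}.
    # Exploitation = best reward ever seen from this state (not the mean),
    # exploration = the usual PUCT visit-count bonus.
    total = sum(n["visits"] for n in nodes) + 1

    def score(n):
        explore = c_puct * n["prior"] * math.sqrt(total) / (1 + n["visits"])
        return n["max_reward"] + explore

    return max(nodes, key=score)
```

Using the max keeps promising-but-noisy states alive: one great rollout from a state is enough to keep revisiting it, even if its average outcome is poor.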
For 50 steps:
1. Sample 512 rollouts (8 groups × 64), each conditioned on a state chosen via PUCT from the buffer
2. Compute rewards → adaptive entropic advantages
3. One LoRA gradient step on gpt-oss-120b
4. Add new solutions to buffer
Return the best solution ever found
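The loop above can be sketched as follows. This is a toy skeleton under assumed interfaces (`sample_rollout`, `reward_fn`, `update_policy` are stand-ins, and state selection is shown as greedy-on-max rather than full PUCT); it is not the paper's implementation.

```python
def ttt_discover(sample_rollout, reward_fn, update_policy,
                 steps=50, n_rollouts=512):
    # Buffer of (state, reward); None = start a rollout from scratch.
    buffer = [(None, float("-inf"))]
    best_solution, best_reward = None, float("-inf")
    for _ in range(steps):
        # Stand-in for PUCT selection: pick the state with the best reward.
        state, _ = max(buffer, key=lambda sr: sr[1])
        rollouts = [sample_rollout(state) for _ in range(n_rollouts)]
        rewards = [reward_fn(s) for s in rollouts]
        update_policy(rollouts, rewards)   # e.g. one LoRA gradient step
        buffer.extend(zip(rollouts, rewards))
        step_best = max(rewards)
        if step_best > best_reward:
            best_reward = step_best
            best_solution = rollouts[rewards.index(step_best)]
    # Output is the best solution ever found, not the trained policy.
    return best_solution, best_reward
```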
| Domain | Problem | Prior SOTA | TTT-Discover |
|---|---|---|---|
| Math | Erdős min overlap (↓) | 0.380924 (AlphaEvolve) | 0.380876 |
| Math | Autocorr inequality C₁ (↓) | 1.50314 | 1.50287 |
| Kernels | TriMul H100 runtime (↓) | 1371 µs (best human) | 1161 µs |
| Kernels | TriMul A100 (↓) | 4531 µs | 2198 µs (2× faster) |
| Algorithms | AtCoder AHC039 (↑) | 558,026 (AI) / 566,997 (human) | 567,062 (1st place) |
| Biology | Single-cell denoising (↑) | 0.64 | 0.71 |
Search alone (Best-of-N, evolutionary methods) keeps the LLM frozen, so it never internalizes what it discovers. TTT-Discover lets the model actually learn during the search, using its own attempts as the most relevant possible training data for an out-of-distribution problem.