- See also Deep/formal reasoning
- References
- Concepts
- pass@k: succeeds if any of the k sampled answers matches ground truth (an eval metric, not a deployable selector)
- maj@K: choose answer via majority voting over K samples (self-consistency)
- rerank@K: choose answer using a ranking/reward model over K samples
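A minimal sketch of the three selection schemes above (toy Python; `is_correct` and `score` are stand-ins for ground truth and a ranking/reward model):

```python
from collections import Counter

def pass_at_k(answers, is_correct):
    # pass@k: succeeds if ANY of the k samples is correct; needs ground
    # truth, so it's an evaluation metric rather than a deployable selector
    return any(is_correct(a) for a in answers)

def maj_at_k(answers):
    # maj@K: majority vote over the K final answers (self-consistency)
    return Counter(answers).most_common(1)[0][0]

def rerank_at_k(answers, score):
    # rerank@K: pick the answer the ranking/reward model scores highest
    return max(answers, key=score)
```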
- Scalable power sampling
- Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use, 2025
- Learning to discover at test time, 2025
- MCTS-esque search to maximize reward on a single task, with no attention to generalization; however, it uses a policy-gradient algorithm to update the weights as the search proceeds.
- RARO, 2025
- Rewards can be hard to create, but you have expert demos. Instead of SFT on expert demonstrations, RL with reward determined by a critic that distinguishes policy responses from expert responses. Shared weights, pairwise comparisons, trained together.
- Reasoning with Sampling, 2025
- Sample from the sharpened distribution P(seq)^alpha (concentrating on high joint-probability sequences) by regenerating suffixes from intermediate points, with the base model P(seq) as the proposal dist and a Metropolis-Hastings acceptance ratio targeting P(seq)^alpha
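A toy sketch of that MH loop, assuming hypothetical `sample_from_base(prefix)` and `logp(seq)` wrappers around the base LM:

```python
import math
import random

def power_sample(sample_from_base, logp, init_seq, alpha=4.0, steps=100):
    """Metropolis-Hastings targeting p(seq)^alpha, with the base model as
    proposal: rewind to a random intermediate point and regenerate the
    suffix. Because the proposal density is the base model itself, the
    log acceptance ratio reduces to (alpha - 1) * (logp(prop) - logp(seq))."""
    seq = init_seq
    for _ in range(steps):
        cut = random.randrange(1, len(seq) + 1)  # random rewind point
        prop = sample_from_base(seq[:cut])       # regenerate the suffix
        log_accept = (alpha - 1.0) * (logp(prop) - logp(seq))
        if random.random() < math.exp(min(0.0, log_accept)):
            seq = prop
    return seq
```

With alpha > 1 this biases sampling toward high joint-probability sequences without retraining the model.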
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, Databricks 2025
- Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search, 2025
- Tree search that dynamically chooses between deepening vs broadening
- Energy-Based Transformers are Scalable Learners and Thinkers, 2025
- Take multiple gradient steps on a learned energy function with respect to the token distribution, at both train and inference time
- Temporal Sampling for Forgotten Reasoning in LLMs, 2025
- Sample from earlier training snapshots as well as the final model, since earlier checkpoints can solve problems the final model has forgotten how to solve
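A sketch of the idea, round-robining a K-sample budget across snapshots (the `generate` interface on each checkpoint is an assumption, not the paper's API):

```python
def temporal_sample(checkpoints, prompt, k):
    # Spread the K-sample budget across training snapshots (earliest to
    # final) instead of sampling only from the final model, since earlier
    # checkpoints may still solve problems the final model has forgotten.
    outs = []
    for i in range(k):
        ckpt = checkpoints[i % len(checkpoints)]  # round-robin over snapshots
        outs.append(ckpt.generate(prompt))
    return outs
```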
- Reinforcement Learning Teachers of Test Time Scaling, Sakana 2025
- Teacher generates a COT given the question and solution, for student distillation; rewarded by the student's likelihood (reduced perplexity) on the solution
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search, 2025
- Break down COTs into steps, RL to continue/reflect/explore alt branches at each step, and use random rewinds of both successful and unsuccessful rollouts
- Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs, ByteDance 2025
- With a COT-pass@k metric that uses an LLM as a process reward model (verifying the reasoning chain, not just the answer), found that RLVR does improve COTs
- Reasoning to Learn from Latent Thoughts, Stanford 2025
- Learn to generate additional thought text interspersed with the original text (trained via EM). Focused on MATH, though, where this isn't clearly different from normal reasoning?
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025
- Hierarchical Reasoning Model, 2025
- RNN (with xformer blocks) with inner and outer loop blocks forming 2 levels
- also trains with a one-step gradient approximation; no backprop through time
- Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models (LATS), 2024
- Test-time MCTS with self-eval, failure reflections
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, 2024
- Seq revisions vs parallel BON vs beam/tree search
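One of the compared strategies, verifier-guided beam search over reasoning steps, as a toy sketch (`propose` and `score` are stand-ins for the LM step sampler and a process reward model):

```python
def beam_search_steps(prompt, propose, score, width=4, expand=2, depth=3):
    # Keep the top-`width` partial solutions; each step, expand every
    # survivor into `expand` candidate continuations and rescore the pool.
    beam = [prompt]
    for _ in range(depth):
        pool = [p + propose(p) for p in beam for _ in range(expand)]
        beam = sorted(pool, key=score, reverse=True)[:width]
    return beam[0]
```

Parallel best-of-N is the degenerate case depth=1 with a large `expand`; sequential revision instead feeds each full attempt back in as context.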