• See also Deep/formal reasoning
  • References
    • https://lilianweng.github.io/posts/2023-06-23-agent/
    • Reasoning survey
    • https://github.com/atfortes/LLM-Reasoning-Papers?tab=readme-ov-file
    • https://arxiv.org/abs/2305.14992
    • https://evjang.com/2023/03/26/self-reflection.html
    • https://github.com/atfortes/Awesome-LLM-Reasoning
  • Concepts
    • pass@k: counts success if any of the k samples matches ground truth (needs an answer key / verifier)
    • maj@K: choose the answer by majority vote over K samples (self-consistency)
    • rerank@K: choose the answer a ranking/reward model scores highest among K samples
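A minimal sketch of the three selection rules (the answer list and the toy reward model are hypothetical, just to make them runnable):

```python
from collections import Counter

def pass_at_k(samples, is_correct):
    """pass@k: success if ANY of the k samples matches ground truth."""
    return any(is_correct(s) for s in samples)

def maj_at_k(samples):
    """maj@K: pick the most frequent final answer (self-consistency voting)."""
    return Counter(samples).most_common(1)[0][0]

def rerank_at_k(samples, score):
    """rerank@K: pick the sample a ranking/reward model scores highest."""
    return max(samples, key=score)

answers = ["42", "41", "42", "42", "7"]
solved = pass_at_k(answers, lambda a: a == "42")        # True
voted = maj_at_k(answers)                               # "42"
top = rerank_at_k(answers, lambda a: float(a == "42"))  # toy reward model
print(solved, voted, top)
```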
  • Scalable power sampling
  • Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use, 2025
  • Learning to discover at test time, 2025
    • MCTS-like search maximizing reward on a single task, with no attention to generalization; however, a policy-gradient algorithm updates the weights as search proceeds
  • RARO, 2025
    • Rewards can be hard to create, but expert demos are available. Instead of SFT on the expert demonstrations, do RL with the reward given by a critic trained to distinguish policy responses from expert responses (shared weights, pairwise comparisons, trained jointly)
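One plausible reading of the pairwise critic reward, sketched under a Bradley-Terry assumption (the function names and the length-based toy critic are hypothetical, not from the paper):

```python
import math

def critic_reward(critic_score, policy_resp, expert_resp):
    """Pairwise reward for the policy: the (Bradley-Terry) probability that
    the critic rates the policy response above the paired expert response.
    The critic is trained in tandem to push this down; the policy is
    trained by RL to push it up."""
    diff = critic_score(policy_resp) - critic_score(expert_resp)
    return 1.0 / (1.0 + math.exp(-diff))

# Hypothetical toy critic: score by length, just to exercise the function.
r = critic_reward(len, "a longer, more expert-like answer", "short")
print(r)
```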
  • Reasoning with Sampling, 2025
    • Sample from the power distribution P(seq)^alpha, concentrating on high joint-probability sequences, by regenerating from intermediate points and accepting with the Metropolis-Hastings ratio (the base model P(seq) is the proposal distribution)
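The loop above can be sketched as follows. Because the proposal is the base model itself, the MH acceptance ratio collapses to (p(new)/p(old))^(alpha-1). The i.i.d.-bit "base model" is a hypothetical stand-in for an LLM, just to make the sketch runnable:

```python
import math
import random

def mh_power_sampling(logp_seq, regenerate, seq, alpha=4.0, steps=200):
    """Metropolis-Hastings sampling from p(seq)^alpha: cut the sequence at a
    random point and let the base model regenerate the suffix. Since the
    proposal IS the base model, the acceptance ratio simplifies to
    (p(new)/p(old))**(alpha - 1)."""
    cur, cur_lp = seq, logp_seq(seq)
    for _ in range(steps):
        cut = random.randrange(1, len(cur) + 1)
        cand = regenerate(cur[:cut])  # base model completes the prefix
        cand_lp = logp_seq(cand)
        if math.log(random.random()) < (alpha - 1) * (cand_lp - cur_lp):
            cur, cur_lp = cand, cand_lp
    return cur

# Toy "base model": i.i.d. bits with P(1) = 0.8; power sampling with
# alpha > 1 concentrates mass on the high-probability (mostly-1s) sequences.
P1 = 0.8
def toy_logp(seq):
    return sum(math.log(P1 if b == 1 else 1 - P1) for b in seq)
def toy_regen(prefix, n=8):
    return prefix + tuple(1 if random.random() < P1 else 0
                          for _ in range(n - len(prefix)))

random.seed(0)
out = mh_power_sampling(toy_logp, toy_regen, toy_regen(()))
print(out)
```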
  • GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, Databricks 2025
  • Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search, 2025
    • Tree search that dynamically chooses between deepening vs broadening
  • Energy-Based Transformers are Scalable Learners and Thinkers, 2025
    • Take multiple gradient steps optimizing an energy function with respect to the token distribution, at both train and inference time
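A minimal numpy sketch of that inner loop, assuming a toy quadratic energy (in the paper the energy is a learned network; everything here is illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def refine_prediction(energy_grad, logits, lr=0.5, steps=20):
    """Shared inner loop for training and inference: several gradient steps
    on the energy with respect to the (relaxed) token distribution."""
    z = logits.copy()
    for _ in range(steps):
        y = softmax(z)
        g = energy_grad(y)
        # chain rule through softmax: dE/dz_i = y_i * (g_i - sum_j y_j g_j)
        z -= lr * y * (g - (y * g).sum())
    return softmax(z)

# Toy quadratic energy pulling the distribution toward a target one-hot.
target = np.array([0.0, 0.0, 1.0, 0.0])
energy_grad = lambda y: 2.0 * (y - target)

y_final = refine_prediction(energy_grad, np.zeros(4))
print(y_final.argmax())  # 2
```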
  • Temporal Sampling for Forgotten Reasoning in LLMs, 2025
    • Use earlier snapshots as well as final, since they can solve problems we forget how to solve
  • Reinforcement Learning Teachers of Test Time Scaling, Sakana 2025
    • Generate a CoT given the solution, for distillation; reward is based on the student's perplexity on the solution given the CoT
  • Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search, 2025
    • Break CoTs into steps; RL the model to continue, reflect, or explore alternative branches at each step, using random rewinds of both successful and unsuccessful rollouts
  • Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs, ByteDance 2025
    • With a CoT-pass@k metric that uses an LLM as a process reward model (PRM) to verify the chain, found that RL does improve CoTs
  • Reasoning to Learn from Latent Thoughts, Stanford 2025
    • Learn to generate additional latent "thought" text interspersed with the original corpus (learned via EM). Focused on MATH, though, where it is not clearly different from normal reasoning?
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025
    • Pure RL (GRPO with rule-based rewards) on the base model, no SFT cold start for R1-Zero; R1 adds a small cold-start SFT stage
  • Hierarchical Reasoning Model, 2025
    • RNN (built from transformer blocks) with inner and outer loop modules forming 2 levels
    • Also trains with a one-step gradient approximation, no backprop through time
  • Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models (LATS), 2024
    • Test-time MCTS with self-evaluation and reflections on failures
  • Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, 2024
    • Compares sequential revisions vs parallel best-of-N vs beam/tree search
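Two of those three strategies can be sketched in a few lines; the bit-string "solutions" and the count-the-1s scorer are hypothetical stand-ins for the model and a verifier:

```python
import heapq
import random

def best_of_n(generate, score, n):
    """Parallel best-of-N: sample n full solutions, keep the best-scoring one."""
    return max((generate() for _ in range(n)), key=score)

def beam_search(expand, score, init, width, depth):
    """Beam/tree search: keep the `width` best partial solutions at each step."""
    beam = [init]
    for _ in range(depth):
        beam = heapq.nlargest(width,
                              (c for s in beam for c in expand(s)),
                              key=score)
    return max(beam, key=score)

# Toy setup: "solutions" are bit strings, the verifier counts 1s.
expand = lambda s: [s + "0", s + "1"]
score = lambda s: s.count("1")

best_path = beam_search(expand, score, "", width=2, depth=5)
print(best_path)  # "11111"

random.seed(0)
gen = lambda: "".join(random.choice("01") for _ in range(5))
pick = best_of_n(gen, score, n=8)
print(pick)
```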