• See also Deep/formal reasoning

  • References

    • https://lilianweng.github.io/posts/2023-06-23-agent/
    • Reasoning survey
    • https://github.com/atfortes/LLM-Reasoning-Papers?tab=readme-ov-file
    • https://arxiv.org/abs/2305.14992
    • https://evjang.com/2023/03/26/self-reflection.html
    • https://github.com/atfortes/Awesome-LLM-Reasoning
  • Concepts

    • pass@k: a problem counts as solved if any of the k sampled answers matches the ground truth (an oracle upper bound on selection)
    • maj@K: choose the answer by majority vote over K samples (self-consistency)
    • rerank@K: choose the answer using a ranking/verifier model over the K samples (sketch below)
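    • A minimal sketch of how the three selection schemes differ, assuming we already have K sampled answers per problem; `score` stands in for a hypothetical ranking/verifier model:

      ```python
      from collections import Counter

      def pass_at_k(samples, ground_truth):
          # pass@k: the problem counts as solved if ANY sample is correct (oracle selection).
          return any(s == ground_truth for s in samples)

      def maj_at_k(samples):
          # maj@K: pick the most frequent answer (self-consistency-style voting).
          return Counter(samples).most_common(1)[0][0]

      def rerank_at_k(samples, score):
          # rerank@K: pick the answer a ranking/verifier model scores highest.
          return max(samples, key=score)

      samples = ["42", "41", "42", "17"]
      print(pass_at_k(samples, "42"))   # True
      print(maj_at_k(samples))          # "42"
      print(rerank_at_k(samples, len))  # toy scorer: longest answer string
      ```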
  • Reasoning to Learn from Latent Thoughts, Stanford 2025

  • Entropix

    • Intro, Github
    • While generating, monitor entropy and “varentropy” (the variance of token surprisal under the next-token distribution)
    • If entropy is low, sample roughly as normal or just adjust sampling params. If entropy is high, inject CoT tokens, branch to explore multiple continuations, or resample (sketch below)
    • Has OSS repo demoing on Llama-3.2-1B-Instruct
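    • A rough sketch of the entropy/varentropy gate (the thresholds and branch actions here are illustrative assumptions, not the repo’s actual logic):

      ```python
      import numpy as np

      def entropy_varentropy(logits):
          # Softmax over the vocabulary, then entropy = expected surprisal and
          # varentropy = variance of surprisal under that same distribution.
          p = np.exp(logits - logits.max())
          p /= p.sum()
          surprisal = -np.log(p + 1e-12)
          entropy = float((p * surprisal).sum())
          varentropy = float((p * (surprisal - entropy) ** 2).sum())
          return entropy, varentropy

      def choose_strategy(logits, ent_thresh=2.0, var_thresh=4.0):
          ent, var = entropy_varentropy(logits)
          if ent < ent_thresh and var < var_thresh:
              return "sample_normally"        # confident and stable
          if ent < ent_thresh:
              return "resample_or_branch"     # confident on average but unstable
          return "insert_cot_or_explore"      # high entropy: pause and "think"

      logits = np.array([2.0, 1.0, 0.2, -1.0])
      print(choose_strategy(logits))          # "sample_normally" for this confident case
      ```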
  • Stream Of Search, Noah Goodman, 2024

    • Paper
    • Rather than only training language models on perfect solutions, SoS trains them on the entire search process, including mistakes, dead ends, and backtracking, represented as a flattened string; the traces are all generated by symbolic solvers, which is a key limitation (toy trace sketch below).
    • Done only for a simple “Countdown” game.
    • Further RL/StaR it to surpass and solve problems that symbolic solvers couldn’t (symbolic solver may have limited search space/depth/etc).
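    • A toy sketch of the flattened-trace idea on Countdown: a symbolic DFS solver logs every state it visits, including dead ends and backtracks, and the joined log becomes a training string (illustrative format, not the paper’s exact trace grammar):

      ```python
      from itertools import combinations

      def countdown_trace(nums, target, log):
          # Depth-first search that records the whole search process as text.
          log.append(f"state {sorted(nums)} target {target}")
          if target in nums:
              log.append(f"found {target}")
              return True
          if len(nums) == 1:
              log.append("dead end, backtrack")
              return False
          for a, b in combinations(nums, 2):
              rest = list(nums)
              rest.remove(a)
              rest.remove(b)
              for val, op in [(a + b, "+"), (a * b, "*"), (abs(a - b), "-")]:
                  log.append(f"try {a}{op}{b}={val}")
                  if countdown_trace(rest + [val], target, log):
                      return True
          log.append("dead end, backtrack")
          return False

      log = []
      countdown_trace([3, 5, 2], 13, log)
      print(" ; ".join(log))  # the entire search, mistakes included, as one string
      ```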
  • Quiet-STaR, 2024

    • Use REINFORCE to learn helpful “thoughts”

    • The core technique presented in this paper is called Quiet-STaR (Quiet Self-Taught Reasoner). It aims to teach language models to generate useful internal "thoughts" or rationales to improve their ability to predict future text. The technique operates in three main steps:

      1. Think: Generate rationales in parallel
      2. Talk: Mix predictions with and without rationales
      3. Learn: Optimize rationale generation
    • Let's go through each step with concrete examples:

      1. Think: Generate rationales in parallel

        The model generates short "thoughts" or rationales after each token in the input sequence. These thoughts are meant to help predict future tokens.

        Example: Input: "The cat sat on the"

        The model might generate thoughts like:
        After "The": <thought>Likely a noun coming next</thought>
        After "cat": <thought>Probably a verb next</thought>
        After "sat": <thought>Location coming up</thought>
        After "on": <thought>Probably "the" followed by a surface</thought>
        After "the": <thought>Noun coming, likely a surface</thought>

      2. Talk: Mix predictions with and without rationales

        For each token, the model makes two predictions:
        a) A base prediction without using the thought
        b) A prediction incorporating the generated thought

        These are then combined using a learned "mixing weight" to produce a final prediction.

        Example: For predicting the token after "The cat sat on the":

        Base prediction: {mat: 0.3, floor: 0.2, chair: 0.1, ...}
        Prediction with thought: {mat: 0.5, floor: 0.3, chair: 0.05, ...}
        Mixing weight: 0.7

        Final prediction: (1 - 0.7) * base + 0.7 * with-thought = 0.3 * {mat: 0.3, floor: 0.2, ...} + 0.7 * {mat: 0.5, floor: 0.3, ...} = {mat: 0.44, floor: 0.27, ...}
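
        A tiny sketch of that mixing arithmetic (in the paper the weight comes from a small learned “mixing head”; the numbers here are the toy ones from the example above):

        ```python
        def mix(base, with_thought, w):
            # w is the weight on the thought-conditioned prediction.
            vocab = set(base) | set(with_thought)
            return {t: (1 - w) * base.get(t, 0.0) + w * with_thought.get(t, 0.0)
                    for t in vocab}

        base = {"mat": 0.3, "floor": 0.2, "chair": 0.1}
        with_thought = {"mat": 0.5, "floor": 0.3, "chair": 0.05}
        print(mix(base, with_thought, 0.7))
        # {'mat': 0.44, 'floor': 0.27, 'chair': 0.065} -- mass shifts toward "mat"
        ```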

      3. Learn: Optimize rationale generation

        The model learns to generate better rationales by comparing the likelihood of the true next tokens with and without the rationale. Rationales that improve prediction are reinforced.

        Example: True next token: "mat"

        Likelihood without rationale: 0.3
        Likelihood with rationale: 0.5

        The model would adjust its parameters to make it more likely to generate thoughts like "Noun coming, likely a surface" in similar contexts, as this thought improved the prediction.
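
        A sketch of the resulting REINFORCE signal for one thought (simplified: the paper also subtracts a baseline across the parallel thoughts and scores a short window of future tokens, not just one):

        ```python
        import math

        def thought_reward(p_true_with_thought, p_true_without):
            # Positive when the thought made the true continuation more likely.
            return math.log(p_true_with_thought) - math.log(p_true_without)

        r = thought_reward(0.5, 0.3)  # ~ +0.51 for the "mat" example above
        # REINFORCE-style update (pseudo-loss): scale the log-prob of the thought
        # tokens by the reward, so thoughts that helped become more likely.
        # loss_thought = -r * sum(log_prob(tok) for tok in thought_tokens)
        print(round(r, 2))
        ```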

    • They generate all thoughts in parallel. A key part is modifying the attention mask so that each thought token attends to itself, to the preceding tokens within the same thought, and to the preceding text (rough sketch below).
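
      A rough sketch of that mask, assuming (purely for illustration) a layout with all text tokens first and one thought of length T appended per text token; the actual implementation lays tokens out differently but enforces the same visibility rules:

      ```python
      import numpy as np

      def thought_mask(n_text, thought_len):
          # Rows = queries, cols = keys; True means "may attend".
          n = n_text + n_text * thought_len
          mask = np.zeros((n, n), dtype=bool)
          # Text tokens: ordinary causal attention over the text only.
          for i in range(n_text):
              mask[i, : i + 1] = True
          # Thought j (inserted after text token j): sees text[:j+1],
          # earlier tokens of its own thought, and itself -- but no other thought.
          for j in range(n_text):
              start = n_text + j * thought_len
              for t in range(thought_len):
                  row = start + t
                  mask[row, : j + 1] = True
                  mask[row, start : start + t + 1] = True
          return mask

      print(thought_mask(n_text=3, thought_len=2).astype(int))
      ```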

    • Key aspects:

      • The model uses special tokens <startofthought> and <endofthought> to denote rationales.
      • Rationales are generated in parallel for efficiency.
      • A "mixing head" learns to determine how much to rely on the rationale-informed prediction vs. the base prediction.
      • The technique uses REINFORCE to provide a learning signal for generating useful rationales.
      • The model is trained on general web text, allowing it to learn to reason about a wide variety of topics.

      By iteratively improving its ability to generate useful thoughts, the model learns to reason better about the text it's processing, leading to improved performance on tasks that require reasoning, even without specific fine-tuning for those tasks.

    • The sampled thoughts shown in the paper don’t look like great examples.

  • [2402.14083] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Meta 2024

  • LLM-MCTS, NeurIPS 2023

    • Paper
    • Like RAP, uses the LLM in two roles:
      • As a world model to predict states (common-sense beliefs about object locations)
      • As a policy to suggest promising actions
    • LLM-MCTS is more focused on physical world planning using common sense knowledge, while RAP is a more general framework for complex reasoning tasks that treats the reasoning process itself as a form of planning
  • LATS, UIUC 2023

    • Paper
    • Like RAP, uses MCTS over an LLM, but grounds the search in actual environments rather than simulated ones
    • Comparisons
      • More systematic exploration of possibilities vs ReAct
      • Better grounding through environment feedback vs ToT
      • More reliable feedback through actual interaction vs RAP's simulated outcomes
    • "Since our method is based on Monte Carlo Tree Search and is model-free, one limitation of LATS on decision-making tasks is that it requires the agent to be able to revert to earlier states in the environments... this reversion property is feasible in many real-world applications (despite being not universally applicable in all possible environments)"
  • Reasoning with Language Model is Planning with World Model, Daisy Wang, 2023

    • Paper
    • RAP (Reasoning via Planning): MCTS over model of world that is simulated by LLM.
      • Uses deliberate exploration of alternatives
      • Can simulate and evaluate different paths
      • Can backtrack and try new approaches
      • Balances exploration vs exploitation
    • Has specific prompts for these various roles played by the LLM: action generation/agent, predict next state of world, specific format for math, and reward calculation
    • tl;dr: simulate model-based RL with an LLM (compressed sketch below)
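    • A compressed sketch of that loop (RAP proper runs full MCTS with UCT selection, expansion, simulation, and backpropagation; `llm(prompt) -> str` is a hypothetical completion helper and the prompts are placeholders):

      ```python
      import heapq

      def rap_search(llm, init_state, n_expansions=3, n_actions=3):
          # Best-first search over LLM-simulated states; the same LLM plays the
          # policy, world-model, and reward roles via different prompts.
          frontier = [(0.0, init_state, [])]
          best = (float("-inf"), [])
          for _ in range(n_expansions):
              _score, state, trace = heapq.heappop(frontier)
              for i in range(n_actions):
                  action = llm(f"State: {state}\nPropose action #{i + 1}:")            # policy
                  nxt = llm(f"State: {state}\nAction: {action}\nPredict next state:")  # world model
                  reward = float(llm(f"Rate this state from 0 to 1: {nxt}"))           # reward
                  if reward > best[0]:
                      best = (reward, trace + [action])
                  heapq.heappush(frontier, (-reward, nxt, trace + [action]))
          return best  # (best reward seen, action sequence that reached it)
      ```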
  • Self taught reasoner (STaR), 2022

    • Bootstraps a rationale-labeled dataset from a handful of rationale examples plus a larger dataset of problems without rationales
    • Technique
      1. Start with a pre-trained language model and a dataset of problems with answers (but no rationales).
      2. Provide a small set of few-shot examples with rationales to prompt the model.
      3. Use the model to generate rationales and answers for all problems in the dataset.
      4. Filter the generated rationales, keeping only those that led to correct answers.
      5. For problems where the model generated incorrect answers, perform "rationalization": provide the correct answer as a hint, then ask the model to generate a new rationale for the correct answer.
      6. Combine the filtered rationales from step 4 and the rationalizations from step 5 into a new dataset.
      7. Fine-tune the original pre-trained model on this new dataset of questions, rationales, and answers.
      8. Repeat steps 3-7 for multiple iterations, using the newly fine-tuned model each time.
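    • A sketch of one iteration of the loop above (the `generate` and `finetune` callables are hypothetical stand-ins for sampling and fine-tuning):

      ```python
      def star_iteration(base_model, model, problems, fewshot_prompt, generate, finetune):
          # problems: list of (question, gold_answer) pairs;
          # generate(model, prompt) -> (rationale, answer).
          examples = []
          for q, gold in problems:
              rationale, answer = generate(model, fewshot_prompt + q)
              if answer == gold:
                  # Keep rationales that led to the correct answer.
                  examples.append((q, rationale, gold))
              else:
                  # Rationalization: give the answer as a hint, ask for a new rationale.
                  hinted = fewshot_prompt + q + f"\n(Hint: the answer is {gold})"
                  rationale, _ = generate(model, hinted)
                  examples.append((q, rationale, gold))
          # Always fine-tune from the original pre-trained model, not the previous iterate.
          return finetune(base_model, examples)

      # Outer loop (step 8): repeat, generating with the newly fine-tuned model each time.
      # model = base_model
      # for _ in range(n_iterations):
      #     model = star_iteration(base_model, model, train_problems, fewshot, generate, finetune)
      ```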
    • Natural question: what about rationalization-only? A reviewer asked this too, and the authors ran a new experiment in response.
  • [2211.09066] Teaching Algorithmic Reasoning via In-context Learning

  • [2303.04910] Baldur: Whole-Proof Generation and Repair with Large Language Models

  • [2202.01344] Formal Mathematics Statement Curriculum Learning

  • https://arxiv.org/abs/2401.08967 ReFT: Reasoning with Reinforced Fine-Tuning

  • https://arxiv.org/abs/2401.00757 A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models

  • Self critique

  • Self refine

  • Self-consistency chain of thought