• See also Deep/formal reasoning

  • References

    • https://lilianweng.github.io/posts/2023-06-23-agent/
    • Reasoning survey
    • https://github.com/atfortes/LLM-Reasoning-Papers?tab=readme-ov-file
    • https://arxiv.org/abs/2305.14992
    • https://evjang.com/2023/03/26/self-reflection.html
    • https://github.com/atfortes/Awesome-LLM-Reasoning
  • Concepts

    • pass@k: a problem counts as solved if any of the k sampled answers matches the ground truth (an oracle upper bound on selection)
    • maj@K: choose the answer by majority vote over K samples (self-consistency)
    • rerank@K: choose the answer using a ranking/verifier model over the K samples (sketch below)
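    • A minimal sketch of how the three selection schemes differ, assuming we already have K sampled answers per problem; `score` stands in for a hypothetical ranking/verifier model:

      ```python
      from collections import Counter

      def pass_at_k(samples, ground_truth):
          # pass@k: the problem counts as solved if ANY sample is correct (oracle selection).
          return any(s == ground_truth for s in samples)

      def maj_at_k(samples):
          # maj@K: pick the most frequent answer (self-consistency-style voting).
          return Counter(samples).most_common(1)[0][0]

      def rerank_at_k(samples, score):
          # rerank@K: pick the answer a ranking/verifier model scores highest.
          return max(samples, key=score)

      samples = ["42", "41", "42", "17"]
      print(pass_at_k(samples, "42"))   # True
      print(maj_at_k(samples))          # "42"
      print(rerank_at_k(samples, len))  # toy scorer: longest answer string
      ```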
  • Reasoning to Learn from Latent Thoughts, Stanford 2025

  • Entropix

    • Intro, Github
    • While generating, monitor entropy and “varentropy” (the variance of token surprisal under the next-token distribution)
    • If entropy is low, sample roughly as normal or just adjust sampling params. If entropy is high, inject CoT tokens, branch to explore multiple continuations, or resample (sketch below)
    • Has OSS repo demoing on Llama-3.2-1B-Instruct
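    • A rough sketch of the entropy/varentropy gate (the thresholds and branch actions here are illustrative assumptions, not the repo’s actual logic):

      ```python
      import numpy as np

      def entropy_varentropy(logits):
          # Softmax over the vocabulary, then entropy = expected surprisal and
          # varentropy = variance of surprisal under that same distribution.
          p = np.exp(logits - logits.max())
          p /= p.sum()
          surprisal = -np.log(p + 1e-12)
          entropy = float((p * surprisal).sum())
          varentropy = float((p * (surprisal - entropy) ** 2).sum())
          return entropy, varentropy

      def choose_strategy(logits, ent_thresh=2.0, var_thresh=4.0):
          ent, var = entropy_varentropy(logits)
          if ent < ent_thresh and var < var_thresh:
              return "sample_normally"        # confident and stable
          if ent < ent_thresh:
              return "resample_or_branch"     # confident on average but unstable
          return "insert_cot_or_explore"      # high entropy: pause and "think"

      logits = np.array([2.0, 1.0, 0.2, -1.0])
      print(choose_strategy(logits))          # "sample_normally" for this confident case
      ```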
  • Stream Of Search, Noah Goodman, 2024

    • Paper
    • Rather than only training language models on perfect solutions, SoS trains them on the entire search process, including mistakes, dead ends, and backtracking, represented as a flattened string; the traces are all generated by symbolic solvers, which is a key limitation (toy trace sketch below).
    • Done only for a simple “Countdown” game.
    • Further RL/StaR it to surpass and solve problems that symbolic solvers couldn’t (symbolic solver may have limited search space/depth/etc).
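    • A toy sketch of the flattened-trace idea on Countdown: a symbolic DFS solver logs every state it visits, including dead ends and backtracks, and the joined log becomes a training string (illustrative format, not the paper’s exact trace grammar):

      ```python
      from itertools import combinations

      def countdown_trace(nums, target, log):
          # Depth-first search that records the whole search process as text.
          log.append(f"state {sorted(nums)} target {target}")
          if target in nums:
              log.append(f"found {target}")
              return True
          if len(nums) == 1:
              log.append("dead end, backtrack")
              return False
          for a, b in combinations(nums, 2):
              rest = list(nums)
              rest.remove(a)
              rest.remove(b)
              for val, op in [(a + b, "+"), (a * b, "*"), (abs(a - b), "-")]:
                  log.append(f"try {a}{op}{b}={val}")
                  if countdown_trace(rest + [val], target, log):
                      return True
          log.append("dead end, backtrack")
          return False

      log = []
      countdown_trace([3, 5, 2], 13, log)
      print(" ; ".join(log))  # the entire search, mistakes included, as one string
      ```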
  • Quiet-STaR, 2024

    • Use REINFORCE to learn helpful “thoughts”

    • The core technique presented in this paper is called Quiet-STaR (Quiet Self-Taught Reasoner). It aims to teach language models to generate useful internal "thoughts" or rationales to improve their ability to predict future text. The technique operates in three main steps:

      1. Think: Generate rationales in parallel
      2. Talk: Mix predictions with and without rationales
      3. Learn: Optimize rationale generation
    • Let's go through each step with concrete examples:

      1. Think: Generate rationales in parallel

        The model generates short "thoughts" or rationales after each token in the input sequence. These thoughts are meant to help predict future tokens.

        Example: Input: "The cat sat on the"

        The model might generate thoughts like:
        After "The": <thought>Likely a noun coming next</thought>
        After "cat": <thought>Probably a verb next</thought>
        After "sat": <thought>Location coming up</thought>
        After "on": <thought>Probably "the" followed by a surface</thought>
        After "the": <thought>Noun coming, likely a surface</thought>

      2. Talk: Mix predictions with and without rationales

        For each token, the model makes two predictions:
        a) A base prediction without using the thought
        b) A prediction incorporating the generated thought

        These are then combined using a learned "mixing weight" to produce a final prediction.

        Example: For predicting the token after "The cat sat on the":

        Base prediction: {mat: 0.3, floor: 0.2, chair: 0.1, ...}
        Prediction with thought: {mat: 0.5, floor: 0.3, chair: 0.05, ...}
        Mixing weight: 0.7

        Final prediction: (1 - 0.7) * base + 0.7 * with-thought = 0.3 * {mat: 0.3, floor: 0.2, ...} + 0.7 * {mat: 0.5, floor: 0.3, ...} = {mat: 0.44, floor: 0.27, ...}
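
        A tiny sketch of that mixing arithmetic (in the paper the weight comes from a small learned “mixing head”; the numbers here are the toy ones from the example above):

        ```python
        def mix(base, with_thought, w):
            # w is the weight on the thought-conditioned prediction.
            vocab = set(base) | set(with_thought)
            return {t: (1 - w) * base.get(t, 0.0) + w * with_thought.get(t, 0.0)
                    for t in vocab}

        base = {"mat": 0.3, "floor": 0.2, "chair": 0.1}
        with_thought = {"mat": 0.5, "floor": 0.3, "chair": 0.05}
        print(mix(base, with_thought, 0.7))
        # {'mat': 0.44, 'floor': 0.27, 'chair': 0.065} -- mass shifts toward "mat"
        ```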

      3. Learn: Optimize rationale generation

        The model learns to generate better rationales by comparing the likelihood of the true next tokens with and without the rationale. Rationales that improve prediction are reinforced.

        Example: True next token: "mat"

        Likelihood without rationale: 0.3
        Likelihood with rationale: 0.5

        The model would adjust its parameters to make it more likely to generate thoughts like "Noun coming, likely a surface" in similar contexts, as this thought improved the prediction.
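
        A sketch of the resulting REINFORCE signal for one thought (simplified: the paper also subtracts a baseline across the parallel thoughts and scores a short window of future tokens, not just one):

        ```python
        import math

        def thought_reward(p_true_with_thought, p_true_without):
            # Positive when the thought made the true continuation more likely.
            return math.log(p_true_with_thought) - math.log(p_true_without)

        r = thought_reward(0.5, 0.3)  # ~ +0.51 for the "mat" example above
        # REINFORCE-style update (pseudo-loss): scale the log-prob of the thought
        # tokens by the reward, so thoughts that helped become more likely.
        # loss_thought = -r * sum(log_prob(tok) for tok in thought_tokens)
        print(round(r, 2))
        ```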

    • They generate all thoughts in parallel. A key part is modifying the attention mask so that each thought token attends to itself, to the preceding tokens within the same thought, and to the preceding text (rough sketch below).
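
      A rough sketch of that mask, assuming (purely for illustration) a layout with all text tokens first and one thought of length T appended per text token; the actual implementation lays tokens out differently but enforces the same visibility rules:

      ```python
      import numpy as np

      def thought_mask(n_text, thought_len):
          # Rows = queries, cols = keys; True means "may attend".
          n = n_text + n_text * thought_len
          mask = np.zeros((n, n), dtype=bool)
          # Text tokens: ordinary causal attention over the text only.
          for i in range(n_text):
              mask[i, : i + 1] = True
          # Thought j (inserted after text token j): sees text[:j+1],
          # earlier tokens of its own thought, and itself -- but no other thought.
          for j in range(n_text):
              start = n_text + j * thought_len
              for t in range(thought_len):
                  row = start + t
                  mask[row, : j + 1] = True
                  mask[row, start : start + t + 1] = True
          return mask

      print(thought_mask(n_text=3, thought_len=2).astype(int))
      ```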

    • Key aspects:

      • The model uses special tokens <startofthought> and <endofthought> to denote rationales.
      • Rationales are generated in parallel for efficiency.
      • A "mixing head" learns to determine how much to rely on the rationale-informed prediction vs. the base prediction.
      • The technique uses REINFORCE to provide a learning signal for generating useful rationales.
      • The model is trained on general web text, allowing it to learn to reason about a wide variety of topics.

      By iteratively improving its ability to generate useful thoughts, the model learns to reason better about the text it's processing, leading to improved performance on tasks that require reasoning, even without specific fine-tuning for those tasks.

    • The sampled thoughts shown in the paper don’t look like great examples.

  • [2402.14083] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Meta 2024

  • LLM-MCTS, NeurIPS 2023

    • Paper
    • Like RAP, uses the LLM in two roles:
      • As a world model to predict states (common-sense beliefs about object locations)
      • As a policy to suggest promising actions
    • LLM-MCTS is more focused on physical world planning using common sense knowledge, while RAP is a more general framework for complex reasoning tasks that treats the reasoning process itself as a form of planning
  • LATS, UIUC 2023

    • Paper
    • Like RAP, uses MCTS over an LLM, but grounds the search in actual environments rather than simulated ones
    • Comparisons
      • More systematic exploration of possibilities vs ReAct
      • Better grounding through environment feedback vs ToT
      • More reliable feedback through actual interaction vs RAP's simulated outcomes
    • "Since our method is based on Monte Carlo Tree Search and is model-free, one limitation of LATS on decision-making tasks is that it requires the agent to be able to revert to earlier states in the environments... this reversion property is feasible in many real-world applications (despite being not universally applicable in all possible environments)"
  • Reasoning with Language Model is Planning with World Model, Daisy Wang, 2023

    • Paper
    • RAP (Reasoning via Planning): MCTS over model of world that is simulated by LLM.
      • Uses deliberate exploration of alternatives
      • Can simulate and evaluate different paths
      • Can backtrack and try new approaches
      • Balances exploration vs exploitation
    • Has specific prompts for these various roles played by the LLM: action generation/agent, predict next state of world, specific format for math, and reward calculation
    • tl;dr: simulate model-based RL with an LLM (compressed sketch below)
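    • A compressed sketch of that loop (RAP proper runs full MCTS with UCT selection, expansion, simulation, and backpropagation; `llm(prompt) -> str` is a hypothetical completion helper and the prompts are placeholders):

      ```python
      import heapq

      def rap_search(llm, init_state, n_expansions=3, n_actions=3):
          # Best-first search over LLM-simulated states; the same LLM plays the
          # policy, world-model, and reward roles via different prompts.
          frontier = [(0.0, init_state, [])]
          best = (float("-inf"), [])
          for _ in range(n_expansions):
              _score, state, trace = heapq.heappop(frontier)
              for i in range(n_actions):
                  action = llm(f"State: {state}\nPropose action #{i + 1}:")            # policy
                  nxt = llm(f"State: {state}\nAction: {action}\nPredict next state:")  # world model
                  reward = float(llm(f"Rate this state from 0 to 1: {nxt}"))           # reward
                  if reward > best[0]:
                      best = (reward, trace + [action])
                  heapq.heappush(frontier, (-reward, nxt, trace + [action]))
          return best  # (best reward seen, action sequence that reached it)
      ```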
  • Self taught reasoner (STaR), 2022

    • Bootstraps a rationale-labeled dataset from a handful of rationale examples plus a larger dataset of problems without rationales
    • Technique
      1. Start with a pre-trained language model and a dataset of problems with answers (but no rationales).
      2. Provide a small set of few-shot examples with rationales to prompt the model.
      3. Use the model to generate rationales and answers for all problems in the dataset.
      4. Filter the generated rationales, keeping only those that led to correct answers.
      5. For problems where the model generated incorrect answers, perform "rationalization": provide the correct answer as a hint, then ask the model to generate a new rationale for the correct answer.
      6. Combine the filtered rationales from step 4 and the rationalizations from step 5 into a new dataset.
      7. Fine-tune the original pre-trained model on this new dataset of questions, rationales, and answers.
      8. Repeat steps 3-7 for multiple iterations, using the newly fine-tuned model each time.
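    • A sketch of one iteration of the loop above (the `generate` and `finetune` callables are hypothetical stand-ins for sampling and fine-tuning):

      ```python
      def star_iteration(base_model, model, problems, fewshot_prompt, generate, finetune):
          # problems: list of (question, gold_answer) pairs;
          # generate(model, prompt) -> (rationale, answer).
          examples = []
          for q, gold in problems:
              rationale, answer = generate(model, fewshot_prompt + q)
              if answer == gold:
                  # Keep rationales that led to the correct answer.
                  examples.append((q, rationale, gold))
              else:
                  # Rationalization: give the answer as a hint, ask for a new rationale.
                  hinted = fewshot_prompt + q + f"\n(Hint: the answer is {gold})"
                  rationale, _ = generate(model, hinted)
                  examples.append((q, rationale, gold))
          # Always fine-tune from the original pre-trained model, not the previous iterate.
          return finetune(base_model, examples)

      # Outer loop (step 8): repeat, generating with the newly fine-tuned model each time.
      # model = base_model
      # for _ in range(n_iterations):
      #     model = star_iteration(base_model, model, train_problems, fewshot, generate, finetune)
      ```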
    • Natural question: what about rationalization-only? A reviewer asked this too, and the authors ran a new experiment in response.
  • [2211.09066] Teaching Algorithmic Reasoning via In-context Learning

  • [2303.04910] Baldur: Whole-Proof Generation and Repair with Large Language Models

  • [2202.01344] Formal Mathematics Statement Curriculum Learning

  • https://arxiv.org/abs/2401.08967 ReFT: Reasoning with Reinforced Fine-Tuning

  • https://arxiv.org/abs/2401.00757 A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models

  • Self critique

  • Self refine

  • Self-consistency chain of thought