See also Deep/formal reasoning
References
Concepts
Entropix
Stream of Search, Noah Goodman, 2024
Quiet-STaR, 2024
Use REINFORCE to learn helpful “thoughts”
The core technique presented in this paper is called Quiet-STaR (Quiet Self-Taught Reasoner). It aims to teach language models to generate useful internal "thoughts" or rationales that improve their ability to predict future text. The technique operates in three main steps: think, talk, and learn.
Let's go through each step with concrete examples:
Think: Generate rationales in parallel
The model generates short "thoughts" or rationales after each token in the input sequence. These thoughts are meant to help predict future tokens.
Example: Input: "The cat sat on the"
The model might generate thoughts like:
After "The": <thought>Likely a noun coming next</thought>
After "cat": <thought>Probably a verb next</thought>
After "sat": <thought>Location coming up</thought>
After "on": <thought>Probably "the" followed by a surface</thought>
After "the": <thought>Noun coming, likely a surface</thought>
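A minimal sketch of the think step, for intuition only: loop over each prefix and sample a short continuation as a "thought". The sequential loop, the string-level <thought> marker, and the use of gpt2 are all simplifications I'm assuming for illustration; the paper uses learned start/end-of-thought tokens and generates all thoughts in parallel (see the attention-mask sketch further down), and an untrained gpt2 will not produce meaningful rationales.

```python
# Illustrative only: sample a short "thought" after every prefix of the input.
# Quiet-STaR does this for all positions in parallel with a custom attention mask
# and learned start/end-of-thought embeddings; here "<thought>" is an ordinary
# string and gpt2 is a stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The cat sat on the"
ids = tok(text, return_tensors="pt").input_ids[0]

for i in range(1, len(ids) + 1):
    prefix = tok.decode(ids[:i])
    inp = tok(prefix + " <thought>", return_tensors="pt").input_ids
    out = lm.generate(inp, max_new_tokens=8, do_sample=True,
                      pad_token_id=tok.eos_token_id)
    thought = tok.decode(out[0][inp.shape[1]:])
    print(f'after "{prefix}": {thought.strip()}')
```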
Talk: Mix predictions with and without rationales
For each token, the model makes two predictions:
a) A base prediction without using the thought
b) A prediction incorporating the generated thought
These are then combined using a learned "mixing weight" to produce a final prediction.
Example: For predicting the token after "The cat sat on the":
Base prediction: {mat: 0.3, floor: 0.2, chair: 0.1, ...}
Prediction with thought: {mat: 0.5, floor: 0.3, chair: 0.05, ...}
Mixing weight: 0.7
Final prediction: 0.3 * {mat: 0.3, floor: 0.2, ...} + 0.7 * {mat: 0.5, floor: 0.3, ...} = {mat: 0.44, floor: 0.27, ...}
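A small sketch of this mixing, using the toy numbers above. The constant mixing weight and the interpolation in probability space are simplifications I'm assuming here; in the paper the weight comes from a small learned mixing head.

```python
import numpy as np

# Toy next-token distributions over a 3-word vocabulary, matching the example above.
vocab = ["mat", "floor", "chair"]
p_base = np.array([0.3, 0.2, 0.1])      # prediction without the thought
p_thought = np.array([0.5, 0.3, 0.05])  # prediction after the thought
w = 0.7                                  # mixing weight (learned per position in the paper)

# Interpolate the two predictions with the mixing weight.
p_final = (1 - w) * p_base + w * p_thought
print(dict(zip(vocab, np.round(p_final, 2))))  # mat ≈ 0.44, floor ≈ 0.27
```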
Learn: Optimize rationale generation
The model learns to generate better rationales by comparing the likelihood of the true next tokens with and without the rationale, and using that difference as a REINFORCE reward: rationales that improve prediction are reinforced.
Example: True next token: "mat"
Likelihood without rationale: 0.3
Likelihood with rationale: 0.5
The model would adjust its parameters to make it more likely to generate thoughts like "Noun coming, likely a surface" in similar contexts, as this thought improved the prediction.
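A hedged sketch of that update as REINFORCE, using the toy likelihoods above. The reward is the improvement in log-likelihood from including the thought, and it scales the gradient on the log-probabilities of the sampled thought tokens. The thought-token log-probabilities below are placeholder values, and the paper's variance-reduction baseline (subtracting the mean reward over several sampled thoughts) is omitted.

```python
import torch

# Reward: how much the thought improved the log-likelihood of the true next token.
logp_with = torch.log(torch.tensor(0.5))      # p(true token | context, thought)
logp_without = torch.log(torch.tensor(0.3))   # p(true token | context)
reward = (logp_with - logp_without).detach()  # positive -> the thought helped

# Log-probs of the sampled thought tokens under the current model (placeholders;
# in training these carry a computation graph back to the model parameters).
thought_logps = torch.tensor([-2.1, -1.7, -3.0], requires_grad=True)

# REINFORCE surrogate: minimizing it raises the probability of thoughts with
# positive reward and lowers it for thoughts that hurt prediction.
loss = -(reward * thought_logps.sum())
loss.backward()
print(reward.item(), thought_logps.grad)
```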
All thoughts are generated in parallel. A key part of this is modifying the attention mask accordingly: each thought token attends to itself, to the preceding thought tokens within the same thought, and to the preceding text, but not to tokens of other thoughts.
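A small sketch of such a mask, assuming T text tokens, a thought of fixed length L started after every text token, and the flattened layout chosen below purely for illustration; the paper's actual implementation details differ.

```python
import numpy as np

T, L = 5, 3  # 5 text tokens; a 3-token thought is generated after each of them

# Flattened layout: text_i followed by the L tokens of the thought started after it.
# Each entry is (kind, text_position, position_within_thought).
tokens = []
for i in range(T):
    tokens.append(("text", i, 0))
    tokens += [("thought", i, j) for j in range(L)]

n = len(tokens)
allowed = np.zeros((n, n), dtype=bool)  # allowed[q, k]: query q may attend to key k

for q, (q_kind, q_i, q_j) in enumerate(tokens):
    for k, (k_kind, k_i, k_j) in enumerate(tokens):
        if k_kind == "text":
            # Everyone sees the preceding text (and, for text queries, themselves).
            allowed[q, k] = k_i <= q_i
        elif q_kind == "thought":
            # A thought token sees itself and earlier tokens of the *same* thought only.
            allowed[q, k] = k_i == q_i and k_j <= q_j
        # Text queries never attend to thought tokens while thoughts are generated.

print(allowed.astype(int))
```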
Key aspect: by iteratively improving its ability to generate useful thoughts, the model learns to reason better about the text it's processing, which improves performance on tasks that require reasoning even without task-specific fine-tuning.
The example thoughts shown in the paper don't look like great examples.
[2402.14083] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Meta 2024
LLM-MCTS, NeurIPS 2023
LATS, UIUC 2023
Reasoning with Language Model is Planning with World Model, Daisy Wang, 2023
Self taught reasoner (STaR), 2022
[2211.09066] Teaching Algorithmic Reasoning via In-context Learning
[2303.04910] Baldur: Whole-Proof Generation and Repair with Large Language Models
[2202.01344] Formal Mathematics Statement Curriculum Learning
ReFT: Reasoning with Reinforced Fine-Tuning, https://arxiv.org/abs/2401.08967
A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models, https://arxiv.org/abs/2401.00757
Self critique
Self refine
Self-consistency chain of thought