See also Deep/formal reasoning
References
Concepts
Quiet-STaR, 2024
Use REINFORCE to learn helpful “thoughts”
The core technique presented in this paper is called Quiet-STaR (Quiet Self-Taught Reasoner). It aims to teach language models to generate useful internal "thoughts" or rationales to improve their ability to predict future text. The technique operates in three main steps:
Let's go through each step with concrete examples:
Think: Generate rationales in parallel
The model generates short "thoughts" or rationales after each token in the input sequence. These thoughts are meant to help predict future tokens.
Example: Input: "The cat sat on the"
The model might generate thoughts like:
After "The": <thought>Likely a noun coming next</thought>
After "cat": <thought>Probably a verb next</thought>
After "sat": <thought>Location coming up</thought>
After "on": <thought>Probably "the" followed by a surface</thought>
After "the": <thought>Noun coming, likely a surface</thought>
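Roughly, in code: a toy sketch of the parallel "think" step. fake_lm is a made-up stand-in that returns random next-token logits and the token ids are arbitrary; it only shows the shape of the computation (one sampled thought token per step, for every input position at once). The real method does this with a single batched forward pass of the language model, using the modified attention mask described further below.

    import torch

    vocab_size, seq_len, thought_len = 50, 5, 3   # toy sizes

    def fake_lm(token_ids: torch.Tensor) -> torch.Tensor:
        """Stand-in LM: returns random next-token logits of shape (batch, vocab)."""
        return torch.randn(token_ids.shape[0], vocab_size)

    # One "stream" per input position: position i continues the prefix x[0..i].
    streams = [list(range(i + 1)) for i in range(seq_len)]   # arbitrary token ids
    thoughts = [[] for _ in range(seq_len)]

    for _ in range(thought_len):
        # In Quiet-STaR this is one batched forward pass over all positions.
        last_tokens = torch.tensor([s[-1] for s in streams]).unsqueeze(-1)
        logits = fake_lm(last_tokens)
        next_tokens = torch.distributions.Categorical(logits=logits).sample()
        for i, tok in enumerate(next_tokens.tolist()):
            streams[i].append(tok)
            thoughts[i].append(tok)

    print(thoughts)   # one sampled thought (as token ids) per input position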
Talk: Mix predictions with and without rationales
For each token, the model makes two predictions:
a) A base prediction without using the thought
b) A prediction incorporating the generated thought
These are then combined using a learned "mixing weight" to produce a final prediction.
Example: For predicting the token after "The cat sat on the":
Base prediction: {mat: 0.3, floor: 0.2, chair: 0.1, ...}
Prediction with thought: {mat: 0.5, floor: 0.3, chair: 0.05, ...}
Mixing weight: 0.7
Final prediction: 0.3 * {mat: 0.3, floor: 0.2, ...} + 0.7 * {mat: 0.5, floor: 0.3, ...}
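Roughly, in code, using the toy numbers above. In the paper the mixing weight comes from a small learned mixing head over the model's hidden states; here it is just the constant 0.7 from the example, and the distributions are truncated toy dictionaries.

    def mix_predictions(base: dict, with_thought: dict, w: float) -> dict:
        """Interpolate two next-token distributions: (1 - w) * base + w * with_thought."""
        vocab = set(base) | set(with_thought)
        return {tok: (1 - w) * base.get(tok, 0.0) + w * with_thought.get(tok, 0.0)
                for tok in vocab}

    base = {"mat": 0.3, "floor": 0.2, "chair": 0.1}
    with_thought = {"mat": 0.5, "floor": 0.3, "chair": 0.05}

    print(mix_predictions(base, with_thought, w=0.7))
    # roughly {'mat': 0.44, 'floor': 0.27, 'chair': 0.065} (remaining probability mass not shown)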
Learn: Optimize rationale generation
The model learns to generate better rationales by comparing the likelihood of the true next tokens with and without the rationale. Rationales that improve prediction are reinforced.
Example: True next token: "mat"
Likelihood without rationale: 0.3
Likelihood with rationale: 0.5
The model would adjust its parameters to make it more likely to generate thoughts like "Noun coming, likely a surface" in similar contexts, as this thought improved the prediction.
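A toy sketch of that update as a REINFORCE step, using the numbers above. thought_logprobs stands in for the log-probabilities the model assigned to the sampled thought tokens; the paper also normalizes the reward (e.g. against other thoughts sampled at the same position) and keeps the ordinary language-modeling loss, both omitted here.

    import math
    import torch

    p_without = 0.3   # p("mat" | context), from the example above
    p_with = 0.5      # p("mat" | context + thought)
    reward = math.log(p_with) - math.log(p_without)   # positive: the thought helped

    # Toy log-probabilities of the sampled thought tokens under the current policy.
    thought_logprobs = torch.tensor([-1.2, -0.8, -2.0], requires_grad=True)

    # REINFORCE: scale the thought's log-probability by the reward and ascend,
    # i.e. minimize the negated, reward-weighted log-probability.
    loss = -(reward * thought_logprobs.sum())
    loss.backward()
    print(f"reward = {reward:.3f}", thought_logprobs.grad)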
All thoughts are generated in parallel. A key part of this is modifying the attention mask accordingly: each thought token attends to itself, to the preceding thought tokens within the same thought, and to the preceding text.
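A toy sketch of that mask (boolean, True = the query row may attend to the key column). The layout here simply appends each position's thought block after the text, which is not how the actual implementation arranges or vectorizes things, but the allowed-attention pattern is the point: a thought token sees the text prefix up to the position that spawned it, the earlier tokens of its own thought, and itself, while the original text tokens keep their ordinary causal mask.

    import torch

    def thought_attention_mask(n_text: int, thought_len: int) -> torch.Tensor:
        """Boolean mask over [text tokens] + [one thought block per text position]."""
        n_total = n_text + n_text * thought_len
        mask = torch.zeros(n_total, n_total, dtype=torch.bool)
        # Ordinary causal mask among the original text tokens.
        mask[:n_text, :n_text] = torch.tril(torch.ones(n_text, n_text)).bool()
        for pos in range(n_text):                  # a thought is spawned after text position `pos`
            start = n_text + pos * thought_len     # where that thought's tokens live in this layout
            for j in range(thought_len):
                q = start + j
                mask[q, : pos + 1] = True          # preceding text, up to and including `pos`
                mask[q, start : q + 1] = True      # earlier tokens of the same thought, plus itself
        return mask

    print(thought_attention_mask(n_text=3, thought_len=2).int())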
Key aspect: by iteratively improving its ability to generate useful thoughts, the model learns to reason better about the text it's processing, leading to improved performance on tasks that require reasoning, even without specific fine-tuning for those tasks.
The paper shows what the generated thoughts look like (figure not reproduced here); they don't look like great examples.
[2402.14083] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Meta 2024
Self-Taught Reasoner (STaR), 2022
Bootstraps from a few rationale-labeled examples to a much larger rationale-labeled dataset, by generating and filtering rationales for a dataset that has answers but no rationales
Technique
1. Start with a pre-trained language model and a dataset of problems with answers (but no rationales).
Example: Model: GPT-J; Dataset: CommonsenseQA with questions and correct answers
2. Provide a small set of few-shot examples with rationales to prompt the model.
Example few-shot prompt:
Q: What do people use to absorb extra ink from a fountain pen?
Answer Choices: (a) shirt pocket, (b) calligrapher's hand, (c) inkwell, (d) desk drawer, (e) blotter
A: The answer must be used to absorb extra ink. Blotters are designed to absorb liquids. Therefore, the answer is blotter (e).
3. Use the model to generate rationales and answers for all problems in the dataset.
Example:
Q: Where do you put your grapes just before checking out?
Answer Choices: (a) mouth, (b) grocery cart, (c) super market, (d) fruit basket, (e) fruit market
Generated rationale and answer: The answer must be a place where you put groceries before checking out. Grocery carts are used to hold items while shopping. Therefore, the answer is grocery cart (b).
4. Filter the generated rationales, keeping only those that led to correct answers.
Example: Keep the rationale from step 3 since it led to the correct answer (b).
5. For problems where the model generated incorrect answers, perform "rationalization":
a. Provide the correct answer as a hint.
b. Ask the model to generate a new rationale for the correct answer.
Example:
Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack, (b) substation, (c) television, (d) cabinet, (e) desk
Hint: The correct answer is (c) television.
Generated rationalization: The answer must require cable. Cable is used to provide satellite channels to televisions. Therefore, the answer is television (c).
6. Combine the filtered rationales from step 4 and the rationalizations from step 5 into a new dataset.
7. Fine-tune the original pre-trained model on this new dataset of questions, rationales, and answers.
8. Repeat steps 3-7 for multiple iterations, generating with the newly fine-tuned model each time (fine-tuning in step 7 always restarts from the original pre-trained model).
Example of improvement over iterations:
Iteration 1: Model solves 60% of problems correctly
Iteration 2: Model solves 65% of problems correctly
Iteration 3: Model solves 70% of problems correctly
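The whole loop, roughly, in code. generate, extract_answer, and finetune are hypothetical stand-ins for sampling from the model, parsing its final answer, and running a fine-tuning job; only the control flow (generate, filter by correctness, rationalize failures with a hint, fine-tune, repeat) follows the paper.

    def star(pretrained_model, dataset, few_shot_prompt, n_iterations=3):
        """dataset: iterable of (question, correct_answer) pairs; returns the final model."""
        model = pretrained_model
        for _ in range(n_iterations):
            train_examples = []
            for question, correct_answer in dataset:
                # Step 3: sample a rationale + answer using the few-shot prompt.
                rationale = generate(model, few_shot_prompt + question)        # hypothetical helper
                if extract_answer(rationale) == correct_answer:                # hypothetical helper
                    # Step 4: keep rationales that led to the correct answer.
                    train_examples.append((question, rationale))
                else:
                    # Step 5: rationalization -- retry with the correct answer given as a hint.
                    hinted = few_shot_prompt + question + f" (Hint: the correct answer is {correct_answer})"
                    rationale = generate(model, hinted)
                    if extract_answer(rationale) == correct_answer:
                        train_examples.append((question, rationale))
            # Steps 6-7: fine-tune from the *original* pre-trained model on the new dataset
            # (STaR restarts from the base model each iteration rather than stacking fine-tunes).
            model = finetune(pretrained_model, train_examples)                 # hypothetical helper
            # Step 8: the next iteration generates with this newly fine-tuned model.
        return model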
Natural question: what about rationalization-only? A reviewer asked this too, and the authors ran a new experiment:
Rationalization-only training on CQA reaches 69.2%, which is higher than rationale generation without rationalization (68.8%), but lower than the combination (72.5%).
The drop from rationalization-only training is not necessarily unexpected, as it is presumably easier for the model to produce the right answer with a bad rationale if it is told the correct answer in advance.
[2211.09066] Teaching Algorithmic Reasoning via In-context Learning
[2303.04910] Baldur: Whole-Proof Generation and Repair with Large Language Models
[2202.01344] Formal Mathematics Statement Curriculum Learning
https://arxiv.org/abs/2401.08967 ReFT: Reasoning with Reinforced Fine-Tuning
https://arxiv.org/abs/2401.00757 A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models
Self critique
Self refine
Self-consistency chain of thought
Tree of thought
Self-reflection
Least to most decomposition
Domain adaptation
Self improve