-
Decision transformer TODO
-
Reverse curriculum learning for reasoning, Fudan 2024
- Idea: requires expert demonstrations of reasoning; RL starts near the end of a demonstration and the start point is slid earlier each stage, so the model iteratively learns more of the trajectory (see the sketch after this entry)
- Paper
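- A minimal sketch of the staging idea as I understand it (my own pseudocode, not the authors' code; `rl_finetune` and `outcome_reward` are hypothetical helpers):

```python
def reverse_curriculum(model, problems, demos, rl_finetune, outcome_reward, n_stages=4):
    """problems: task prompts; demos: matching lists of expert reasoning steps."""
    for stage in range(1, n_stages + 1):
        prompts = []
        for problem, demo_steps in zip(problems, demos):
            keep = max(len(demo_steps) - stage, 0)    # slide the start point earlier each stage
            prefix = "".join(demo_steps[:keep])       # demonstration prefix handed to the model
            prompts.append(problem + "\n" + prefix)   # model must generate the remaining steps
        model = rl_finetune(model, prompts, reward_fn=outcome_reward)  # reward = final answer correct
    return model
```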
-
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
-
Idea: give GPT-4 the task, the correct answer, and failed step-by-step reasoning chains; have it generate a rubric of evaluation criteria, then apply that rubric to new reasoning chains
-
Example of how AutoRace might be used to evaluate reasoning chains for a mathematical reasoning task like GSM8k.
- Collect incorrect reasoning chains:
First, AutoRace would use a student LLM (e.g. Llama-2 70B) to generate reasoning chains for a sample of GSM8k problems. It would collect a set of chains that led to incorrect answers.
- Error detection:
GPT-4 is given the correct answer and analyzes these incorrect chains to identify specific errors, evaluating every step.
Question: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Incorrect chain:
Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. This means she uses 3 + 4 = 7 eggs every day. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. So she sells (16 - 7) * $2 = $6 worth of eggs every day. The answer is 6.
GPT-4 might identify the error: "The calculation of the final amount is incorrect. It should be (16 - 7) * $2 = $18, not $6."
-
Criteria summarization:
After analyzing several such examples, GPT-4 would generate a list of evaluation criteria, such as:
- Accuracy in Mathematical Operations
- Understanding the Problem Statement
- Correct Application of Mathematical Concepts
- Unit Conversion and Appropriateness
- Final Answer Relevance
- Logical Reasoning and Step-by-Step Explanation
-
Evaluation:
To evaluate a new reasoning chain, AutoRace prompts GPT-4 with the question, the student's reasoning chain, and the criteria list; GPT-4 then assesses the chain against each criterion (see the prompt sketch after this entry).
For example, given a new chain:
Question: Claire makes a 3 egg omelet every morning for breakfast. How many dozens of eggs will she eat in 4 weeks?
Chain: Claire makes a 3 egg omelet every morning. In one week she will eat 3 * 7 = 21 eggs. In 4 weeks she will eat 4 * 21 = 84 eggs. The answer is 84.
GPT-4 might evaluate:
- Step 1: accuracy: ✅, understanding: ✅, …
- Step 2: accuracy: ✅, understanding: ✅, …
- Step 3: accuracy: ✅, understanding: ❌ missed dozens, …
Based on this evaluation, AutoRace would classify this chain as INCORRECT.
-
Paper
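- A hedged sketch of the evaluation step above (prompt wording and the `call_gpt4` wrapper are my own, not taken from the paper):

```python
# Build one prompt containing the summarized criteria, the question, and the
# student's chain, and ask GPT-4 for per-criterion checks plus a final verdict.
CRITERIA = [
    "Accuracy in Mathematical Operations",
    "Understanding the Problem Statement",
    "Correct Application of Mathematical Concepts",
    "Unit Conversion and Appropriateness",
    "Final Answer Relevance",
    "Logical Reasoning and Step-by-Step Explanation",
]

def evaluate_chain(question: str, chain: str, call_gpt4) -> bool:
    """call_gpt4: any text-in/text-out wrapper around a GPT-4 completion call."""
    criteria_block = "\n".join(f"- {c}" for c in CRITERIA)
    prompt = (
        "Evaluate the reasoning chain below against each criterion, step by "
        "step, then end with a final verdict line: CORRECT or INCORRECT.\n\n"
        f"Criteria:\n{criteria_block}\n\n"
        f"Question: {question}\n\n"
        f"Reasoning chain:\n{chain}\n"
    )
    verdict = call_gpt4(prompt)
    last_line = verdict.strip().splitlines()[-1].upper()
    return "INCORRECT" not in last_line  # True => chain judged correct
```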
-
Efficient Reinforcement Learning via Large Language Model-based Search, 2024
- Idea: use a hand-coded, simplified RL env plus a hand-coded model-based "guide"/critic for valid vs. invalid actions (MEDIC) to get an LLM-generated plan, then turn that plan into dense rewards for matching (state, action) subsequences.
- Core Technique - MEDIC Framework:
a) Simplification: They start by creating a simpler, deterministic version of the original RL problem.
b) LLM Interaction: They use an LLM to try to solve this simplified problem by having it suggest actions step-by-step.
c) Feedback Mechanism: They create a "Model-based feedback critic" (MEDIC) that checks if the LLM's suggested actions are valid. If not, it provides feedback to the LLM to try again.
d) Guide Policy: Through this interaction, they generate a valid (though possibly suboptimal) plan for solving the simplified problem.
e) Reward Shaping: They use this plan to create a reward shaping function. Essentially, they assign small positive rewards to state-action pairs that align with the LLM-generated plan.
- Example
- Original Problem:
- The agent is in a 5x5 grid world.
- There's a key, a locked door, and a goal somewhere in the grid.
- The agent needs to pick up the key, unlock the door, and reach the goal.
- The environment is stochastic (e.g., actions might not always have the expected outcome).
- The reward is sparse (only given when reaching the goal).
- Simplified Problem for LLM:
- The same 5x5 grid, but now deterministic (actions always have the expected outcome).
- The LLM is given a text description of the grid state.
- LLM Interaction with MEDIC:
LLM: "Move forward"
MEDIC: "Invalid action. You're facing a wall. Valid actions are: turn left, turn right"
LLM: "Turn right"
MEDIC: "Valid action. New state: ..."
LLM: "Move forward"
MEDIC: "Valid action. You're now next to the key. ..."
... (this continues until a complete plan is formed)
- Resulting Guide Policy:
A sequence like: [turn right, move forward, pickup key, turn left, move forward, open door, move forward, move forward]
- Reward Shaping:
Now, when training the actual RL agent, in addition to the sparse goal reward, it gets small positive rewards for actions that align with this guide policy. For example:
- +0.1 for turning right at the start
- +0.2 for moving forward after that
- +0.3 for picking up the key when next to it
... and so on
- RL Training:
The RL agent is trained on the original stochastic environment, but with these additional shaped rewards. This helps guide the agent towards a potentially good solution, even before it manages to reach the goal and get the main reward.
"Given a valid plan for the relaxed deterministic problem generated by our MEDIC-augmented LLM framework, we assign uniformly increasing rewards to each (state, action) pair that is part of the plan. The total reward assigned to this plan is +1. During the RL training stage, we check for (state, action) pairs in the training buffer that correspond to those that are part of the MEDIC-augmented LLM-generated plan."
- This suggests they check for exact matches of (state, action) pairs from the guide plan in the RL agent's experience buffer. The "uniformly increasing" rewards imply the subsequence order is maintained, with later steps in the guide plan receiving higher rewards (see the sketch after this entry).
- Some questions I had
- Do authors manually create the simplified env? Yes
- Is MEDIC the same thing as the simplified env? No, it's a separate checker that can be queried without stepping the env.
- What's the point, if the env itself is cheap to query for valid/invalid actions? NOT CLEAR
- Is MEDIC learned or handcoded? Handcoded
- Paper
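- A sketch of my reading of the quoted reward-shaping step (not the authors' code; beyond "uniformly increasing" and "sums to +1", the exact weighting is an assumption):

```python
def plan_shaping_rewards(guide_plan):
    """guide_plan: ordered list of (state, action) pairs from the MEDIC-checked LLM plan."""
    weights = [i + 1 for i in range(len(guide_plan))]  # uniformly increasing: 1, 2, ..., n
    total = sum(weights)
    # Later plan steps get larger bonuses; the whole plan sums to +1.
    return {sa: w / total for sa, w in zip(guide_plan, weights)}

def shaped_reward(state, action, env_reward, bonus_table):
    # Exact (state, action) match against the guide plan, as in the quote above.
    return env_reward + bonus_table.get((state, action), 0.0)
```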
-
Divide and conquer MCTS for goal-directed planning, Parascandolo, Deepmind 2020
- Idea: Find intermediate goals, and solve those (simpler) problems, recursively.
- Need to try several intermediate goals. So it’s almost like a tree search over tree searches.
- Of course, the problem is that there are many possible intermediate goals (O(|S|), where S is the state space). So everything rests on whether you can learn a NN that predicts good candidate subgoals (sketch below).
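- A rough sketch of the recursion (my own pseudocode, not the paper's algorithm; `propose_subgoals` stands in for the learned network that predicts candidate intermediate goals, `low_level_plan` for a base planner that only handles short-horizon cases):

```python
def divide_and_conquer(start, goal, propose_subgoals, low_level_plan,
                       depth=0, max_depth=5):
    plan = low_level_plan(start, goal)              # try to solve the problem directly
    if plan is not None or depth >= max_depth:
        return plan
    for subgoal in propose_subgoals(start, goal):   # search over candidate splits
        first = divide_and_conquer(start, subgoal, propose_subgoals,
                                   low_level_plan, depth + 1, max_depth)
        if first is None:
            continue
        second = divide_and_conquer(subgoal, goal, propose_subgoals,
                                    low_level_plan, depth + 1, max_depth)
        if second is not None:
            return first + second                   # concatenate the two sub-plans
    return None
```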
-
Adversarial inverse RL, 2018
- From expert trajectories, you could be learning many possible reward functions
- Improve transfer to new envs by generating adversarial examples that ablate the
- Similar to MaxEnt IRL, where many potential policies can be learned
-
Curiosity-driven exploration by self-supervised prediction, 2017
- Problem: sparse rewards
- Train a model to predict the future: e.g. in VizDoom, if I take a step forward, what will happen / what will I see? If the prediction is wrong, that prediction error is the intrinsic reward: maximize (pixel) surprise. This pushes the algo to explore by itself.
- A naive implementation would simply predict next states $s_{t+1}$, and compare that with actual next state. Problem: there will be parts of the env that change pixels unexpectedly but which your actions have no influence over, such as the movement of some leaves. Algo will always be surprised/rewarded by the leaves.
- Instead, you want to predict and compare features of the state that depend on actions; learn these features with an encoder network $\phi$. An inverse model takes the current/next features and predicts the action that was taken; train the encoder end-to-end with this inverse model. Once you have $\phi$, train the forward model in feature space and use its prediction error as the intrinsic reward (see the sketch after this entry).
- Yannic
- Paper
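- A minimal PyTorch-style sketch of the idea (a paraphrase, not the authors' code; layer sizes and the detach are my simplifications):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=64):
        super().__init__()
        self.n_actions = n_actions
        self.phi = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.inverse = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def losses_and_bonus(self, s, s_next, a):
        f, f_next = self.phi(s), self.phi(s_next)
        # Inverse model: predict the action from (phi(s_t), phi(s_t+1)); its
        # gradient shapes phi to keep only action-relevant features.
        inv_loss = F.cross_entropy(self.inverse(torch.cat([f, f_next], dim=-1)), a)
        # Forward model in feature space; phi detached here for simplicity
        # (the paper instead weights the inverse and forward losses jointly).
        a_onehot = F.one_hot(a, self.n_actions).float()
        f_pred = self.forward_model(torch.cat([f.detach(), a_onehot], dim=-1))
        fwd_err = 0.5 * (f_pred - f_next.detach()).pow(2).sum(dim=-1)
        # fwd_err (detached) is the intrinsic "surprise" reward per transition.
        return inv_loss, fwd_err.mean(), fwd_err.detach()
```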
-
Hindsight experience replay, OpenAI 2017
-
Problem: sparse rewards
-
In RL, rewards are sparse. Insight: failure is a teacher. Adopt more goals to learn more about the world.
- Also, multi-task RL is similarly sample-inefficient.
-
Learn goal-conditioned policies, which generalize over the goal space:
-
Remember these goals
-
Algorithm: after each episode, sample additional goals (e.g. states actually achieved later in the episode), recompute the reward for each, and store the relabeled transitions in the replay buffer (see the sketch at the end of this entry)
-
https://www.youtube.com/watch?v=77xkqEAsHFI
-
http://chronos.isir.upmc.fr/~sigaud/teach/her.pdf
-
Note: still must learn good representations of state space
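- A sketch of the relabeling step (my simplification, roughly the paper's "future" strategy; `compute_reward` is the goal-conditioned reward, e.g. 1 if the goal was reached, else 0):

```python
import random

def her_relabel(episode, compute_reward, k=4):
    """episode: list of (state, action, next_state, goal) tuples from one rollout."""
    relabeled = []
    for t, (s, a, s_next, goal) in enumerate(episode):
        # Original transition with the original (likely unachieved) goal.
        relabeled.append((s, a, s_next, goal, compute_reward(s_next, goal)))
        # k extra copies whose goal is a state actually achieved later on.
        future = episode[t:]
        for _ in range(k):
            new_goal = random.choice(future)[2]   # an achieved next_state
            relabeled.append((s, a, s_next, new_goal, compute_reward(s_next, new_goal)))
    return relabeled  # push all of these into the replay buffer
```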
-
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions TODO
-
MuZero, Deepmind 2020
- MuZero (MZ) combines the high-performance MCTS planning of AlphaZero (AZ) with ideas from model-free RL: rather than being given the environment rules, it learns a model (predicting reward, value, and policy) and plans with it.
-
AlphaZero, Deepmind 2018
-
AlphaGo Zero, Deepmind 2017
-
AlphaGo, Deepmind 2015
-
Uses MCTS
-
Game background: 19x19 grid
-
3 policy networks, plus a value network
- Supervised (SL) policy network trained on a DB of human expert games (30M positions). CNN. Predicts the next move (classification).
- Rollout policy. Small linear model with some hand-crafted Go features. Also classification. Fast: ~2 µs per move (vs. ~3 ms for the SL network).
- RL policy network. Initialized from SL. Self-play trained with policy gradient methods.
- Value network predicts likelihood to win, given state. Trained with (state, game outcome) from the self-play games. Just MSE regression.
-
In the final MCTS engine, they found that using the SL policy network to guide the search worked better than the RL one (while still using the value network). See the sketch after this entry.
-
https://www.youtube.com/watch?v=Z1BELqFQZVM
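- A hedged sketch of how the pieces fit together in the search (helper names and the node/edge structure are mine; the prior-weighted selection and the value-net/rollout mixing follow the paper's formulation, with λ = 0.5):

```python
import math

C_PUCT = 5.0   # exploration constant (illustrative)
LAMBDA = 0.5   # mixing weight between value network and fast rollout

def select_action(node):
    """node.edges: dict action -> edge with fields P (SL-policy prior), N (visits), Q (mean value)."""
    total_n = sum(e.N for e in node.edges.values())
    def score(edge):
        u = C_PUCT * edge.P * math.sqrt(total_n) / (1 + edge.N)  # prior-weighted exploration bonus
        return edge.Q + u
    return max(node.edges, key=lambda a: score(node.edges[a]))

def evaluate_leaf(state, value_net, fast_rollout):
    # Leaf value = mix of the value network's prediction and a fast-rollout outcome.
    return (1 - LAMBDA) * value_net(state) + LAMBDA * fast_rollout(state)
```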
-
Learning hierarchy of actions with meta learning, OpenAI 2017
-
Reverse Curriculum Generation for Reinforcement Learning. Carlos Florensa, David Held, Markus Wulfmeier, Pieter Abbeel. Published at the Conference on Robot Learning (CoRL) 2017
-
Reinforcement Learning with Unsupervised Auxiliary Tasks. Google 2016.
-
Inverse RL, Ng 2000