See also Agent reasoning, Reinforcement learning research, Reinforcement learning foundations
References
Concepts
Datasets/benchmarks
Teaching Large Language Models to Reason with Reinforcement Learning, Meta 2024
AlphaGeometry, Google 2023
Let's Verify Step by Step, OpenAI 2023 (arXiv:2305.20050)
Minerva: Solving Quantitative Reasoning Problems with Language Models – Google Research Blog, Google 2022
Formal Mathematics Statement Curriculum Learning, Polu, OpenAI 2022
Problem: theorem proving isn't naturally amenable to self-play, and self-play's main value is generating an unsupervised curriculum.
Idea: use “expert iteration” to iteratively build better models that can generate harder and harder proofs (see the sketch after this list).
Generate Lean proofs
Each iteration, you retrain from scratch, including all the latest proofs generated by your previous best model
Start by training on mathlib. Throw away the human proofs and generate new proofs for the same statements; the model doesn't overfit and produces a diversity of new proofs for the same statements (left chart in the paper). Then apply the loop to miniF2F (olympiad-level problems).
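A minimal sketch of the expert-iteration loop as described in these notes. The names `sample_proofs`, `lean_check`, and `train_from_scratch` are hypothetical placeholders for the prover's sampling call, the Lean proof checker, and the training pipeline, not the paper's actual code.

```python
def expert_iteration(statements, seed_pairs, num_iterations=8, attempts=16):
    """Iteratively grow a set of verified proofs and retrain on it from scratch."""
    verified = set(seed_pairs)              # (statement, proof) pairs, e.g. from mathlib
    model = train_from_scratch(verified)    # initial prover trained on the seed data
    for _ in range(num_iterations):
        for stmt in statements:
            # Sample several candidate proofs with the current best model.
            for proof in sample_proofs(model, stmt, n=attempts):
                # Keep only proofs the Lean checker accepts; multiple distinct
                # proofs of the same statement are kept, which adds diversity.
                if lean_check(stmt, proof):
                    verified.add((stmt, proof))
        # Retrain from scratch on everything verified so far, including the
        # proofs generated by the previous best model.
        model = train_from_scratch(verified)
    return model, verified
```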
Generative Language Modeling for Automated Theorem Proving, Polu, OpenAI 2020
Generates Metamath proofs, focused on generating original terms
Transformer that predicts the next proof step from the current “state” (sketch after this list)
Performance improves with pretraining on general web text
Doubling the number of proof attempts gives a roughly linear increase in performance
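A hedged sketch of the state-to-next-step loop and the attempt budget described above. `is_closed`, `propose_step`, and `apply_step` are hypothetical placeholders for the goal check, the transformer's step proposal, and the Metamath kernel, not the paper's actual interface.

```python
def attempt_proof(model, goal_state, max_depth=32):
    """One proof attempt: repeatedly predict the next step from the current state."""
    state = goal_state
    for _ in range(max_depth):
        if is_closed(state):                  # no remaining goals -> proof found
            return True
        step = propose_step(model, state)     # transformer conditioned on the current state
        state = apply_step(state, step)       # kernel applies the step, or None if illegal
        if state is None:
            return False
    return False

def prove_with_budget(model, goal_state, attempts=64):
    """Success rate grows roughly linearly each time the attempt budget doubles."""
    return any(attempt_proof(model, goal_state) for _ in range(attempts))
```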
Training Verifiers to Solve Math Word Problems, Cobbe, OpenAI 2021
Introduces GSM8K. Trains a verifier and uses it to rerank the k sampled solutions, improving LLM performance on GSM8K (see the sketch at the end of this list).
Compared fine-tuning alone vs. fine-tuning + verifier
To build the verifier: label sampled solutions as correct/incorrect by their final answer, then train on those labels
Interestingly, a token-level verifier (predicting correctness at every token) works better than a whole-solution-level one
A fault-aware ranking model trained to predict errors can significantly improve “ranked pass@1”, i.e. whether a working solution lands in first rank (paper)
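A hedged sketch of verifier-based reranking (“ranked pass@1” / best-of-k), as described in the notes above. `sample_solutions` and `verifier_score` are hypothetical placeholders for the fine-tuned generator and the trained verifier; a token-level verifier would score every prefix rather than only the full solution.

```python
def ranked_pass_at_1(problem, k=100):
    """Sample k candidate solutions, rank by verifier score, return the top-ranked one."""
    candidates = sample_solutions(problem, n=k)            # samples from the fine-tuned LLM
    scored = [(verifier_score(problem, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)    # highest verifier score first
    return scored[0][1]                                    # the solution in first rank
```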