• See also Reasoning, Reinforcement learning research, Reinforcement learning foundations

  • References

    • Szegedy formal reasoning
  • Concepts

    • See Theorem provers
    • Challenges
      • Infinite action space
        • Must create terms (AlphaGeometry)
      • Not naturally able to self-play
      • Long term planning for complicated mathematical deductions
      • Retrieval from large, growing, dynamic databases of facts - must select existing statements to apply
      • Generating useful subexpressions from experience
      • Experience-based conjecturing
      • Synthesizing useful definitions and formalizing concepts
      • Finding hidden far-away connections between different areas.
      • Understanding deep mathematical concepts from natural language input
  • Datasets/benchmarks

    • MiniF2F, OpenAI 2022: olympiad problems formalized for multiple proof systems
      • A train/test split on a standard library like mathlib isn’t comparable across proof systems. MiniF2F still isn’t a completely fair comparison either, since different systems come with different levels of infrastructure, but it’s better.
    • Llemma, 2023
      • Code Llama continued-pretrained on Proof-Pile-2
      • Beats Minerva
  • Teaching Large Language Models to Reason with Reinforcement Learning, Meta 2024

    • Compares expert iteration, PPO, and return-conditioned RL; all perform similarly, with expert iteration usually best (return-conditioned variant sketched below).
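
      A hedged sketch of the return-conditioned variant (the control tokens and function names are mine, not the paper’s): training examples are prefixed with a token encoding the observed reward, and inference conditions on the high-reward token.

      ```python
      # Sketch of return-conditioned RL for reasoning (assumed formatting, not the paper's code).
      GOOD, BAD = "<|reward=1|>", "<|reward=0|>"  # assumed control tokens

      def make_training_example(question: str, solution: str, is_correct: bool) -> str:
          # Train: prefix each sampled solution with a token encoding its observed reward.
          return f"{GOOD if is_correct else BAD} Q: {question}\nA: {solution}"

      def make_inference_prompt(question: str) -> str:
          # Infer: always condition on the high-reward token.
          return f"{GOOD} Q: {question}\nA:"

      if __name__ == "__main__":
          print(make_training_example("2+2?", "4", True))
          print(make_inference_prompt("3+5?"))
      ```
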
  • AlphaGeometry, Google 2024

  • Let’s Verify Step by Step, OpenAI 2023 (arXiv:2305.20050)

    • Focuses just on training reward models: outcome-supervised vs. process-supervised (ORM vs. PRM). Because of that focus, doesn’t do RL updates to the generator; instead, generates many samples and uses the reward model to pick the correct one (best-of-n, sketched below).
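
      A minimal sketch of best-of-n reranking with a process reward model (PRM). `generate` and `score_step` are hypothetical stand-ins for the sampler and the PRM; the product-of-step-probabilities aggregation is one common choice and may differ from the paper’s exact scoring.

      ```python
      from typing import Callable, List

      def best_of_n_with_prm(
          question: str,
          generate: Callable[[str], List[str]],     # hypothetical: one solution as a list of steps
          score_step: Callable[[str, str], float],  # hypothetical PRM: P(step is correct)
          n: int = 16,
      ) -> List[str]:
          """Sample n solutions, score each step with the PRM, keep the best solution."""
          best_steps, best_score = [], float("-inf")
          for _ in range(n):
              steps = generate(question)
              score = 1.0
              for step in steps:
                  score *= score_step(question, step)  # one low-confidence step tanks the solution
              if score > best_score:
                  best_steps, best_score = steps, score
          return best_steps
      ```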


  • Minerva: Solving Quantitative Reasoning Problems with Language Models – Google Research Blog, Google 2022

    • Collects training data relevant to quantitative reasoning problems
    • A pretrained PaLM fine-tuned on a maths corpus of maths webpages and arXiv papers (38.5B tokens, with LaTeX preserved); evaluated on:
      • MATH
      • MMLU-STEM
      • GSM8k
    • Minerva combines several techniques, including few-shot prompting, chain-of-thought (scratchpad) prompting, and majority voting (sketched below), to achieve state-of-the-art performance on STEM reasoning tasks
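
      A sketch of the majority-voting piece (self-consistency): sample several chain-of-thought solutions and return the most common final answer. `sample_solution` and `extract_answer` are placeholders for the model call and the answer parser.

      ```python
      from collections import Counter
      from typing import Callable

      def majority_vote(
          question: str,
          sample_solution: Callable[[str], str],  # hypothetical: one sampled CoT solution
          extract_answer: Callable[[str], str],   # hypothetical: parse out the final answer
          k: int = 32,
      ) -> str:
          """Return the most frequent final answer across k sampled solutions."""
          answers = Counter(extract_answer(sample_solution(question)) for _ in range(k))
          return answers.most_common(1)[0][0]
      ```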


  • Formal Mathematics Statement Curriculum Learning, Polu, OpenAI 2022

    • Problem: theorem proving is not naturally able to self-play, and self-play’s main value is generating an unsupervised curriculum.

    • Idea: do “expert iteration” to iteratively build up better and better models that can generate harder and harder proofs (loop sketched after this list).

      • Generate Lean proofs

      • Each iteration, retrain from scratch, including all the latest proofs generated by the previous best model

      • Start by training on mathlib. Throw away the proofs and generate new proofs for the same statements; this doesn’t overfit, and it yields a diversity of new proofs per statement. Then apply the same loop to miniF2F (olympiad).


    • https://www.youtube.com/watch?v=FVnPizZUA74
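
      A hedged sketch of that loop; `sample_proof`, `check`, and `train_from_scratch` are hypothetical stand-ins for the LM sampler, the Lean proof checker, and retraining.

      ```python
      def expert_iteration(statements, base_data, sample_proof, check, train_from_scratch,
                           rounds: int = 8, attempts: int = 4):
          """Iteratively retrain on verified proofs generated by the previous model.

          The proof checker is the ground-truth filter, which stands in for the
          opponent that self-play would normally provide.
          """
          found = {}  # statement -> set of distinct verified proofs
          model = train_from_scratch(base_data)
          for _ in range(rounds):
              for stmt in statements:
                  for _ in range(attempts):
                      proof = sample_proof(model, stmt)
                      if check(stmt, proof):  # keep only machine-verified proofs
                          found.setdefault(stmt, set()).add(proof)
              extra = [(s, p) for s, proofs in found.items() for p in proofs]
              model = train_from_scratch(base_data + extra)  # retrain from scratch each round
          return model, found
      ```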

  • Generative Language Modeling for Automated Theorem Proving, Polu, OpenAI 2020

    • Generate Metamath proofs - focused on generating original terms

    • A transformer that predicts the next proof step from the current proof “state” (used as sketched below)

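      Roughly how such a step model gets used inside a proof search (the GOAL … PROOFSTEP prompt format is an assumption, modeled on OpenAI’s later Lean work; `complete` is a placeholder LM call):

      ```python
      from typing import Callable, List

      PROMPT = "GOAL {state} PROOFSTEP"  # assumed format, not verbatim from the paper

      def propose_steps(state: str,
                        complete: Callable[[str, int], List[str]],  # hypothetical LM sampler
                        k: int = 8) -> List[str]:
          """Sample k candidate next proof steps for the current proof state.

          A best-first search over proof states would call this at each node,
          expanding the most promising states first.
          """
          return complete(PROMPT.format(state=state), k)
      ```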

    • Perf improves from pretraining on general web text


    • Doubling the number of attempts increases perf roughly linearly (i.e., perf grows logarithmically in the raw number of attempts)


    • https://www.youtube.com/watch?v=FVnPizZUA74

  • Training Verifiers to Solve Math Word Problems, Cobbe, OpenAI 2021

    • Introduces GSM8K. Trains a verifier and uses it to rank sampled solutions (best-of-n), improving LLM perf on GSM8K.

    • Compares fine-tuning alone vs. fine-tuning + verifier

    • To build the verifier: label sampled solutions as correct/incorrect by whether the final answer matches ground truth (sketched below)


    • Interestingly, a verifier trained to predict correctness at every token beats one trained only at the whole-solution level
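
      A sketch of building the verifier’s training data under that labeling scheme; `sample_solution` and `extract_answer` are hypothetical placeholders.

      ```python
      from typing import Callable, List, Tuple

      def build_verifier_data(
          problems: List[Tuple[str, str]],        # (question, gold final answer)
          sample_solution: Callable[[str], str],  # hypothetical generator call
          extract_answer: Callable[[str], str],   # hypothetical answer parser
          samples_per_problem: int = 100,
      ) -> List[Tuple[str, str, int]]:
          """Label each sampled solution 1/0 by final-answer match.

          A token-level verifier is then trained to predict this label at every
          token of the solution, not just at the end.
          """
          data = []
          for question, gold in problems:
              for _ in range(samples_per_problem):
                  solution = sample_solution(question)
                  data.append((question, solution, int(extract_answer(solution) == gold)))
          return data
      ```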

  • A fault-aware ranking model trained to predict errors can significantly improve “ranked pass@1”, where a working solution is ranked first (paper; metric sketched below)
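
    One hedged reading of how “ranked pass@1” is computed (names are mine):

    ```python
    from typing import Sequence

    def ranked_pass_at_1(scores: Sequence[Sequence[float]],
                         correct: Sequence[Sequence[bool]]) -> float:
        """Fraction of problems whose top-ranked sample is actually correct.

        scores[i][j] is the ranker's score for sample j of problem i;
        correct[i][j] says whether that sample actually works.
        """
        hits = 0
        for s, c in zip(scores, correct):
            top = max(range(len(s)), key=s.__getitem__)  # highest-scored sample
            hits += c[top]
        return hits / len(scores)
    ```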
