GEPA notes (paper + Claude chat)
- Core insight: Instead of using scalar rewards to update weights (RL approach), GEPA analyzes complete execution traces as text - the prompts, outputs, errors, and feedback - which an LLM can reflect on to propose targeted improvements
- Reflection-based mutation: After running 3-5 problems, GEPA asks an LLM to analyze what went wrong and suggest specific prompt fixes (e.g., "failed because it didn't isolate variables" → add "use inverse operations")
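A minimal sketch of this reflective step, assuming a hypothetical `llm(prompt) -> str` completion helper (not GEPA's actual API) and traces kept as plain dicts:

```python
def reflect_and_rewrite(llm, current_instruction: str, traces: list[dict]) -> str:
    """Ask an LLM to diagnose recent failures and propose a revised
    instruction. Each trace holds the problem input, the model's output,
    and the evaluator's feedback, all as plain text."""
    trace_text = "\n\n".join(
        f"Input: {t['input']}\nOutput: {t['output']}\nFeedback: {t['feedback']}"
        for t in traces
    )
    reflection_prompt = (
        f"Current instruction:\n{current_instruction}\n\n"
        f"Recent executions:\n{trace_text}\n\n"
        "Diagnose what went wrong and rewrite the instruction to fix "
        "these failures. Return only the improved instruction."
    )
    return llm(reflection_prompt)
```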
- Maintains a pool of 5-10 candidate prompts: Unsuccessful mutations are discarded; only improvements are added to the pool, creating a small set of specialized variants.
- Test on minibatch first: New mutations are tested on a small minibatch (~3 problems) before full evaluation - only improvements get added to the candidate pool (gate sketched below)
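A sketch of that gate, with `score` as an assumed evaluator returning a candidate's mean score over a batch of problems:

```python
def maybe_add_to_pool(pool: list, parent, child, minibatch, score) -> bool:
    """Gate a new mutation: it must beat its parent on a small minibatch
    before it earns a spot in the candidate pool and a full evaluation."""
    if score(child, minibatch) > score(parent, minibatch):
        pool.append(child)   # improvement: keep for full evaluation
        return True
    return False             # regression or no gain: discard
```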
- Pareto frontier selection: Keeps prompts that excel at ANY subset of problems, even if just 5% - avoiding convergence to one mediocre generalist prompt.
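A sketch of this retention rule over an assumed per-candidate, per-problem score matrix (names are mine, not the paper's):

```python
def pareto_survivors(scores: dict[str, list[float]]) -> set[str]:
    """Keep every candidate that matches the best score on at least one
    problem: a specialist on even a single task stays in the pool."""
    n_problems = len(next(iter(scores.values())))
    best = [max(s[i] for s in scores.values()) for i in range(n_problems)]
    return {
        name
        for name, s in scores.items()
        if any(s[i] == best[i] for i in range(n_problems))
    }
```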
- [Does it prune? Per Claude, mostly not outright; survival is governed by the following (sampling sketch after this list):]
- Strict dominance relationships (rare but checked)
- Improvement requirements (must beat parent on minibatch)
- Weighted sampling by win frequency (not hard pruning, but influences survival)
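A sketch of that frontier-weighted sampling, using the same assumed score matrix ("wins" = problems where a candidate matches the best known score):

```python
import random

def sample_parent(scores: dict[str, list[float]]) -> str:
    """Pick the next candidate to mutate, weighted by how many problems
    it currently wins; frequent winners spawn more mutations."""
    n_problems = len(next(iter(scores.values())))
    best = [max(s[i] for s in scores.values()) for i in range(n_problems)]
    wins = {
        name: sum(s[i] == best[i] for i in range(n_problems))
        for name, s in scores.items()
    }
    names = list(wins)
    return random.choices(names, weights=[wins[m] for m in names])[0]
```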
- Evolution tree structure: Each prompt can spawn mutations, forming a tree where each node inherits and modifies its parent's instructions.
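A sketch of the tree node, assuming a pipeline is captured as one instruction string per module:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    # One instruction per pipeline module, e.g. {"solver": "...", "checker": "..."}
    module_prompts: dict[str, str]
    parent: Optional["Candidate"] = None  # lineage pointer; None for the seed prompt

def mutate(parent: Candidate, module: str, new_instruction: str) -> Candidate:
    """A child inherits all of its parent's instructions, overriding one module."""
    prompts = dict(parent.module_prompts)
    prompts[module] = new_instruction
    return Candidate(module_prompts=prompts, parent=parent)
```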
- Occasional merging: ~5 times per optimization run, GEPA can merge two successful branches that evolved different modules, combining their strengths.
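One plausible sketch of the merge, assuming both branches diverged from a common ancestor and each rewrote different modules; this is my guess at the mechanics, not the paper's exact rule:

```python
def merge_branches(
    base: dict[str, str],       # common ancestor's module prompts
    branch_a: dict[str, str],
    branch_b: dict[str, str],
) -> dict[str, str]:
    """Combine two branches that improved different modules: for each
    module, take whichever branch changed it (ties go to branch A)."""
    merged = {}
    for module in base:
        if branch_a[module] != base[module]:
            merged[module] = branch_a[module]   # branch A rewrote this module
        elif branch_b[module] != base[module]:
            merged[module] = branch_b[module]   # branch B rewrote this module
        else:
            merged[module] = base[module]       # untouched by both
    return merged
```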
- Sample efficiency: Reflecting on one failed problem can fix an entire class of errors ("needs to check units" fixes ALL unit errors), versus RL needing hundreds of examples to learn the same lesson




