- ARC: grade-school science questions
- HellaSwag: commonsense inference (sentence completion)
- MMLU: elementary math, US history, CS, law, etc.
- GSM8K: grade-school math word problems
- TruthfulQA: whether the model repeats common online falsehoods
- HumanEval: coding (TODO)
- EvalPlus: more rigorous coding evaluation (adds many extra test cases to HumanEval)
- HumanEvalPack: expands HumanEval to 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust)
- Meta evals
- HELM (Stanford): includes MMLU, GSM8K, and others
- BIG-bench (Google)
- Findings
- Leaderboards