  • ARC: grade-school science questions (AI2 Reasoning Challenge)
  • HellaSwag: commonsense sentence completion
  • MMLU: 57 subjects spanning elementary math, US history, CS, law, etc.
  • GSM8K: grade-school math word problems
  • TruthfulQA: does the model repeat common online falsehoods?
  • HumanEval: Python coding problems scored by unit tests (TODO; see the pass@k sketch below)
  • EvalPlus: HumanEval with many more test cases, for more rigorous scoring of generated code
  • HumanEvalPack: expands HumanEval to 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust)
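
HumanEval-family benchmarks report pass@k: the probability that at least one of k sampled completions for a problem passes its unit tests. A minimal sketch of the unbiased estimator from the HumanEval (Codex) paper; the sample counts in the example are made up:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval (Codex) paper.

    n: completions sampled for the problem
    c: completions that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples, 37 passing: estimated pass@10 (made-up numbers)
print(pass_at_k(n=200, c=37, k=10))
```

Averaging this over all problems gives the benchmark's headline score.
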
  • Meta evals
    • HELM (Stanford): includes MMLU, GSM8K, and others; see the mean-win-rate sketch below
    • BIG-bench (Google)
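
HELM's headline number is a mean win rate: per scenario, the fraction of other models a model beats on that scenario's metric, averaged over scenarios. A minimal sketch of that aggregation; the model names and scores are placeholders:

```python
def mean_win_rate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[scenario][model] = metric value (higher is better).
    Per scenario, win rate = fraction of other models beaten;
    the result averages that over scenarios (HELM-style)."""
    models = {m for per_scenario in scores.values() for m in per_scenario}
    totals = {m: 0.0 for m in models}
    for per_scenario in scores.values():
        for m in models:
            others = [o for o in models if o != m]
            wins = sum(per_scenario[m] > per_scenario[o] for o in others)
            totals[m] += wins / len(others)
    return {m: total / len(scores) for m, total in totals.items()}

# placeholder accuracies for two scenarios
scores = {
    "mmlu":  {"model_a": 0.62, "model_b": 0.58, "model_c": 0.40},
    "gsm8k": {"model_a": 0.35, "model_b": 0.51, "model_c": 0.20},
}
print(mean_win_rate(scores))  # model_a and model_b both 0.75, model_c 0.0
```
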
  • Findings
    • On Stack Overflow questions: 52% of ChatGPT’s answers contain inaccuracies and 77% are verbose (source)
    • Models forget the middle of long contexts (“Lost in the Middle”): https://arxiv.org/pdf/2307.03172.pdf; see the probe sketch below
    • Self-repair only slightly improves performance (paper)
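
The long-context finding is easy to probe in miniature: bury one fact at varying depths in filler text and check whether the model recalls it. A sketch assuming only a generate(prompt) -> str callable for whatever model is under test; the needle, filler, and depths are made up:

```python
def make_probe(depth: float, n_filler: int = 200) -> tuple[str, str]:
    """Build a long context with one relevant fact inserted at `depth`
    (0.0 = start, 1.0 = end) among filler lines, plus a question."""
    needle = "The secret code is 7412."  # made-up fact to retrieve
    filler = ["Nothing interesting happens in this line."] * n_filler
    filler.insert(int(depth * n_filler), needle)
    prompt = "\n".join(filler) + "\n\nQuestion: What is the secret code?"
    return prompt, "7412"

def position_recall(generate, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Check recall of the buried fact at each depth; the paper predicts
    a U-shape, with the worst recall when the fact sits mid-context.
    `generate` is an assumed callable: prompt -> model completion.
    A real measurement would average many randomized trials per depth."""
    results = {}
    for d in depths:
        prompt, answer = make_probe(d)
        results[d] = answer in generate(prompt)
    return results
```
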
  • Leaderboards
    • https://chat.lmsys.org/ (blind pairwise tests; see the Elo sketch below)
    • https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
    • https://llm-leaderboard.streamlit.app/ (old)
    • https://crfm.stanford.edu/helm/latest/#/leaderboard
    • Big Code Models Leaderboard (Hugging Face Space by bigcode)
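
On how the lmsys blind votes become a ranking: Chatbot Arena's early leaderboards computed Elo ratings from pairwise vote records (it later moved to a Bradley-Terry fit). A minimal Elo sketch over made-up votes:

```python
def elo_update(ratings: dict[str, float], winner: str, loser: str,
               k: float = 32.0) -> None:
    """Standard Elo update after one pairwise comparison."""
    ra, rb = ratings[winner], ratings[loser]
    expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))  # P(winner wins)
    ratings[winner] = ra + k * (1.0 - expected)
    ratings[loser] = rb - k * (1.0 - expected)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:  # made-up blind-vote records: (winner, loser)
    elo_update(ratings, winner, loser)
print(ratings)
```
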