  • ARC: grade-school science questions (AI2 Reasoning Challenge)
  • HellaSwag: commonsense sentence completion
  • MMLU: 57 subjects spanning elementary math, US history, CS, law, etc.
  • GSM8K: grade-school math word problems
  • TruthfulQA: does the model repeat common online falsehoods?
  • HumanEval: Python coding problems scored by unit tests (TODO; see the pass@k sketch below)
  • EvalPlus: HumanEval with many more test cases, for more rigorous scoring of generated code
  • HumanEvalPack: expands HumanEval to 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust)
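
HumanEval-family benchmarks report pass@k: the probability that at least one of k sampled completions for a problem passes its unit tests. A minimal sketch of the unbiased estimator from the HumanEval (Codex) paper; the sample counts in the example are made up:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval (Codex) paper.

    n: completions sampled for the problem
    c: completions that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples, 37 passing: estimated pass@10 (made-up numbers)
print(pass_at_k(n=200, c=37, k=10))
```

Averaging this over all problems gives the benchmark's headline score.
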
  • Meta evals
    • HELM (Stanford): includes MMLU, GSM8K, and others; see the mean-win-rate sketch below
    • BIG-bench (Google)
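
HELM's headline number is a mean win rate: per scenario, the fraction of other models a model beats on that scenario's metric, averaged over scenarios. A minimal sketch of that aggregation; the model names and scores are placeholders:

```python
def mean_win_rate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[scenario][model] = metric value (higher is better).
    Per scenario, win rate = fraction of other models beaten;
    the result averages that over scenarios (HELM-style)."""
    models = {m for per_scenario in scores.values() for m in per_scenario}
    totals = {m: 0.0 for m in models}
    for per_scenario in scores.values():
        for m in models:
            others = [o for o in models if o != m]
            wins = sum(per_scenario[m] > per_scenario[o] for o in others)
            totals[m] += wins / len(others)
    return {m: total / len(scores) for m, total in totals.items()}

# placeholder accuracies for two scenarios
scores = {
    "mmlu":  {"model_a": 0.62, "model_b": 0.58, "model_c": 0.40},
    "gsm8k": {"model_a": 0.35, "model_b": 0.51, "model_c": 0.20},
}
print(mean_win_rate(scores))  # model_a and model_b both 0.75, model_c 0.0
```
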
  • Findings
    • On Stack Overflow questions: 52% of ChatGPT’s answers contain inaccuracies and 77% are verbose (source)
    • Models forget the middle of long contexts (“Lost in the Middle”): https://arxiv.org/pdf/2307.03172.pdf; see the probe sketch below
    • Self-repair only slightly improves performance (paper)
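
The long-context finding is easy to probe in miniature: bury one fact at varying depths in filler text and check whether the model recalls it. A sketch assuming only a generate(prompt) -> str callable for whatever model is under test; the needle, filler, and depths are made up:

```python
def make_probe(depth: float, n_filler: int = 200) -> tuple[str, str]:
    """Build a long context with one relevant fact inserted at `depth`
    (0.0 = start, 1.0 = end) among filler lines, plus a question."""
    needle = "The secret code is 7412."  # made-up fact to retrieve
    filler = ["Nothing interesting happens in this line."] * n_filler
    filler.insert(int(depth * n_filler), needle)
    prompt = "\n".join(filler) + "\n\nQuestion: What is the secret code?"
    return prompt, "7412"

def position_recall(generate, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Check recall of the buried fact at each depth; the paper predicts
    a U-shape, with the worst recall when the fact sits mid-context.
    `generate` is an assumed callable: prompt -> model completion.
    A real measurement would average many randomized trials per depth."""
    results = {}
    for d in depths:
        prompt, answer = make_probe(d)
        results[d] = answer in generate(prompt)
    return results
```
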
  • Leaderboards
    • https://chat.lmsys.org/ (blind pairwise tests; see the Elo sketch below)
    • https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
    • https://llm-leaderboard.streamlit.app/ (old)
    • https://crfm.stanford.edu/helm/latest/#/leaderboard
    • Big Code Models Leaderboard (Hugging Face Space by bigcode)
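
On how the lmsys blind votes become a ranking: Chatbot Arena's early leaderboards computed Elo ratings from pairwise vote records (it later moved to a Bradley-Terry fit). A minimal Elo sketch over made-up votes:

```python
def elo_update(ratings: dict[str, float], winner: str, loser: str,
               k: float = 32.0) -> None:
    """Standard Elo update after one pairwise comparison."""
    ra, rb = ratings[winner], ratings[loser]
    expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))  # P(winner wins)
    ratings[winner] = ra + k * (1.0 - expected)
    ratings[loser] = rb - k * (1.0 - expected)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:  # made-up blind-vote records: (winner, loser)
    elo_update(ratings, winner, loser)
print(ratings)
```
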