AGI benchmarks explained
Every capability claim in this industry is backed by a benchmark. Here is what each of the ten most-cited ones actually measures — and where they fail.
A benchmark is a fixed dataset of problems with a defined scoring rule. A score on MMLU is not a score on 'intelligence'; it is a score on multiple-choice questions drawn from 57 specific subjects. Knowing the texture of each benchmark is the difference between reading a model release with curiosity and reading it with judgment.
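To make "a fixed dataset with a defined scoring rule" concrete, here is a minimal Python sketch. The `Item` type, the toy questions, and the function names are hypothetical and are not taken from any benchmark's official harness. It assumes multiple-choice items scored by exact-match accuracy.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One multiple-choice problem (hypothetical structure, for illustration)."""
    question: str
    choices: tuple[str, ...]
    answer: int  # index of the correct choice


# A toy "fixed dataset": every model is scored on exactly these items.
DATASET = (
    Item("2 + 2 = ?", ("3", "4", "5", "6"), 1),
    Item("Capital of France?", ("Paris", "Rome", "Oslo", "Bern"), 0),
    Item("H2O is commonly called?", ("salt", "air", "water", "sand"), 2),
    Item("Largest planet in the Solar System?", ("Mars", "Venus", "Earth", "Jupiter"), 3),
)


def accuracy(predictions: list[int], dataset: tuple[Item, ...] = DATASET) -> float:
    """Scoring rule: fraction of items whose predicted index is correct."""
    if len(predictions) != len(dataset):
        raise ValueError("need exactly one prediction per item")
    correct = sum(p == item.answer for p, item in zip(predictions, dataset))
    return correct / len(dataset)


def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_choices


if __name__ == "__main__":
    print(accuracy([1, 0, 2, 0]))  # three of the four answers are correct
    print(chance_accuracy(4))      # baseline for a four-choice format
    print(chance_accuracy(10))     # baseline for a ten-choice format
```

The resulting number only has meaning relative to the dataset and the rule. On a ten-choice format random guessing scores 10%, against 25% on a four-choice format, so the same headline accuracy can represent very different amounts of skill.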
- 01
MMLU
Multidisciplinary knowledge
Massive Multitask Language Understanding
57 subjects spanning STEM, humanities, and professional fields. Long the default headline benchmark for LLM general knowledge.
State of the art. Effectively saturated: frontier models score above 90%.
- 02
MMLU-Pro
Harder multidisciplinary
A harder, reasoning-focused replacement for MMLU with ten answer choices and graduate-level questions.
State of the art. Active leaderboard; top scores are climbing rapidly.
- 03
GPQA
Hard sciences
Graduate-level Google-Proof Q&A
Graduate-level biology, physics, and chemistry questions designed to be unsearchable on the open web.
State of the art. Frontier reasoning models now exceed PhD-level human accuracy on the diamond subset.
- 04
ARC-AGI
Abstract reasoning
François Chollet's grid-puzzle benchmark designed to resist memorisation and measure fluid intelligence.
State of the art. ARC-AGI-2 remains a major open benchmark for AGI-style generalisation.
- 05
HumanEval
Code generation
164 Python programming problems with unit tests, introduced alongside OpenAI Codex.
State of the art. Largely saturated; superseded by SWE-Bench-style benchmarks.
- 06
SWE-Bench Verified
Real-world software engineering
Human-verified subset of real GitHub issues from popular Python repositories; the model must produce a patch that passes the project's own tests.
State of the art. Frontier coding agents now resolve a majority of verified tasks.
- 07
FrontierMath
Research-level mathematics
A benchmark of original research-level problems built with the help of leading mathematicians; explicitly designed to resist memorisation.
State of the art. Best models still solve well under half of the problems.
- 08
Humanity's Last Exam (HLE)
Expert-level breadth
Closed-ended exam of thousands of expert-written questions, intended as the last broad knowledge benchmark needed before benchmarks of this kind saturate.
State of the art. Frontier reasoning models answer roughly a quarter of the questions correctly.
- 09
MLE-Bench
ML engineering
OpenAI benchmark in which models attempt full Kaggle competitions end-to-end.
State of the art. Agentic systems earn medals on a meaningful fraction of competitions.
- 10
GAIA
General AI assistants
Real-world assistant tasks requiring tool use, web browsing, and multi-step reasoning.
State of the art. Widely treated as the de facto evaluation suite for agentic assistants.
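How to use this list: when a lab announces a new state of the art, check which benchmark before you check by how much. A jump on FrontierMath is far harder to game than a jump on MMLU: FrontierMath's problems are original and built to resist memorisation, while MMLU is already saturated, so a further gain there says little.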