AGI benchmarks explained

Every capability claim in this industry is backed by a benchmark. Here is what each of the ten most-cited ones actually measures, and where it falls short.

A benchmark is a fixed dataset of problems with a defined scoring rule. A score on MMLU is not a score on 'intelligence'; it is a score on multiple-choice questions drawn from 57 specific subjects, all written by humans. Knowing the texture of each benchmark is the difference between reading a model release with curiosity and reading it with judgment.
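
To make that definition concrete, here is a minimal Python sketch of a benchmark as a fixed list of items plus an exact-match accuracy rule. The `Item` type, the `score` function, and the toy questions are illustrative assumptions, not the data format or harness of any real benchmark.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    """One benchmark problem: a prompt, its answer choices, and the gold label."""
    question: str
    choices: tuple[str, ...]
    answer: int  # index of the correct choice


def score(items: Sequence[Item], predict: Callable[[Item], int]) -> float:
    """Scoring rule: fraction of items where the predicted index equals the gold index."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


# Toy dataset (hypothetical questions, for illustration only).
ITEMS = [
    Item("2 + 2 = ?", ("3", "4", "5", "6"), 1),
    Item("Capital of France?", ("Paris", "Rome", "Oslo", "Bern"), 0),
    Item("H2O is commonly called?", ("salt", "sand", "water", "air"), 2),
    Item("Largest planet?", ("Mars", "Venus", "Earth", "Jupiter"), 3),
]


def always_first(item: Item) -> int:
    """A trivial stand-in for a model: always pick the first choice."""
    return 0


accuracy = score(ITEMS, always_first)
print(f"accuracy = {accuracy:.2f}")
```

The point of the sketch is that the number a lab reports depends on exactly two things: which items are in the list and how `score` counts them. Change either and the headline figure changes, even for the same model.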

01
MMLU
Multidisciplinary knowledge
Massive Multitask Language Understanding
Four-option multiple-choice questions across 57 subjects spanning STEM, the humanities, the social sciences, and professional fields. Long the default headline benchmark for LLM general knowledge.
State of the art. Saturated above 90% by frontier models.
02
MMLU-Pro
Harder multidisciplinary
A harder, reasoning-focused successor to MMLU with graduate-level questions and ten answer choices per question instead of four, which lowers the random-guess baseline from 25% to 10%.
State of the art. Active leaderboard; top scores are climbing rapidly.
03
GPQA
Hard sciences
Graduate-Level Google-Proof Q&A
Graduate-level biology, physics, and chemistry questions written so that non-experts cannot answer them reliably even with unrestricted web search.
State of the art. Frontier reasoning models now exceed the accuracy of PhD-level domain experts on the Diamond subset.
04
ARC-AGI
Abstract reasoning
François Chollet's grid-puzzle benchmark, designed to resist memorisation and measure fluid intelligence.
State of the art. ARC-AGI-2 remains a major open benchmark for AGI-style generalisation.
05
HumanEval
Code generation
164 Python programming problems with unit tests, introduced alongside OpenAI Codex.
State of the art. Largely saturated; superseded by SWE-Bench-style benchmarks.
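
As a rough illustration of what "problems with unit tests" means for scoring, the sketch below runs sampled completions against assertions and then applies the pass@k estimator reported in the Codex paper, 1 - C(n - c, k) / C(n, k). The toy problem, the sample completions, and the helper names `passes_tests` and `pass_at_k` are assumptions made for this example; this is not the official evaluation harness, which runs code in a sandbox.

```python
from math import comb


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and every assertion in test_src holds.

    NOTE: exec() on model output is unsafe outside a sandbox; this is a toy.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the unit tests against it
    except Exception:
        return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n = samples generated, c = samples that passed, k = sample budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: two sampled completions, one correct and one buggy.
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
SAMPLES = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
]

results = [passes_tests(src, TESTS) for src in SAMPLES]
n, c = len(results), sum(results)
print(results, pass_at_k(n, c, k=1))
```

A completion either passes every assertion or it does not; there is no partial credit. That all-or-nothing rule is part of why the benchmark was a clean signal early on and why scores bunch near the ceiling once models become reliably good at short functions.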
06
SWE-Bench Verified
Real-world software engineering
Human-verified subset of real GitHub issues from popular Python repositories; the model must produce a patch that passes the project's own tests.
State of the art. Frontier coding agents now resolve a majority of verified tasks.
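
The "passes the project's own tests" criterion can be sketched as follows. The sketch assumes each task distinguishes tests that failed before the fix from tests that already passed; the `TaskResult` structure, the field names, and the test names are hypothetical and chosen for illustration, not taken from the benchmark's harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes after applying a model's patch (hypothetical structure)."""
    fail_to_pass: dict[str, bool]  # tests failing before the fix; True = now passing
    pass_to_pass: dict[str, bool]  # tests passing before the fix; True = still passing


def is_resolved(result: TaskResult) -> bool:
    """Resolved only if every targeted test now passes and nothing regressed."""
    fixed = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    no_regressions = all(result.pass_to_pass.values())
    return fixed and no_regressions


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose patch counts as resolved."""
    if not results:
        raise ValueError("no task results")
    return sum(is_resolved(r) for r in results) / len(results)


# Four hypothetical tasks with made-up test names.
RESULTS = [
    TaskResult({"test_bug": True}, {"test_old": True}),    # fixed, no regression
    TaskResult({"test_bug": True}, {"test_old": False}),   # fixed, but broke something
    TaskResult({"test_bug": False}, {"test_old": True}),   # did not fix the issue
    TaskResult({"test_a": True, "test_b": True}, {}),      # fixed, nothing else to check
]

rate = resolve_rate(RESULTS)
print(f"resolve rate = {rate:.2f}")
```

Under this rule a patch that fixes the reported issue but breaks an unrelated test scores the same as no patch at all, which is what separates this style of benchmark from function-level code generation.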
07
FrontierMath
Research-level mathematics
A benchmark of original research-level problems built with the help of leading mathematicians; explicitly designed to resist memorisation.
State of the art. Best models still solve well under half of the problems.
08
Humanity's Last Exam (HLE)
Expert-level breadth
Closed-ended exam of thousands of expert-written questions, intended to be the last broad knowledge benchmark of its kind before models saturate the format.
State of the art. Frontier reasoning models answer roughly a quarter of the questions correctly.
09
MLE-Bench
ML engineering
OpenAI benchmark in which models attempt full Kaggle competitions end-to-end.
State of the art. Agentic systems earn medals on a meaningful fraction of competitions.
10
GAIA
General AI assistants
Real-world assistant tasks requiring tool use, web browsing, and multi-step reasoning.
State of the art. Widely used as a standard evaluation suite for agentic assistants.

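How to use this list: when a lab announces a new state of the art, check which benchmark before you check by how much. A jump on FrontierMath, whose problems are original and built to resist memorisation, is far harder to game than a jump on MMLU, where frontier models are already saturated.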
