AGI benchmarks explained

Every capability claim in this industry is backed by a benchmark. Here is what each of the ten most-cited ones actually measures — and where they fail.

A benchmark is a fixed dataset of problems with a defined scoring rule. A score on MMLU is not a score on 'intelligence'; it is a score on a fixed set of multiple-choice questions, written by humans, covering 57 specific subjects. Knowing the texture of each benchmark is the difference between reading a model release with curiosity and reading it with judgment.
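
To make "a fixed dataset with a defined scoring rule" concrete, here is a minimal sketch in Python. The `Item` class, the `accuracy` function, and the three-item dataset are all invented for illustration; they are not taken from any real benchmark harness, which would add prompting, answer parsing, and much larger datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One multiple-choice problem: a question, its options, and the answer key."""
    question: str
    choices: tuple[str, ...]
    answer: int  # index into choices


def accuracy(items: list[Item], predictions: list[int]) -> float:
    """Scoring rule: fraction of items where the predicted index matches the key."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    if not items:
        raise ValueError("cannot score an empty dataset")
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return correct / len(items)


# Hypothetical three-item dataset, invented for this example.
DATASET = [
    Item("2 + 2 = ?", ("3", "4", "5", "6"), 1),
    Item("Chemical symbol for sodium?", ("Na", "So", "Sd", "N"), 0),
    Item("Largest planet in the Solar System?", ("Mars", "Venus", "Jupiter", "Earth"), 2),
]

# Two of the three predictions match the answer key.
score = accuracy(DATASET, [1, 0, 3])
```

The number that comes out describes performance on exactly these items under exactly this rule, and nothing more. That is the sense in which a benchmark score is narrower than the capability it is often quoted as measuring.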

  1. MMLU

    Multidisciplinary knowledge
    Massive Multitask Language Understanding

    57 subjects spanning STEM, humanities, and professional fields. Long the default headline benchmark for LLM general knowledge.

    State of the art. Saturated above 90% by frontier models.

  2. MMLU-Pro

    Harder multidisciplinary

    A harder, reasoning-focused successor to MMLU, with more demanding questions and ten answer choices per question instead of four.

    State of the art. Active leaderboard; scores are climbing rapidly.

  3. GPQA

    Hard sciences
    Graduate-level Google-Proof Q&A

    Graduate-level biology, physics, and chemistry questions designed to be unsearchable on the open web.

    State of the art. Frontier reasoning models now exceed the accuracy of PhD-level human experts on the Diamond subset.

  4. ARC-AGI

    Abstract reasoning

    François Chollet's grid-puzzle benchmark, designed to resist memorisation and measure fluid intelligence.

    State of the art. ARC-AGI-2 remains a major open benchmark for AGI-style generalisation.

  5. HumanEval

    Code generation

    164 Python programming problems with unit tests, introduced alongside OpenAI Codex.

    State of the art. Largely saturated; superseded by SWE-bench-style benchmarks.

  6. SWE-bench Verified

    Real-world software engineering

    Human-verified subset of real GitHub issues from popular Python repositories; the model must produce a patch that passes the project's own tests.

    State of the art. Frontier coding agents now resolve a majority of verified tasks.

  7. FrontierMath

    Research-level mathematics

    A benchmark of original research-level problems built with the help of leading mathematicians; explicitly designed to resist memorisation.

    State of the art. Best models still solve well under half of the problems.

  8. Humanity's Last Exam (HLE)

    Expert-level breadth

    Closed-ended exam of thousands of expert-written questions, intended to be the final broad knowledge benchmark before saturation.

    State of the art. Frontier reasoning models answer roughly a quarter of the questions correctly.

  9. MLE-bench

    ML engineering

    OpenAI benchmark in which models attempt full Kaggle competitions end-to-end.

    State of the art. Agentic systems earn medals on a meaningful fraction of competitions.

  10. GAIA

    General AI assistants

    Real-world assistant tasks requiring tool use, web browsing, and multi-step reasoning.

    State of the art. Widely used as a standard evaluation suite for agentic assistants.
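
Two entries in the list, HumanEval and SWE-bench Verified, score a model by running tests against its output rather than by matching an answer key. The sketch below illustrates that idea in Python under heavy simplification. The function names `passes_tests` and `resolved_rate` and the two toy tasks are hypothetical, and a real harness would run untrusted code in a sandbox with timeouts rather than calling `exec` directly.

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Return True if the candidate code runs and every assertion in the tests holds.

    Illustration only: exec() on untrusted model output is unsafe outside a
    sandbox, and this sketch has no timeout for non-terminating code.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        exec(test_source, namespace)       # run the assertions against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a miss.
        return False
    return True


def resolved_rate(tasks: list[tuple[str, str]]) -> float:
    """Fraction of (candidate, tests) pairs whose tests all pass."""
    if not tasks:
        raise ValueError("cannot score an empty task list")
    return sum(passes_tests(code, tests) for code, tests in tasks) / len(tasks)


# Hypothetical tasks, invented for this example: one correct candidate, one buggy.
TASKS = [
    (
        "def add(a, b):\n    return a + b\n",
        "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    ),
    (
        "def is_even(n):\n    return n % 2 == 1\n",  # deliberate bug
        "assert is_even(4)\nassert not is_even(3)\n",
    ),
]

rate = resolved_rate(TASKS)  # the first task passes, the second fails its tests
```

A test-based rule credits any solution that behaves correctly, not just one that matches a reference string. It is also only as strict as the tests themselves, which is one reason the human verification step in SWE-bench Verified matters.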

How to use this list: when a lab announces a new state of the art, check which benchmark before you check by how much. A jump on FrontierMath is far harder to game than a jump on MMLU.