Landmark AGI papers

The twenty papers, reports, and regulations whose ideas you can find inside almost every frontier model and AI policy today.

This list is opinionated. It privileges papers whose ideas became infrastructure: architectures every lab now uses, training techniques that turned research demos into products, and policy frameworks that govern how the field is allowed to operate. Entries are roughly chronological within their theme.

01
Attention Is All You Need
Vaswani et al. · NeurIPS, 2017
Summary. Introduces the Transformer architecture, replacing recurrence with self-attention.
Why it matters. The architectural foundation of nearly every modern frontier model.
Read the paper
02
Language Models are Few-Shot Learners (GPT-3)
Brown et al. · NeurIPS, 2020
Summary. Demonstrates that scaling a Transformer language model unlocks broad few-shot capability.
Why it matters. Defined the modern LLM paradigm and ignited the scaling race.
Read the paper
03
Scaling Laws for Neural Language Models
Kaplan et al. · arXiv, 2020
Summary. Establishes empirical power-law relationships linking loss to model size, dataset size, and compute.
Why it matters. Formalised the predictability of scaling and shaped frontier training plans.
Read the paper
04
Training Compute-Optimal Large Language Models (Chinchilla)
Hoffmann et al. · DeepMind, 2022
Summary. Shows most large models had been undertrained on data; rebalances compute toward more tokens.
Why it matters. Reset the compute-optimal balance between parameters and training tokens that later training runs take as their baseline.
Read the paper
05
Highly accurate protein structure prediction with AlphaFold
Jumper et al. · Nature, 2021
Summary. Predicts protein 3D structure from amino-acid sequence at near-experimental accuracy, addressing a 50-year-old grand challenge.
Why it matters. First domain in which an AI system became the reference instrument for science.
Read the paper
06
Training language models to follow instructions with human feedback (InstructGPT)
Ouyang et al. · NeurIPS, 2022
Summary. Applies reinforcement learning from human feedback (RLHF) to align a language model with human instructions and preferences.
Why it matters. The technique that made ChatGPT, Claude, and Gemini usable products.
Read the paper
07
Constitutional AI: Harmlessness from AI Feedback
Bai et al. · Anthropic, 2022
Summary. Trains a model to critique and revise its own outputs against a written constitution.
Why it matters. Founding methodology behind Claude and a major thread in scalable oversight.
Read the paper
08
Sparks of Artificial General Intelligence
Bubeck et al. · Microsoft Research, 2023
Summary. Early empirical study of GPT-4 arguing it shows fragments of general intelligence.
Why it matters. Reframed the public debate about how close current systems are to AGI.
Read the paper
09
Emergent Abilities of Large Language Models
Wei et al. · TMLR, 2022
Summary. Catalogues capabilities that appear abruptly past a scale threshold.
Why it matters. Crystallised the discussion of emergence and unpredictability in LLMs.
Read the paper
10
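Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei et al. · NeurIPS, 2022
Summary. Shows that few-shot prompts containing worked intermediate reasoning steps substantially improve performance on arithmetic, commonsense, and symbolic reasoning tasks.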
Why it matters. Set the stage for explicit reasoning models such as o1 and o3.
Read the paper
11
Toy Models of Superposition
Elhage et al. · Anthropic, 2022
Summary. Explains how neural networks pack more features than they have dimensions.
Why it matters. Foundational reading for mechanistic interpretability.
Read the paper
12
Scaling Monosemanticity
Templeton et al. · Anthropic, 2024
Summary. Extracts millions of interpretable features from Claude 3 Sonnet using sparse autoencoders.
Why it matters. Showed mechanistic interpretability can scale to frontier production models.
Read the paper
13
Discovering Language Model Behaviors with Model-Written Evaluations
Perez et al. · Anthropic, 2023
Summary. Uses models to generate large-scale behavioural evaluations of other models.
Why it matters. Provided the template for the model-generated evaluation pipelines now common in model testing.
Read the paper
14
GPT-4 Technical Report
OpenAI · OpenAI, 2023
Summary. System card and capability summary for the model that defined the frontier in 2023–24.
Why it matters. The reference document for the first widely deployed multimodal frontier model.
Read the paper
15
Gemini: A Family of Highly Capable Multimodal Models
Google DeepMind · DeepMind, 2023
Summary. Introduces Google's natively multimodal model family.
Why it matters. Google's first unified frontier model family after the merger of Google Brain and DeepMind.
Read the paper
16
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron et al. · Meta AI, 2023
Summary. Open-weight release of Llama 2 models from 7B to 70B parameters, with documentation of the safety training.
Why it matters. Anchored the modern open-weights ecosystem.
Read the paper
17
AlphaProof and AlphaGeometry 2
Google DeepMind · DeepMind, 2024
Summary. Systems that together solved four of the six problems at the 2024 International Mathematical Olympiad, a silver-medal-level score.
Why it matters. Concrete evidence of frontier mathematical reasoning by AI.
Read the paper
18
International AI Safety Report 2025
Bengio et al. · UK Government, 2025
Summary. First annual consensus report on advanced AI risks, chaired by Yoshua Bengio and backed by 30 countries.
Why it matters. The closest thing the field has to an IPCC-style assessment.
Read the paper
19
NIST AI Risk Management Framework (AI RMF 1.0)
NIST · NIST, 2023
Summary. Voluntary US framework for managing AI risks across the system lifecycle.
Why it matters. A widely used reference governance framework for US enterprises.
Read the paper
20
EU AI Act (Regulation 2024/1689)
European Parliament & Council · EUR-Lex, 2024
Summary. Risk-tiered regulation of AI systems, with dedicated rules for general-purpose AI models.
Why it matters. The first comprehensive horizontal AI law from a major jurisdiction.
Read the paper

How to use this list: read the abstracts of the first ten to understand modern AI capabilities; read the last three (International AI Safety Report, NIST AI RMF, EU AI Act) to understand the rules.
