Landmark AGI papers
The twenty papers, reports, and regulations whose ideas you can find inside almost every frontier model and AI policy today.
This list is opinionated. It privileges papers whose ideas became infrastructure: architectures every lab now uses, training techniques that turned research demos into products, and policy frameworks that govern how the field is allowed to operate. Entries are roughly chronological within their theme.
- 01
Attention Is All You Need
Vaswani et al. · NeurIPS, 2017
Summary. Introduces the Transformer architecture, replacing recurrence with self-attention.
Why it matters. The architectural foundation of every modern frontier model.
- 02
Language Models are Few-Shot Learners (GPT-3)
Brown et al. · NeurIPS, 2020
Summary. Demonstrates that scaling a Transformer language model unlocks broad few-shot capability.
Why it matters. Defined the modern LLM paradigm and ignited the scaling race.
- 03
Scaling Laws for Neural Language Models
Kaplan et al. · arXiv, 2020
Summary. Empirical power-law relationships between loss, model size, dataset size, and compute.
Why it matters. Formalised the predictability of scaling and shaped frontier training plans.
- 04
Training Compute-Optimal Large Language Models (Chinchilla)
Hoffmann et al. · DeepMind, 2022
Summary. Shows that most large models had been trained on too few tokens for their size; rebalances compute toward more data.
Why it matters. Reset the compute-optimal balance every modern training run uses.
- 05
Highly accurate protein structure prediction with AlphaFold
Jumper et al. · Nature, 2021
Summary. Predicts a protein's 3D structure from its amino-acid sequence to near-experimental accuracy, a grand challenge that had stood for 50 years.
Why it matters. First domain in which an AI system became the reference instrument for science.
- 06
Training language models to follow instructions with human feedback (InstructGPT)
Ouyang et al. · NeurIPS, 2022
Summary. Applies RLHF to align a language model with human instructions and preferences.
Why it matters. The technique that made ChatGPT, Claude, and Gemini usable products.
- 07
Constitutional AI: Harmlessness from AI Feedback
Bai et al. · Anthropic, 2022
Summary. Trains a model to critique and revise its own outputs against a written constitution.
Why it matters. Founding methodology behind Claude and a major thread in scalable oversight.
- 08
Sparks of Artificial General Intelligence
Bubeck et al. · Microsoft Research, 2023
Summary. Early empirical study of GPT-4 arguing it shows fragments of general intelligence.
Why it matters. Reframed the public debate about how close current systems are to AGI.
- 09
Emergent Abilities of Large Language Models
Wei et al. · TMLR, 2022
Summary. Catalogues capabilities that appear abruptly past a scale threshold.
Why it matters. Crystallised the discussion of emergence and unpredictability in LLMs.
- 10
Chain-of-Thought Prompting Elicits Reasoning
Wei et al. · NeurIPS, 2022
Summary. Shows that prompting large models with worked, step-by-step examples substantially improves their reasoning.
Why it matters. Set the stage for explicit reasoning models such as o1 and o3.
- 11
Toy Models of Superposition
Elhage et al. · Anthropic, 2022
Summary. Explains how neural networks pack more features than they have dimensions.
Why it matters. Foundational reading for mechanistic interpretability.
- 12
Scaling Monosemanticity
Templeton et al. · Anthropic, 2024
Summary. Extracts millions of interpretable features from Claude 3 Sonnet using sparse autoencoders.
Why it matters. Showed mechanistic interpretability can scale to frontier production models.
- 13
Discovering Language Model Behaviors with Model-Written Evaluations
Perez et al. · Anthropic, 2023
Summary. Uses models to generate large-scale behavioural evaluations of other models.
Why it matters. Templated the modern model-eval pipeline.
- 14
GPT-4 Technical Report
OpenAI · OpenAI, 2023
Summary. System card and capability summary for the model that defined the frontier in 2023–24.
Why it matters. The reference document for the first widely deployed multimodal frontier model.
- 15
Gemini: A Family of Highly Capable Multimodal Models
Google DeepMind · DeepMind, 2023
Summary. Introduces Google's natively multimodal model family.
Why it matters. Marked Google's unified post-merger frontier offering.
- 16
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron et al. · Meta AI, 2023
Summary. Open-weight Llama 2 release, 7B to 70B parameters, with documentation of its safety training.
Why it matters. Anchored the modern open-weights ecosystem.
- 17
AlphaProof and AlphaGeometry 2
Google DeepMind · DeepMind, 2024
Summary. Systems that solve International Mathematical Olympiad problems at silver-medal level.
Why it matters. Concrete evidence of frontier mathematical reasoning by AI.
- 18
International AI Safety Report 2025
Bengio et al. · UK Government, 2025
Summary. First annual consensus report on advanced AI risks, chaired by Yoshua Bengio and backed by 30 countries.
Why it matters. The closest thing the field has to an IPCC-style assessment.
- 19
NIST AI Risk Management Framework (AI RMF 1.0)
NIST · NIST, 2023
Summary. Voluntary US framework for managing AI risks across the system lifecycle.
Why it matters. The reference governance framework most US enterprises follow.
- 20
EU AI Act (Regulation 2024/1689)
European Parliament & Council · EUR-Lex, 2024
Summary. Risk-tiered regulation of AI systems, with dedicated rules for general-purpose AI models.
Why it matters. The first comprehensive horizontal AI law from a major jurisdiction.
How to use this list: read the abstracts of the first ten to understand modern AI capabilities; read the last three (International AI Safety Report, NIST AI RMF, EU AI Act) to understand the rules.
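A worked example: entries 03 and 04 are easier to appreciate with a little arithmetic. The sketch below is a back-of-envelope illustration, not code from either paper. It rests on two commonly quoted rules of thumb that this list does not itself state: training compute is roughly 6 FLOPs per parameter per token (C ≈ 6·N·D), and a compute-optimal run uses about 20 training tokens per parameter. The function name and constants are hypothetical and exist only for this illustration.

```python
import math

# Assumption: training compute C ≈ 6 * N * D FLOPs
# (N = parameter count, D = training tokens).
FLOPS_PER_PARAM_TOKEN = 6.0

# Assumption: roughly 20 tokens per parameter is compute-optimal,
# a popular simplification of the Chinchilla result.
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) under the two assumptions above."""
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    # Substitute D = 20 * N into C = 6 * N * D to get C = 120 * N**2, then solve for N.
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = 5.88e23  # illustrative FLOP budget
    params, tokens = compute_optimal_split(budget)
    print(f"parameters ≈ {params:.3g}, tokens ≈ {tokens:.3g}")
```

With the illustrative budget above, the split comes out at roughly 70 billion parameters and 1.4 trillion tokens. Treat the numbers as an order-of-magnitude guide only: the fitted exponents in the papers depend on data, architecture, and training setup.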