Safety & alignment essentials

Fifteen reads that, between them, cover the present case for taking AI safety seriously and the strongest research responses to it.

The list mixes foundational papers, accessible books, lab and evaluation reports, a blog post, and a course. The entries are not in strict date order, so use the year on each entry if you want a chronological view of how the field's thinking has evolved.

01
Concrete Problems in AI Safety
Paper
Amodei et al. · 2016
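Sets out five concrete research problems (negative side effects, reward hacking, scalable oversight, safe exploration, and distributional shift) that shaped the modern alignment research agenda.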
Why read this. Almost every later safety paper traces back here.
02
Superintelligence
Book
Nick Bostrom · 2014
The book that put AGI risk on the global policy agenda.
Why read this. Whether or not you agree, you should know the argument.
03
Human Compatible
Book
Stuart Russell · 2019
Russell's argument for redesigning AI around uncertainty about human preferences.
Why read this. Reframes alignment as a property of the system, not an afterthought.
04
The Alignment Problem
Book
Brian Christian · 2020
Accessible reportorial history of the alignment field.
Why read this. The best lay introduction; also useful for technical readers.
05
Risks from Learned Optimization
Paper
Hubinger et al. · 2019
Formalises mesa-optimisation and inner alignment.
Why read this. Foundational vocabulary for alignment researchers.
06
Sleeper Agents
Paper
Hubinger et al. · 2024
Anthropic shows that deceptive behaviour can persist through standard safety training.
Why read this. Empirical evidence for a previously theoretical concern.
07
Discovering Language Model Behaviors with Model-Written Evaluations
Paper
Perez et al. · 2023
Uses models to generate large-scale behavioural evaluations of other models.
Why read this. Set the template for the modern eval pipeline.
08
AI Control: Improving Safety Despite Intentional Subversion
Paper
Greenblatt et al. · 2024
Founding paper for the 'AI control' research agenda.
Why read this. Practical safety even under pessimistic alignment assumptions.
09
Frontier Model Safety Frameworks
Report
OpenAI / Anthropic / DeepMind · 2023–2025
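Public frontier-safety policies from three leading labs: Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework.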
Why read this. Know the commitments the frontier labs hold themselves to.
10
An Overview of Catastrophic AI Risks
Paper
Hendrycks et al. · 2023
A structured taxonomy of risk pathways from advanced AI.
Why read this. Useful map of the threat landscape, written for non-specialists.
11
Scaling Monosemanticity
Paper
Anthropic · 2024
Uses sparse autoencoders to extract millions of interpretable features from Claude 3 Sonnet.
Why read this. Best example of interpretability scaling to a production model.
12
Eliciting Latent Knowledge
Report
Christiano et al. · 2021
A long, recursive write-up of one of the hardest open problems in alignment.
Why read this. Read to feel how hard alignment actually is.
13
Frontier AI Risks
Report
Apollo Research · 2024
Apollo's evaluation reports on deception and scheming in frontier models.
Why read this. Concrete model-eval results, updated regularly.
14
Why I think more NLP researchers should engage with AI safety concerns
Post
Sam Bowman · 2022
A measured invitation for mainstream ML researchers to take safety seriously.
Why read this. Best bridge from research curiosity to safety engagement.
15
AI Safety Fundamentals course
Course
BlueDot Impact · 2024
A free 8-week structured course in technical AI safety and AI governance.
Why read this. The shortest route from interested to genuinely informed.
