Safety & alignment essentials
The fifteen reads that, between them, cover the present case for taking AI safety seriously and the strongest research responses to it.
The list mixes foundational papers, accessible books, and concrete evaluation results. The order is only roughly chronological, but reading top to bottom still shows how the field's thinking has evolved.
- 01
Concrete Problems in AI Safety
Paper · Amodei et al. · 2016
The paper that defined the modern alignment research agenda.
Why read this. Almost every later safety paper traces back here.
- 02
Superintelligence
Book · Nick Bostrom · 2014
The book that put AGI risk on the global policy agenda.
Why read this. Whether or not you agree, you should know the argument.
- 03
Human Compatible
Book · Stuart Russell · 2019
Russell's argument for redesigning AI around uncertainty about human preferences.
Why read this. Reframes alignment as a property of the system, not an afterthought.
- 04
The Alignment Problem
Book · Brian Christian · 2020
Accessible reportorial history of the alignment field.
Why read this. The best lay introduction; also useful for technical readers.
- 05
Risks from Learned Optimization
Paper · Hubinger et al. · 2019
Formalises mesa-optimisation and inner alignment.
Why read this. Foundational vocabulary for alignment researchers.
- 06
Sleeper Agents
Paper · Hubinger et al. · 2024
Anthropic shows that deceptive behaviour can persist through standard safety training.
Why read this. Empirical evidence for a previously theoretical concern.
- 07
Discovering Language Model Behaviors with Model-Written Evaluations
Paper · Perez et al. · 2023
Uses models to generate large-scale behavioural evaluations of other models.
Why read this. Set the template for the modern eval pipeline.
- 08
AI Control: Improving Safety Despite Intentional Subversion
Paper · Greenblatt et al. · 2024
Founding paper for the 'AI control' research agenda.
Why read this. Practical safety even under pessimistic alignment assumptions.
- 09
Frontier Model Safety Frameworks
Report · OpenAI / Anthropic / DeepMind · 2023–2025
Public frontier-safety frameworks, including responsible-scaling policies, from the three leading labs.
Why read this. Know the commitments the frontier labs hold themselves to.
- 10
An Overview of Catastrophic AI Risks
Paper · Hendrycks et al. · 2023
A structured taxonomy of risk pathways from advanced AI.
Why read this. Useful map of the threat landscape, written for non-specialists.
- 11
Scaling Monosemanticity
Paper · Anthropic · 2024
Extracts millions of interpretable features from Claude.
Why read this. Best example of interpretability becoming production-grade.
- 12
Eliciting Latent Knowledge
Report · Christiano et al. · 2021
A long, recursive write-up of one of the hardest open problems in alignment.
Why read this. Read to feel how hard alignment actually is.
- 13
Frontier AI Risks
Report · Apollo Research · 2024
Apollo's evaluation reports on deception and scheming in frontier models.
Why read this. Concrete model-eval results, updated regularly.
- 14
Why I think more NLP researchers should engage with AI safety concerns
Post · Sam Bowman · 2022
A measured invitation for mainstream ML researchers to take safety seriously.
Why read this. Best bridge from research curiosity to safety engagement.
- 15
AI Safety Fundamentals course
Course · BlueDot Impact · 2024
A free 8-week structured course in technical AI safety and AI governance.
Why read this. The shortest route from interested to genuinely informed.