Safety & alignment essentials

The fifteen reads that, between them, cover the present case for taking AI safety seriously and the strongest research responses to it.

The list mixes foundational papers, accessible books, and concrete evaluation results. Entries are not in strict publication order, so use the dates to trace how the field's thinking has evolved.

  1. Concrete Problems in AI Safety

    Paper
    Amodei et al. · 2016

    The paper that defined the modern alignment research agenda.

    Why read this. Almost every later safety paper traces back here.

  2. Superintelligence

    Book
    Nick Bostrom · 2014

    The book that put AGI risk on the global policy agenda.

    Why read this. Whether or not you agree, you should know the argument.

  3. Human Compatible

    Book
    Stuart Russell · 2019

    Russell's argument for redesigning AI around uncertainty about human preferences.

    Why read this. Reframes alignment as a property of the system, not an afterthought.

  4. The Alignment Problem

    Book
    Brian Christian · 2020

    Accessible reportorial history of the alignment field.

    Why read this. The best lay introduction; also useful for technical readers.

  5. Risks from Learned Optimization

    Paper
    Hubinger et al. · 2019

    Formalises mesa-optimisation and inner alignment.

    Why read this. Foundational vocabulary for alignment researchers.

  6. Sleeper Agents

    Paper
    Hubinger et al. · 2024

    Anthropic shows that deceptive behaviour can persist through standard safety training.

    Why read this. Empirical evidence for a previously theoretical concern.

  7. Discovering Language Model Behaviors with Model-Written Evaluations

    Paper
    Perez et al. · 2023

    Uses models to generate large-scale behavioural evaluations of other models.

    Why read this. Set the template for the modern eval pipeline.

  8. AI Control: Improving Safety Despite Intentional Subversion

    Paper
    Greenblatt et al. · 2024

    Founding paper for the 'AI control' research agenda.

    Why read this. Practical safety even under pessimistic alignment assumptions.

  9. Frontier Model Safety Frameworks

    Report
    OpenAI / Anthropic / DeepMind · 2023–2025

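    Public frontier-safety frameworks from three leading labs: OpenAI's Preparedness Framework, Anthropic's Responsible Scaling Policy, and Google DeepMind's Frontier Safety Framework.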

    Why read this. Know the commitments the frontier labs hold themselves to.

  10. An Overview of Catastrophic AI Risks

    Paper
    Hendrycks et al. · 2023

    A structured taxonomy of risk pathways from advanced AI.

    Why read this. Useful map of the threat landscape, written for non-specialists.

  11. Scaling Monosemanticity

    Paper
    Anthropic · 2024

    Extracts millions of interpretable features from Claude 3 Sonnet using sparse autoencoders.

    Why read this. The clearest example so far of interpretability methods scaling to a production model.

  12. Eliciting Latent Knowledge

    Report
    Christiano et al. · 2021

    A long, recursive write-up of one of the hardest open problems in alignment.

    Why read this. Read to feel how hard alignment actually is.

  13. Frontier AI Risks

    Report
    Apollo Research · 2024

    Apollo's evaluation reports on deception and scheming in frontier models.

    Why read this. Concrete model-eval results, updated regularly.

  14. Why I think more NLP researchers should engage with AI safety concerns

    Post
    Sam Bowman · 2022

    A measured invitation for mainstream ML researchers to take safety seriously.

    Why read this. Best bridge from research curiosity to safety engagement.

  15. AI Safety Fundamentals course

    Course
    BlueDot Impact · 2024

    A free 8-week structured course in technical AI safety and AI governance.

    Why read this. The shortest route from interested to genuinely informed.