Insights

Short reads on the AGI transition

Eight curated, opinionated takes on capabilities, evaluation, agents, alignment, and policy, each a read of five minutes or less.

01 · AGI capabilities · 4 min
Scaling is not understanding
Bigger models keep getting better at benchmarks, but capability and comprehension are different axes.
Scaling laws describe how loss falls predictably as parameters, data, and compute grow. They do not describe whether a model has a coherent world model, only that next-token prediction error shrinks.
Capability jumps that look like understanding often turn out to be sharper pattern completion on tasks well covered by training data. Carefully held-out probes typically reopen the gap between benchmark performance and comprehension.
Treat scaling as a force multiplier for whatever cognitive structure already exists in the architecture and data — not as an automatic route to general reasoning.
Takeaway
Use scaling as a tool, not a theory of mind. Pair every scaling claim with an out-of-distribution probe.
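A minimal sketch of what a scaling law does and does not say. The functional form (an irreducible floor plus power-law terms in parameters and data) is a commonly used one; every constant below is a hypothetical placeholder chosen for illustration, not a fitted value, and the 20-tokens-per-parameter ratio is likewise an assumption.

```python
def predicted_loss(n_params: float, n_tokens: float,
                   floor: float = 1.7, a: float = 400.0, b: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Loss as an irreducible floor plus power-law terms in parameters and data.

    All constants are hypothetical placeholders, not fitted values.
    """
    return floor + a / n_params ** alpha + b / n_tokens ** beta


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10, 1e11):
        # Assumed fixed ratio of 20 training tokens per parameter.
        print(f"{n:.0e} params -> predicted loss {predicted_loss(n, 20 * n):.3f}")
```

The curve predicts next-token error and nothing else. No term in it measures behaviour on held-out structure, which is why the probe has to be a separate measurement.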
02 · Evaluation · 4 min
Benchmarks saturate faster than you think
Most public benchmarks have a usable lifespan of 12–24 months before saturation or contamination undermines them.
Once a benchmark is widely cited, it leaks into training corpora, prompts, and synthetic data pipelines. Scores rise for reasons that have nothing to do with the underlying skill.
Long-horizon, adversarial, and held-out-by-construction evaluations age better. Static multiple-choice sets age worst.
When reading a model release, weight evaluations by how easy they would be to contaminate, not by how impressive the headline number is.
Takeaway
Discount benchmark gains that arrive after the benchmark went viral. Trust private, dynamic, and process-graded evals more.
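One rough way to estimate how exposed a benchmark item is to contamination is word-level n-gram overlap against a sample of training text. The sketch below assumes you have such a sample; the function names and the default n-gram length are illustrative choices, and exact-match overlap will miss paraphrased leaks.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_score(benchmark_item: str, corpus_sample: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus sample."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(item_grams & ngrams(corpus_sample, n)) / len(item_grams)
```

A score near 1.0 means the item appears almost verbatim in the sample. A score of 0.0 only means no exact overlap was found in the text you checked, not that the item is clean.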
03 · Agents · 4 min
Agents are products of their tools
Most observed differences between agent frameworks reduce to differences in their tool surface, not their reasoning.
Identical base models routed through different tool sets behave very differently. The agent loop is mostly a thin orchestrator on top of retrieval, code execution, and memory.
Reliability gains in 2025–2026 came mainly from better-typed tools, stricter input validation, and structured outputs — not from cleverer planning prompts.
When an agent fails, debug the tools and the contract first, then the model. The model is usually the least repairable part of the stack.
Takeaway
Design the tool layer like an API for a junior engineer who never reads the docs twice.
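A minimal sketch of a typed tool with strict input validation and a structured result. The `search_docs` tool, its `SearchArgs` fields, and the limits are hypothetical and stand in for whatever tool surface your agent exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchArgs:
    """Arguments for a hypothetical `search_docs` tool."""
    query: str
    max_results: int = 5

    def __post_init__(self) -> None:
        if not isinstance(self.query, str) or not self.query.strip():
            raise ValueError("query must be a non-empty string")
        if not isinstance(self.max_results, int) or not 1 <= self.max_results <= 20:
            raise ValueError("max_results must be an integer between 1 and 20")


def call_tool(raw_args: dict) -> dict:
    """Validate model-supplied arguments and always return a structured result."""
    try:
        args = SearchArgs(**raw_args)
    except (TypeError, ValueError) as exc:
        # Return the error as data so the agent loop can show it to the model.
        return {"ok": False, "error": str(exc)}
    # Placeholder for the real tool body: echo the validated arguments.
    return {"ok": True, "query": args.query, "max_results": args.max_results}
```

Bad arguments come back as a readable error rather than an exception, so the model can correct its call on the next turn instead of crashing the loop.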
04 · Safety · 5 min
Alignment is becoming an engineering discipline
Alignment research is shifting from philosophy to measurable engineering: evals, red teams, interpretability dashboards.
Frontier labs now ship alignment work as artefacts — model cards, system cards, capability evaluations, refusal datasets — not just papers.
Mechanistic interpretability has matured into a tooling layer that lets engineers inspect features and circuits, even if a complete theory of internal computation is still missing.
Treat alignment as part of the normal software lifecycle: regressions, CI evals, incident reviews, and post-mortems.
Takeaway
If alignment claims do not come with reproducible evaluations, treat them as marketing.
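Treating alignment evals like any other CI check can be as simple as a regression gate over scores. The sketch below assumes each eval produces a single score where higher is better; the eval names in the usage note and the default tolerance are hypothetical.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.01) -> list[str]:
    """Return evals whose score dropped by more than tolerance, or went missing."""
    failures = []
    for name, old_score in baseline.items():
        new_score = current.get(name)
        if new_score is None:
            failures.append(f"{name}: missing from current run")
        elif old_score - new_score > tolerance:
            failures.append(f"{name}: {old_score:.3f} -> {new_score:.3f}")
    return failures
```

In a pipeline, a non-empty result from something like `find_regressions({"refusal_accuracy": 0.95}, latest_scores)` would fail the build and open an incident review, the same way a failing unit test would.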
05 · Architecture · 4 min
The context window is not memory
Long contexts give models more to read, not more to remember. The distinction matters for product design.
A 1M-token context lets a model attend to a lot of text in a single call. It does not give the model persistent identity, learned preferences, or accumulated skill across sessions.
Real memory requires explicit storage, retrieval, and a policy for what to forget. Most production systems still bolt this on with vector stores and structured logs.
Designs that confuse the two end up paying for huge contexts to simulate memory poorly, instead of building memory properly.
Takeaway
Decide which facts must persist across sessions, then store them outside the model. Use context for working memory only.
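A minimal sketch of memory kept outside the model: explicit storage, keyed retrieval, and an age-based forgetting policy. The `MemoryStore` class and its 30-day default are hypothetical; a production system would persist to disk or a database and would likely retrieve by similarity rather than exact key.

```python
import time
from typing import Optional


class MemoryStore:
    """Session-independent fact store with a simple age-based forgetting policy."""

    def __init__(self, max_age_seconds: float = 30 * 24 * 3600) -> None:
        self.max_age_seconds = max_age_seconds
        self._facts: dict[str, tuple[str, float]] = {}

    def remember(self, key: str, value: str, now: Optional[float] = None) -> None:
        """Store a fact with the time it was written."""
        self._facts[key] = (value, time.time() if now is None else now)

    def forget_stale(self, now: Optional[float] = None) -> None:
        """Drop every fact older than max_age_seconds."""
        cutoff = (time.time() if now is None else now) - self.max_age_seconds
        self._facts = {k: v for k, v in self._facts.items() if v[1] >= cutoff}

    def recall(self, keys: list[str]) -> dict[str, str]:
        """Return only the requested facts, ready to place in the context window."""
        return {k: self._facts[k][0] for k in keys if k in self._facts}
```

Only what `recall` returns goes into the prompt, so the context window carries the working set for this call while the store carries everything that must outlive the session.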
06 · Human + AI · 5 min
Human judgement is the new bottleneck
As models get cheaper and faster, the scarce resource is the human ability to verify, prioritise, and decide.
Throughput of generated artefacts — code, copy, designs, plans — has outpaced any individual reviewer's capacity to evaluate them carefully.
Teams that win with AI are restructuring workflows around review: smaller diffs, stronger contracts, paired evaluation, escalation rules for high-stakes outputs.
Calibration — knowing when to trust a model — is becoming a first-class skill, distinct from prompting or model selection.
Takeaway
Invest in review pipelines and calibration training, not just in more capable models.
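Calibration can be measured, not just felt. The sketch below computes expected calibration error, a standard metric that compares stated confidence with observed accuracy in equal-width bins. It assumes you have logged a confidence in [0, 1] and a reviewer's correct/incorrect verdict for each output; the bin count is an arbitrary default.

```python
def expected_calibration_error(confidences: list[float], correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 goes into the top bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            ece += len(bucket) / total * abs(mean_conf - accuracy)
    return ece
```

A value near zero means stated confidence tracks reality closely enough to drive escalation rules. A large value means reviewers should not yet use the confidence figure to decide what to skip.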
07 · Policy · 5 min
Compute is policy
Export controls, datacenter siting, and energy contracts now shape the frontier as much as research breakthroughs do.
Access to leading-edge accelerators, interconnect, and power is unevenly distributed across countries and firms. That distribution is a policy variable, not a market outcome.
Frontier training runs are sensitive to power availability, cooling, and network topology in ways that look more like heavy industry than software.
Anyone forecasting AGI timelines without a view on compute supply chains is forecasting half the problem.
Takeaway
Read compute and energy policy alongside model releases. They are the same story.
08 · AGI capabilities · 5 min
Generalisation is still the hard part
Modern systems are excellent interpolators inside their training distribution and remain fragile outside it.
Most apparent reasoning successes can be reproduced by retrieval-and-recombine over training data. Genuine extrapolation — solving problems whose structure was not seen — remains rare.
This is why narrow superhuman performance and embarrassing failures coexist in the same model. They are not contradictions; they are evidence of an interpolation engine being asked to extrapolate.
Closing this gap is the central scientific question of AGI, and it is not obviously a scaling problem.
Takeaway
When evaluating a model, design at least one task whose structure could not be in the training data. That is the test that matters.
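One hedged way to build such a task is to generate its rules at evaluation time. The sketch below invents a random binary operation over a few symbols and asks for a left-to-right evaluation, with the answer computed programmatically. The task design, function name, and defaults are illustrative assumptions. A fresh table makes the specific instance very unlikely to appear verbatim in training data, but it does not prove the task type is novel.

```python
import random


def make_task(seed: int, n_symbols: int = 4, length: int = 5) -> dict:
    """Build a fresh operation table and a left-to-right expression with its answer."""
    rng = random.Random(seed)
    symbols = [chr(ord("A") + i) for i in range(n_symbols)]
    # A random binary operation: every (left, right) pair maps to some symbol.
    table = {(x, y): rng.choice(symbols) for x in symbols for y in symbols}
    operands = [rng.choice(symbols) for _ in range(length)]
    # Ground truth: fold the operands through the table from left to right.
    answer = operands[0]
    for right in operands[1:]:
        answer = table[(answer, right)]
    rules = "; ".join(f"{x}*{y}={z}" for (x, y), z in table.items())
    prompt = f"Rules: {rules}. Evaluate left to right: {' * '.join(operands)}"
    return {"prompt": prompt, "answer": answer}
```

Drawing a new seed for every evaluation run keeps the rules from leaking into later training data, which is what makes this kind of eval held out by construction.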

Looking for definitions? See the AGI glossary or the glossary search tool.
