Insights

Short reads on the AGI transition

Eight curated, opinionated takes on capabilities, evaluation, agents, alignment, and policy, each a read of five minutes or less.

  1. AGI capabilities · 4 min

    Scaling is not understanding

    Bigger models keep getting better at benchmarks, but capability and comprehension are different axes.

    Scaling laws describe how loss falls predictably as parameters, data, and compute grow. They do not describe whether a model has a coherent world model, only that next-token prediction error shrinks.

    Capability jumps that look like understanding often turn out to be sharper pattern completion on tasks well covered by training data. Carefully held-out probes typically reopen the gap between benchmark score and comprehension.

    Treat scaling as a force multiplier for whatever cognitive structure already exists in the architecture and data — not as an automatic route to general reasoning.

    Takeaway

    Use scaling as a tool, not a theory of mind. Pair every scaling claim with an out-of-distribution probe.
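
One way to pair a scaling claim with an out-of-distribution probe is to report the accuracy gap between the two. The sketch below is a minimal illustration, not a standard metric: the function names and the toy predictions are hypothetical, and exact-match scoring is an assumption.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty set")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def generalisation_gap(in_dist, held_out):
    """Return in-distribution accuracy minus held-out accuracy.

    Each argument is a (predictions, labels) pair. A large positive gap
    suggests the headline score leans on training-data coverage.
    """
    return accuracy(*in_dist) - accuracy(*held_out)


# Hypothetical results for one model on two probe sets.
in_dist = (["a", "b", "c", "d"], ["a", "b", "c", "d"])   # 4 of 4 correct
held_out = (["a", "x", "c", "y"], ["a", "b", "c", "d"])  # 2 of 4 correct
gap = generalisation_gap(in_dist, held_out)
print(f"gap = {gap:.2f}")
```

Tracking this gap across model sizes shows whether scale is closing it or only raising the in-distribution score.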

  2. Evaluation · 4 min

    Benchmarks saturate faster than you think

    Most public benchmarks have a usable lifespan of 12–24 months before saturation or contamination undermines them.

    Once a benchmark is widely cited, it leaks into training corpora, prompts, and synthetic data pipelines. Scores rise for reasons that have nothing to do with the underlying skill.

    Long-horizon, adversarial, and held-out-by-construction evaluations age better. Static multiple-choice sets age worst.

    When reading a model release, weight evaluations by how easy they would be to contaminate, not by how impressive the headline number is.

    Takeaway

    Discount benchmark gains that arrive after the benchmark went viral. Trust private, dynamic, and process-graded evals more.
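
A rough way to gauge how contaminated a benchmark might be is to look for long word sequences it shares with a sample of training text. The sketch below assumes word-level n-gram overlap as the signal. The function names, the default window of 8 words, and the toy data are illustrative choices, and this is only one of several possible heuristics.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    if not benchmark_items:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)


# Hypothetical benchmark questions and one leaked document.
items = [
    "what is the capital of france",
    "name the largest moon of saturn",
]
corpus = ["quiz dump: what is the capital of france answer paris"]
rate = contamination_rate(items, corpus, n=4)
print(f"{rate:.0%} of items overlap the corpus")
```

Exact overlap misses paraphrased or translated leaks, so a low rate here is weak evidence of a clean benchmark, not proof.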

  3. Agents · 4 min

    Agents are products of their tools

    Most observed differences between agent frameworks reduce to differences in their tool surface, not their reasoning.

    Identical base models routed through different tool sets behave very differently. The agent loop is mostly a thin orchestrator on top of retrieval, code execution, and memory.

    Reliability gains in 2025–2026 came mainly from better-typed tools, stricter input validation, and structured outputs — not from cleverer planning prompts.

    When an agent fails, debug the tools and the contract first, then the model. The model is usually the least repairable part of the stack.

    Takeaway

    Design the tool layer like an API for a junior engineer who never reads the docs twice.
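
To make "better-typed tools, stricter input validation, and structured outputs" concrete, here is a minimal sketch of one tool contract. The `search_docs` tool, its fields, and the limits are hypothetical; a real stack would likely use a schema library, which this example deliberately avoids so that it stays self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchArgs:
    """Arguments for a hypothetical `search_docs` tool."""
    query: str
    max_results: int = 5


def parse_search_args(raw: dict) -> SearchArgs:
    """Validate raw model output before the tool ever runs."""
    unknown = set(raw) - {"query", "max_results"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    query = raw.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    max_results = raw.get("max_results", 5)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_results, bool) or not isinstance(max_results, int):
        raise ValueError("max_results must be an integer")
    if not 1 <= max_results <= 20:
        raise ValueError("max_results must be between 1 and 20")
    return SearchArgs(query=query.strip(), max_results=max_results)


def call_tool(raw: dict) -> dict:
    """Return a structured result: parsed args, or an error the model can act on."""
    try:
        args = parse_search_args(raw)
    except ValueError as exc:
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "args": args}


print(call_tool({"query": " agents ", "max_results": 3}))
print(call_tool({"query": "agents", "limit": 3}))
```

Returning the validation error as data, instead of raising, gives the agent loop something specific to retry against.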

  4. Safety · 5 min

    Alignment is becoming an engineering discipline

    Alignment research is shifting from philosophy to measurable engineering: evals, red teams, interpretability dashboards.

    Frontier labs now ship alignment work as artefacts — model cards, system cards, capability evaluations, refusal datasets — not just papers.

    Mechanistic interpretability has matured into a tooling layer that lets engineers inspect features and circuits, even if a complete theory of internal computation is still missing.

    Treat alignment as part of the normal software lifecycle: regressions, CI evals, incident reviews, and post-mortems.

    Takeaway

    If alignment claims do not come with reproducible evaluations, treat them as marketing.
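
Treating alignment evals like any other regression test could look like the sketch below, which compares a candidate model's scores against a stored baseline. The eval names, scores, and the 0.01 tolerance are all hypothetical, and higher scores are assumed to be better.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return evals whose score dropped by more than `tolerance`, or went missing."""
    regressions = {}
    for name, old in baseline.items():
        new = candidate.get(name)
        if new is None:
            regressions[name] = "missing from candidate run"
        elif old - new > tolerance:
            regressions[name] = f"{old:.3f} -> {new:.3f}"
    return regressions


# Hypothetical scores from two eval runs.
baseline = {"refusal_accuracy": 0.94, "jailbreak_resistance": 0.88}
candidate = {"refusal_accuracy": 0.95, "jailbreak_resistance": 0.81}

regressions = find_regressions(baseline, candidate)
for name, detail in regressions.items():
    print(f"REGRESSION {name}: {detail}")
```

In a CI job, a non-empty result would fail the build and open an incident review, the same way a failing unit test blocks a merge.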

  5. Architecture · 4 min

    The context window is not memory

    Long contexts give models more to read, not more to remember. The distinction matters for product design.

    A 1M-token context lets a model attend to a lot of text in a single call. It does not give the model persistent identity, learned preferences, or accumulated skill across sessions.

    Real memory requires explicit storage, retrieval, and a policy for what to forget. Most production systems still bolt this on with vector stores and structured logs.

    Designs that confuse the two end up paying for huge contexts to simulate memory poorly, instead of building memory properly.

    Takeaway

    Decide which facts must persist across sessions, then store them outside the model. Use context for working memory only.
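
The three ingredients named above (explicit storage, retrieval, and a forgetting policy) can be sketched in a few lines. This example assumes a plain key-value store with a time-to-live in place of a vector store, and every class and function name in it is hypothetical.

```python
import time


class MemoryStore:
    """Tiny key-value memory with a time-to-live forgetting policy."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested
        self._facts = {}      # key -> (value, stored_at)

    def remember(self, key, value):
        """Persist a fact outside the model, stamped with the current time."""
        self._facts[key] = (value, self._clock())

    def forget_expired(self):
        """Drop every fact older than the time-to-live."""
        now = self._clock()
        expired = [k for k, (_, t) in self._facts.items() if now - t > self.ttl_seconds]
        for key in expired:
            del self._facts[key]

    def recall(self, keys):
        """Return the still-valid facts for the given keys."""
        self.forget_expired()
        return {k: self._facts[k][0] for k in keys if k in self._facts}


def build_context(store, keys, user_message):
    """Assemble working memory for one call: persisted facts plus the new message."""
    facts = store.recall(keys)
    lines = [f"{k}: {v}" for k, v in sorted(facts.items())]
    return "\n".join(lines + [user_message])


store = MemoryStore(ttl_seconds=3600)
store.remember("preferred_language", "Python")
print(build_context(store, ["preferred_language"], "Write a sorting example."))
```

Only the facts selected for this call enter the context window; everything else stays in the store until it is needed or expires.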

  6. Human + AI · 5 min

    Human judgement is the new bottleneck

    As models get cheaper and faster, the scarce resource is the human ability to verify, prioritise, and decide.

    Throughput of generated artefacts — code, copy, designs, plans — has outpaced any individual reviewer's capacity to evaluate them carefully.

    Teams that win with AI are restructuring workflows around review: smaller diffs, stronger contracts, paired evaluation, escalation rules for high-stakes outputs.

    Calibration — knowing when to trust a model — is becoming a first-class skill, distinct from prompting or model selection.

    Takeaway

    Invest in review pipelines and calibration training, not just in more capable models.
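
An escalation rule for high-stakes outputs can start as a small, explicit routing function. The sketch below is illustrative only: the route names, the 0.8 confidence threshold, and the 200-line diff limit are hypothetical values that a team would set for itself.

```python
from enum import Enum


class Route(Enum):
    AUTO_ACCEPT = "auto_accept"
    SINGLE_REVIEW = "single_review"
    PAIRED_REVIEW = "paired_review"


def route_output(stakes, model_confidence, diff_lines):
    """Pick a review path for one generated artefact.

    stakes: "low" or "high", assigned by the team rather than the model.
    model_confidence: a score in [0, 1]; only useful once it has been calibrated.
    diff_lines: size of the change; large diffs are hard to review carefully.
    """
    if stakes not in ("low", "high"):
        raise ValueError("stakes must be 'low' or 'high'")
    if not 0.0 <= model_confidence <= 1.0:
        raise ValueError("model_confidence must be between 0 and 1")
    if stakes == "high":
        return Route.PAIRED_REVIEW    # high stakes always escalate
    if model_confidence < 0.8 or diff_lines > 200:
        return Route.SINGLE_REVIEW    # uncertain or oversized: one reviewer
    return Route.AUTO_ACCEPT


print(route_output("high", 0.99, 5))
print(route_output("low", 0.95, 40))
```

Writing the rule down as code makes it reviewable in its own right, and the thresholds become something the team can tune against observed error rates.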

  7. Policy · 5 min

    Compute is policy

    Export controls, datacenter siting, and energy contracts now shape the frontier as much as research breakthroughs do.

    Access to leading-edge accelerators, interconnect, and power is unevenly distributed across countries and firms. That distribution is a policy variable, not a market outcome.

    Frontier training runs are sensitive to power availability, cooling, and network topology in ways that look more like heavy industry than software.

    Anyone forecasting AGI timelines without a view on compute supply chains is forecasting half the problem.

    Takeaway

    Read compute and energy policy alongside model releases. They are the same story.

  8. AGI capabilities · 5 min

    Generalisation is still the hard part

    Modern systems are excellent interpolators inside their training distribution and remain fragile outside it.

    Most apparent reasoning successes can be reproduced by retrieval-and-recombine over training data. Genuine extrapolation, meaning solving problems whose structure was never seen during training, remains rare.

    This is why narrow superhuman performance and embarrassing failures coexist in the same model. They are not contradictions; they are evidence of an interpolation engine being asked to extrapolate.

    Closing this gap is the central scientific question of AGI, and it is not obviously a scaling problem.

    Takeaway

    When evaluating a model, design at least one task whose structure could not be in the training data. That is the test that matters.

Looking for definitions? See the AGI glossary or the glossary search tool.