

UK AI Security Institute publishes Mythos Preview cyber scores: 73% on expert CTFs, first model to finish a 32-step range in three of ten runs

AISI’s 13 April 2026 write-up summarises controlled evaluations of Anthropic’s Claude Mythos Preview on capture-the-flag tasks and on “The Last Ones,” a 32-step simulated corporate intrusion; Opus 4.6 remains the nearest comparator on the multi-step range but trails on step count.

NewsTenet Technology desk
Palace of Westminster and Elizabeth Tower from the Thames (Wikimedia Commons). The image gives UK government-quarter context for the Department for Science, Innovation and Technology’s AI Security Institute, which published the evaluation. It is not an AISI lab photo or a screenshot of model output.

Britain’s AI Security Institute (AISI) sits inside the Department for Science, Innovation and Technology as a public research body focused on frontier-model risks—including how capable models behave on cyber tasks when evaluators grant constrained tool access and clear legal rules of engagement.

The evaluation concerns Anthropic’s Claude Mythos Preview: a gated build Anthropic routes mainly through vetted enterprise and coalition channels such as Project Glasswing, rather than the consumer Claude app most readers use for everyday drafting.

On 13 April 2026 AISI published a write-up summarising its own harness work on capture-the-flag (CTF) items and a multi-host corporate range. The desk treats that post as a lab scorecard, not as automatic confirmation of every dramatic zero-day volume claim circulating in parallel trade headlines.

What AISI measured on CTFs

Why the expert-level bar matters

AISI frames expert-level CTFs as isolation tests on which, until April 2025, no model in its suite completed any task at all. The date matters because it shows how recently the institute’s published curve sat flat on a hard floor rather than rising along a gentle slope.

The headline percentage

Against that prior line, AISI now reports Mythos Preview succeeding on 73% of expert-level CTF attempts.

How to read the surrounding charts

The same write-up walks readers through plots that tie token spend to release dates across several vendors’ flagship models, which helps situate Mythos competitively instead of treating 73% as a solitary headline. The institutional caveat still applies: a benchmark harness is not a live national network with blue teams, patch cadence, and mixed-vendor dependencies already in motion.

“The Last Ones” multi-step range

What the exercise models

Beyond single-flag puzzles, AISI describes a 32-step corporate-network simulation called The Last Ones (TLO). The institute estimates that a skilled human would need roughly 20 wall-clock hours to work from early reconnaissance through to full scripted takeover. Methodology detail sits in an arXiv paper linked from AISI’s post, for readers who want to see how the range is built before debating “autonomy.”

What Mythos scored

Mythos Preview is, in AISI’s accounting, the first model to finish the full scripted chain, doing so on 3 of 10 attempts. Averaged across attempts it reached about 22 of 32 steps, against an average of 16 for Claude Opus 4.6 on the same chart. That six-step gap signals a tier shift even before anyone argues about how well TLO maps to any one company’s LAN.
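How much certainty does “3 of 10” carry? The sketch below is a rough illustration, not AISI’s own analysis. It computes a 95% Wilson score interval for the full-chain completion rate and converts the reported mean step counts into fractions of the 32-step chain. The Wilson interval is our choice of method; AISI’s post, as summarised here, does not report one. The calculation also assumes the ten attempts are independent and equally likely to succeed, which AISI has not stated. The function name wilson_interval is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%).

    Hypothetical helper for illustration; assumes independent, identically
    distributed attempts.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p_hat + z2 / (2 * trials)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    return centre - margin, centre + margin


TOTAL_STEPS = 32  # length of The Last Ones range as described by AISI

# Figures as quoted in the article (the step counts are approximate means)
full_completions, attempts = 3, 10
mean_steps = {"Mythos Preview": 22, "Opus 4.6": 16}

low, high = wilson_interval(full_completions, attempts)
print(f"Full-chain completion: {full_completions}/{attempts}, 95% CI {low:.2f}-{high:.2f}")
for model, steps in mean_steps.items():
    print(f"{model}: {steps}/{TOTAL_STEPS} steps = {steps / TOTAL_STEPS:.0%} of the chain")
```

With these inputs the interval runs from roughly 0.11 to 0.60. Under these assumptions, ten attempts are consistent with a true full-chain rate anywhere between about one run in nine and three runs in five. On average, the step counts put Mythos Preview about 69% of the way through the chain and Opus 4.6 at 50%.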

Why averages still need distributions

Means can hide variance. A cluster of perfect runs can lift an average while most attempts stall mid-chain. Anyone quoting only headline figures therefore risks overselling certainty until AISI (or its partners) publish fuller spread data alongside the plots.
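A toy example shows how averages can hide the shape of the results. The two sets of ten runs below are hypothetical and invented for illustration; AISI’s post, as summarised here, does not publish per-run data. Both sets average exactly 22 steps on a 32-step range. In the first, three runs finish the chain and the rest stall around steps 17 and 18. In the second, every run stops at step 22 and none finishes.

```python
from statistics import mean, median, pstdev

TOTAL_STEPS = 32  # full length of the simulated range

# Hypothetical per-run step counts, invented for illustration only.
scenarios = {
    "clustered": [32, 32, 32, 18, 18, 18, 18, 18, 17, 17],  # three perfect runs lift the mean
    "uniform": [22] * 10,  # every run stalls at the same step
}

for name, runs in scenarios.items():
    completions = sum(1 for r in runs if r == TOTAL_STEPS)  # runs that reached the final step
    print(
        f"{name:>9}: mean={mean(runs):.1f} median={median(runs):.1f} "
        f"sd={pstdev(runs):.1f} full runs={completions}/{len(runs)}"
    )
```

Both sets would be reported as “about 22 of 32 steps” on average. Only the median (18 against 22), the spread and the completion count tell them apart. The published means alone cannot show which pattern Mythos Preview’s real runs resemble.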

Limits AISI still flags

Operational technology stayed out of reach

Within the same evaluation wave AISI notes Mythos Preview did not complete an operational-technology–focused cyber range. That single sentence is a useful brake on social-media storylines that compress “model did well on our hardest corporate sim” into “everything is pwned.”

Triangulation beside Anthropic’s 7 April blog

Anthropic’s own 7 April red-team essay stresses in-house exploit statistics and coordinated disclosure volume; AISI’s 13 April note supplies an outside-government read on CTFs and long-horizon ranges. Export-control drafters, insurers, and defence buyers still owe the public case-by-case evidence rather than importing either document wholesale into instant policy.

