Section Technology
UK AI Security Institute publishes Mythos Preview cyber scores: 73% on expert CTFs, first model to finish a 32-step range in three of ten runs
AISI’s 13 April 2026 write-up summarises controlled evaluations of Anthropic’s Claude Mythos Preview on capture-the-flag tasks and on “The Last Ones,” a 32-step simulated corporate intrusion; Opus 4.6 remains the nearest comparator on the multi-step range but trails on step count.
Britain’s AI Security Institute (AISI) is a public research body within the Department for Science, Innovation and Technology. It focuses on frontier-model risks, including how capable models behave on cyber tasks when evaluators grant them constrained tool access under clear legal rules of engagement.
The evaluation concerns Anthropic’s Claude Mythos Preview: a gated build Anthropic routes mainly through vetted enterprise and coalition channels such as Project Glasswing, rather than the consumer Claude app most readers use for everyday drafting.
On 13 April 2026 AISI published a write-up summarising its own harness work on capture-the-flag (CTF) tasks and a multi-host corporate range. The desk treats that post as a lab scorecard, not as confirmation of every dramatic claim about zero-day volumes circulating in trade headlines at the same time.
What AISI measured on CTFs
Why the expert-level bar matters
AISI describes expert-level CTFs as self-contained tests that, until April 2025, no model in its suite could complete at all. The date matters because, up to that point, the institute’s published curve sat flat at zero rather than rising gradually.
The headline percentage
Against that baseline, AISI now reports Mythos Preview succeeding on 73% of expert-level CTF attempts.
How to read the surrounding charts
The same write-up walks readers through plots that tie token spend to release dates across several vendors’ flagship models, which helps place Mythos among its competitors instead of treating 73% as an isolated headline. The standard caveat still applies: a benchmark harness is not a live national network, with blue teams, patch cycles, and mixed-vendor dependencies already in motion.
“The Last Ones” multi-step range
What the exercise models
Beyond single-flag puzzles, AISI describes a 32-step corporate-network simulation called The Last Ones (TLO). It estimates that a skilled human would need roughly 20 hours to work through it, from initial reconnaissance to full scripted takeover. Methodology details are in an arXiv paper linked from AISI’s post, for readers who want to see how the range was built before debating “autonomy.”
What Mythos scored
By AISI’s accounting, Mythos Preview is the first model to finish the full scripted chain, doing so on 3 of 10 attempts. Averaged across attempts it reached about 22 of 32 steps, while Claude Opus 4.6 averaged 16 steps on the same chart. That gap of roughly six steps signals a step up in capability even before anyone argues about how well TLO maps to a particular company’s network.
Why averages still need distributions
Averages can hide variance. A few perfect runs can lift the mean while most attempts stall mid-chain, so anyone quoting only headline figures risks overselling certainty until AISI (or its partners) publish the full per-run distribution alongside the plots.
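The arithmetic behind that caution can be made concrete with the TLO figures AISI reports. The sketch below is illustrative, not AISI’s method: it assumes the “about 22” mean is accurate to within half a step and that the three finished runs each counted all 32 steps. Under those assumptions, it backs out the average reached by the seven runs that did not finish.

```python
# Figures as reported for TLO: 10 attempts, 3 of which completed all 32 steps,
# with a mean of "about 22" steps. Because the mean is approximate, we try a
# small range of values around it (an assumption, not AISI data).
TOTAL_STEPS = 32
ATTEMPTS = 10
FULL_RUNS = 3


def implied_mean_of_incomplete_runs(mean, attempts, full_runs, total_steps):
    """Average steps reached by the runs that did NOT finish the chain."""
    total = mean * attempts              # steps summed over all attempts
    from_full = full_runs * total_steps  # steps contributed by finished runs
    return (total - from_full) / (attempts - full_runs)


for mean in (21.5, 22.0, 22.5):
    implied = implied_mean_of_incomplete_runs(mean, ATTEMPTS, FULL_RUNS, TOTAL_STEPS)
    print(mean, round(implied, 1))
```

On those assumptions, the seven unfinished runs averaged roughly 17 to 18 steps, not far above Opus 4.6’s reported 16. How those seven runs were spread, whether clustered or scattered, cannot be recovered without AISI’s per-run data.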
Limits AISI still flags
Operational technology stayed out of reach
In the same round of evaluations, AISI notes that Mythos Preview did not complete a cyber range focused on operational technology (OT), the systems that control physical industrial processes. That single sentence is a useful brake on social-media storylines that compress “model did well on our hardest corporate sim” into “everything is pwned.”
Triangulation beside Anthropic’s 7 April blog
Anthropic’s own 7 April red-team essay stresses in-house exploit statistics and the volume of coordinated disclosures; AISI’s 13 April note supplies an independent, government-run read on CTFs and long-horizon ranges. Export-control drafters, insurers, and defence buyers still owe the public case-by-case evidence rather than turning either document wholesale into instant policy.
Geography and themes
Related places and recurring themes for this story.
- United Kingdom
- Technology
- Cybersecurity
Suggested reading
Other stories that pair well with this one—often from the same section or on overlapping themes.

Revolut rolls out a physical Dogecoin-branded card in the U.K. and wider EEA
Calif’s Mythos-on-M5 kernel exploit story gains an official Apple footnote in macOS Tahoe 26.5 security credits
Claude Code Auto Mode routes risky tool calls through a Sonnet 4.6 classifier instead of endless taps
Anthropic’s Q1 2026 growth reads near 80× in markets coverage; Semi Analysis tallies put ARR above $44 billion
Anthropic buys Stainless, the API-to-SDK toolchain rivals including OpenAI and Google relied on
Walmart’s six new Onn Android 16 tablets from $97: spec sheet, who they beat, and who should skip them
Eric Schmidt booed at University of Arizona commencement when his speech turns to artificial intelligence
Linux "ssh-keysign-pwn" flaw reportedly lets unprivileged users read root-only files
Cooper presses Gulf partners to reopen Hormuz routes farmers need for fertiliser
China’s chip ‘Big Fund’ said to be in pole position to lead DeepSeek’s first outside raise near a $45 billion tag
Sources and external links
Sources and filings our editors consulted to verify this story.
- “Our evaluation of Claude Mythos Preview’s cyber capabilities”, AI Security Institute (UK)