
UK AI Security Institute publishes Mythos Preview cyber scores: 73% on expert CTFs, first model to finish a 32-step range in three of ten runs

AISI’s 13 April 2026 write-up summarises controlled evaluations of Anthropic’s Claude Mythos Preview on capture-the-flag tasks and on “The Last Ones,” a 32-step simulated corporate intrusion; Opus 4.6 remains the nearest comparator on the multi-step range but trails on step count.

NewsTenet Technology desk | Published 2026-05-18 | 9 min read

Britain’s AI Security Institute (AISI) sits inside the Department for Science, Innovation and Technology as a public research body focused on frontier-model risks—including how capable models behave on cyber tasks when evaluators grant constrained tool access and clear legal rules of engagement.

The evaluation concerns Anthropic’s Claude Mythos Preview: a gated build Anthropic routes mainly through vetted enterprise and coalition channels such as Project Glasswing, rather than the consumer Claude app most readers use for everyday drafting.

On 13 April 2026 AISI published a write-up summarising its own harness work on capture-the-flag (CTF) items and a multi-host corporate range. The desk treats that post as a lab scorecard, not as automatic confirmation of every dramatic zero-day volume claim circulating in parallel trade headlines.

What AISI measured on CTFs

Why the expert-level bar matters

AISI frames expert-level CTFs as isolation tests where, until April 2025, no model in its suite completed the tasks at all. That date matters because it is the last point at which the institute’s published curve still read as a hard floor rather than a gentle slope.

The headline percentage

Against that prior line, AISI now reports Mythos Preview succeeding on 73% of expert-level CTF attempts.

How to read the surrounding charts

The same write-up walks readers through plots that tie token spend to release dates across several vendors’ flagship models, which helps situate Mythos competitively instead of treating 73% as a solitary headline. The institutional caveat still applies: a benchmark harness is not a live national network with blue teams, patch cadence, and mixed-vendor dependencies already in motion.

“The Last Ones” multi-step range

What the exercise models

Beyond single-flag puzzles, AISI describes a 32-step corporate-network simulation called The Last Ones (TLO). It estimates a skilled human would need roughly 20 wall-clock hours to march from early reconnaissance through full scripted takeover. Methodology detail sits in an arXiv paper linked from AISI’s post, for readers who want to inspect how the range is built before debating “autonomy.”

What Mythos scored

Mythos Preview is, in AISI’s accounting, the first model to finish the full scripted chain, doing so on 3 of 10 attempts. Averaged across attempts it reached about 22 of 32 steps, while Claude Opus 4.6 averaged 16 steps on the same chart. That gap of roughly six steps signals a tier shift even before you argue about how well TLO maps to any one company’s LAN.

Why averages still need distributions

Means can hide variance: a cluster of perfect runs can lift an average while most attempts stall mid-chain, so speeches that quote only headline percentages still risk overselling certainty until AISI (or partners) publish fuller spread data alongside the plots.
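The point about averages can be checked with a few lines of arithmetic. The Python sketch below is the desk's own illustration, not something AISI published. It takes the reported figures (3 complete runs out of 10, a mean of about 22 of 32 steps) and works out what they imply for the seven runs that did not finish. The two per-run lists are hypothetical, invented only to show that very different spreads fit the same headline numbers; they are not AISI's per-attempt results, and the mean is treated as exactly 22 even though AISI reports it as approximate.

```python
from statistics import mean, median

# Figures as reported in the article; the mean is "about 22", so this is rounded.
TOTAL_STEPS = 32
ATTEMPTS = 10
FULL_RUNS = 3
REPORTED_MEAN = 22


def implied_mean_of_incomplete_runs(total_steps, attempts, full_runs, reported_mean):
    """Average step count of the runs that did NOT finish, given the overall mean."""
    total_progress = reported_mean * attempts        # steps summed over all runs
    progress_from_full = full_runs * total_steps     # steps contributed by complete runs
    incomplete_runs = attempts - full_runs
    return (total_progress - progress_from_full) / incomplete_runs


rest = implied_mean_of_incomplete_runs(TOTAL_STEPS, ATTEMPTS, FULL_RUNS, REPORTED_MEAN)
print(f"Implied mean for the {ATTEMPTS - FULL_RUNS} incomplete runs: {rest:.1f} steps")

# HYPOTHETICAL per-run step counts (not AISI data). Both lists contain three
# full 32-step runs and average exactly 22, yet they describe different behaviour.
clustered_mid_chain = [32, 32, 32, 18, 18, 18, 18, 18, 17, 17]
near_miss_or_nothing = [32, 32, 32, 31, 31, 31, 31, 0, 0, 0]

for label, runs in [("clustered mid-chain", clustered_mid_chain),
                    ("near miss or nothing", near_miss_or_nothing)]:
    print(f"{label}: mean={mean(runs)}, median={median(runs)}, "
          f"min={min(runs)}, max={max(runs)}")
```

Under these rounded inputs the seven incomplete runs average a little under 18 of 32 steps, which is close to the 16-step mean reported for Opus 4.6. The two invented lists share the same mean and the same count of full runs but have medians of 18 and 31, which is why a per-attempt breakdown would say more than the average alone.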

Limits AISI still flags

Operational technology stayed out of reach

Within the same evaluation wave AISI notes Mythos Preview did not complete an operational-technology–focused cyber range. That single sentence is a useful brake on social-media storylines that compress “model did well on our hardest corporate sim” into “everything is pwned.”

Triangulation beside Anthropic’s 7 April blog

Anthropic’s own 7 April red-team essay stresses in-house exploit statistics and coordinated disclosure volume; AISI’s 13 April note supplies an outside-government read on CTFs and long-horizon ranges. Export-control drafters, insurers, and defence buyers still owe the public case-by-case evidence rather than importing either document wholesale into instant policy.
