Claude Code Auto Mode routes risky tool calls through a Sonnet 4.6 classifier instead of endless taps
Anthropic’s March 2026 engineering deep dive frames Auto Mode as permission automation: a two-stage transcript filter plus a prompt-injection probe, built after internal telemetry showed users accepting 93% of manual prompts anyway.
Anthropic’s public engineering note dated 2026-03-25 introduces Auto Mode for Claude Code as a deliberate middle path between noisy per-tool approvals and the nuclear --dangerously-skip-permissions flag. Internal telemetry cited in the same article says customers already tap “approve” on roughly 93% of permission prompts, a pattern the team reads as fatigue that erodes attention rather than as proof that every action was scrutinized.
Auto Mode therefore targets autonomy with guardrails: long sessions keep running while model-based classifiers stand in for a human finger on sensitive shell, network, and cross-directory moves.
The architecture is explicitly two-layered. On the way in, a server-side prompt-injection probe inspects tool outputs—file reads, web fetches, shell transcripts, third-party plugin payloads—before they re-enter the agent, adding warnings when content looks like an attempted hijack. On the way out, a transcript classifier running on Sonnet 4.6 evaluates higher-risk actions after cheaper allow tiers clear routine in-repo edits and read-only navigation.
That classifier is deliberately two-stage: a fast single-token “block or allow” pass tuned to err toward blocking, followed by chain-of-thought reasoning only when the first stage raises a flag, so most benign traffic never pays the deeper latency tax.
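The two-stage flow can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the names `two_stage_check`, `toy_fast_pass`, and `toy_deep_review` are hypothetical, and the two toy functions are keyword stand-ins for what the article describes as model calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    allowed: bool
    stage: str   # which stage produced the final decision
    reason: str


def two_stage_check(
    action: str,
    fast_pass: Callable[[str], bool],
    deep_review: Callable[[str], Tuple[bool, str]],
) -> Verdict:
    """Run a cheap flagging pass first; run the slow review only on flagged actions."""
    # Stage 1: fast yes/no flag, assumed to be tuned toward over-flagging.
    if not fast_pass(action):
        return Verdict(True, "stage1", "not flagged by fast pass")
    # Stage 2: slower reasoning step, reached only when stage 1 raised a flag.
    allowed, reason = deep_review(action)
    return Verdict(allowed, "stage2", reason)


# Hypothetical stand-ins for the two model calls (simple keyword heuristics).
RISKY_HINTS = ("rm -rf", "curl", "push --force")


def toy_fast_pass(action: str) -> bool:
    # Returns True when the action should be flagged for deeper review.
    return any(hint in action for hint in RISKY_HINTS)


def toy_deep_review(action: str) -> Tuple[bool, str]:
    if "rm -rf" in action or "push --force" in action:
        return False, "destructive command"
    return True, "flag cleared on closer inspection"


if __name__ == "__main__":
    for cmd in ("ls src", "curl https://example.com", "rm -rf build"):
        print(cmd, "->", two_stage_check(cmd, toy_fast_pass, toy_deep_review))
```

The design point is that the expensive second call sits behind the first: an unflagged action returns immediately, so only flagged actions pay for the deeper review.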
How allow tiers and the classifier split the work
Tiered defaults described in the engineering write-up mean many coding motions never reach the classifier at all. Built-in safe-tool lists, narrow user-configured allow rules, and in-project file writes proceed without extra model calls. Broad interpreter rules that granted arbitrary execution under manual mode, by contrast, are stripped when Auto Mode starts, so the classifier actually sees dangerous commands.
Anything with real blast radius—shell invocations, web posts, subagent spawns, filesystem touches outside the repo—falls into the transcript classifier bucket, where Anthropic also documents recursive checks at subagent delegation and return to catch privilege expansion and mid-run injection drift.
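As a rough illustration of that split, the routing could look like the sketch below. Everything here is an assumption made for the example: the tool names, the `ToolCall` shape, the `route` function, and the tier labels are hypothetical and do not reflect Claude Code's actual internals or configuration format.

```python
import posixpath
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Hypothetical built-in list of read-only tools.
SAFE_TOOLS = frozenset({"read_file", "list_dir", "grep"})


@dataclass(frozen=True)
class ToolCall:
    tool: str
    target: str = ""  # a path for file tools, a command line for shell


def is_inside(path: str, root: str) -> bool:
    """True if path is root or lies under it, after collapsing '..' segments.

    Purely lexical: symlinks are not resolved in this sketch.
    """
    norm_path = posixpath.normpath(path)
    norm_root = posixpath.normpath(root)
    return norm_path == norm_root or norm_path.startswith(norm_root + "/")


def route(
    call: ToolCall,
    project_root: str,
    allow_rules: FrozenSet[Tuple[str, str]] = frozenset(),
) -> str:
    """Return the tier that handles this call; 'classifier' is the fallback."""
    if call.tool in SAFE_TOOLS:
        return "allow:safe-tool"
    if (call.tool, call.target) in allow_rules:
        return "allow:user-rule"  # narrow, exact-match rule set by the user
    if call.tool == "write_file" and is_inside(call.target, project_root):
        return "allow:in-project-write"
    # Shell, network, subagent spawns, and out-of-repo writes all land here.
    return "classifier"


if __name__ == "__main__":
    print(route(ToolCall("write_file", "/repo/src/app.py"), "/repo"))
    print(route(ToolCall("write_file", "/repo/../etc/passwd"), "/repo"))
    print(route(ToolCall("shell", "git push origin main"), "/repo"))
```

Note that the allow rules in this sketch match a tool and target exactly. A wildcard rule such as "any shell command" would defeat the purpose, which is the reason the article gives for stripping broad rules when Auto Mode starts.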
Honest limits from Anthropic’s own benchmark table
The same March post publishes sobering evaluation numbers on curated datasets. On 10,000 real internal tool calls, stage one alone posted a false-positive rate of about 8.5%, which the second stage trimmed to roughly 0.4%. On 52 real “overeager” incidents that the team had labeled as dangerous, the end-to-end pipeline still let about 17% through. That miss rate is an explicit warning that Auto Mode is aimed at teams already flirting with fully skipped permissions, not at operators who need human eyes on every production change.
Synthetic exfiltration drills on 1,000 crafted cases showed roughly 5.7% false negatives after both stages, reinforcing that the defense is statistical, not magical.
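To make those percentages concrete, the short calculation below converts each reported rate into an approximate count. It assumes each rate applies to the full stated sample size; because the article gives rounded rates, the counts are estimates rather than figures taken from Anthropic's table. The helper name `implied_count` is ours.

```python
def implied_count(total: int, rate: float) -> int:
    """Approximate number of cases implied by a rate over a sample."""
    return round(total * rate)


# 10,000 real internal tool calls: wrongly blocked actions.
stage1_false_positives = implied_count(10_000, 0.085)  # about 850
final_false_positives = implied_count(10_000, 0.004)   # about 40

# 52 real "overeager" incidents: dangerous attempts that slipped through.
overeager_missed = implied_count(52, 0.17)             # about 9

# 1,000 synthetic exfiltration cases: misses after both stages.
exfil_missed = implied_count(1_000, 0.057)             # about 57

if __name__ == "__main__":
    print(stage1_false_positives, final_false_positives)
    print(overeager_missed, exfil_missed)
```

Read this way, the second stage removes roughly 810 of the 850 needless blocks, while about 9 of the 52 labeled incidents would still have gone through.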
How practitioners should adopt it without drama
Anthropic pairs the technical detail with operational guidance: treat Auto Mode as a policy object. Defaults ship with more than twenty block families covering destructive git moves, credential fishing, silent security downgrades, and pushes across trust boundaries. Teams are encouraged to iterate on the environment slots that declare which GitHub orgs, buckets, and internal APIs count as “inside” and which are hostile externals.
Until your org has mirrored those trust boundaries in configuration, the safest mental model is “automation for medium-stakes dev sandboxes first,” then widen coverage as incidents teach the classifier what your codebase considers normal.
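One way to picture such a trust-boundary declaration is the sketch below. It is not Claude Code's configuration schema: the `TrustPolicy` class, its field names, and the example org, bucket, and host values are all hypothetical, chosen only to show how "inside" versus "external" destinations might be checked.

```python
from dataclasses import dataclass, field
from typing import Set
from urllib.parse import urlparse


@dataclass
class TrustPolicy:
    """Hypothetical declaration of what counts as 'inside' (values lowercase)."""
    github_orgs: Set[str] = field(default_factory=set)
    buckets: Set[str] = field(default_factory=set)
    internal_api_hosts: Set[str] = field(default_factory=set)

    def is_trusted_remote(self, url: str) -> bool:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        if host == "github.com":
            # The first path segment is the org or user, e.g. /acme/app.git.
            parts = [p for p in parsed.path.split("/") if p]
            return bool(parts) and parts[0].lower() in self.github_orgs
        if parsed.scheme == "s3":
            # For s3://bucket/key URLs the bucket name parses as the host.
            return host in self.buckets
        return host in self.internal_api_hosts


# Example values are placeholders, not real infrastructure.
policy = TrustPolicy(
    github_orgs={"acme"},
    buckets={"acme-artifacts"},
    internal_api_hosts={"api.internal.acme.example"},
)

if __name__ == "__main__":
    for url in (
        "https://github.com/acme/app.git",
        "https://github.com/someone-else/app.git",
        "s3://acme-artifacts/build.tar",
        "https://pastebin.com/raw/abc",
    ):
        print(url, "->", "inside" if policy.is_trusted_remote(url) else "external")
```

The sketch only handles URL-style remotes. SSH-style git remotes such as `git@github.com:org/repo` would need separate parsing, and anything the policy does not recognize is treated as external by default.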
Sources and external links
Sources and filings our editors consulted to verify this story.
- Permission modes: eliminate prompts with Auto Mode. Anthropic (Claude Code docs)