Section Technology

Claude Code Auto Mode routes risky tool calls through a Sonnet 4.6 classifier instead of endless taps

Anthropic’s March 2026 engineering deep dive frames Auto Mode as permission automation: a two-stage transcript filter plus a prompt-injection probe, built after internal telemetry showed users accepting 93% of manual prompts anyway.

NewsTenet Technology deskPublished 2026-05-186 min read

Anthropic’s public engineering note dated 2026-03-25 introduces Auto Mode for Claude Code as a deliberate middle path between noisy per-tool approvals and the nuclear --dangerously-skip-permissions flag: internal telemetry cited in the same article says customers already tap “approve” on roughly 93% of permission prompts, a pattern the team reads as fatigue that erodes attention rather than as proof every action was scrutinized.

Auto Mode therefore targets autonomy with guardrails—letting long sessions continue while model-based classifiers substitute for a human finger on sensitive shell, network, and cross-directory moves.

The architecture is explicitly two-layered. On the way in, a server-side prompt-injection probe inspects tool outputs—file reads, web fetches, shell transcripts, third-party plugin payloads—before they re-enter the agent, adding warnings when content looks like an attempted hijack. On the way out, a transcript classifier running on Sonnet 4.6 evaluates higher-risk actions after cheaper allow tiers clear routine in-repo edits and read-only navigation.

That classifier is deliberately two-stage: a fast single-token “block or allow” pass tuned to err toward blocking, followed by chain-of-thought reasoning only when the first stage raises a flag, so most benign traffic never pays the deeper latency tax.

How allow tiers and the classifier split the work

Tiered defaults described in the engineering write-up mean many coding motions never reach the classifier at all: built-in safe-tool lists, user-configured narrow allow rules, and in-project file writes can proceed without extra model calls, while broad interpreter escapes that granted arbitrary execution under manual mode are stripped when Auto Mode starts so the monitor actually sees dangerous commands.

Anything with real blast radius—shell invocations, web posts, subagent spawns, filesystem touches outside the repo—falls into the transcript classifier bucket, where Anthropic also documents recursive checks at subagent delegation and return to catch privilege expansion and mid-run injection drift.

Honest limits from Anthropic’s own benchmark table

The same March post publishes sobering evaluation numbers on curated datasets: on 10,000 real internal tool calls, stage one alone posted about an 8.5% false-positive rate before the second stage trimmed it to roughly 0.4%, while on 52 real “overeager” incidents the end-to-end pipeline still let through about 17% of dangerous attempts the team had labeled—an explicit warning that Auto Mode is aimed at teams already flirting with fully skipped permissions, not at operators who need human eyes on every production change.

Synthetic exfiltration drills on 1,000 crafted cases showed roughly 5.7% false negatives after both stages, reinforcing that the defense is statistical, not magical.

How practitioners should adopt it without drama

Anthropic pairs the technical detail with operational guidance: treat Auto Mode as a policy object—defaults ship with more than twenty block families covering destructive git moves, credential fishing, silent security downgrades, and pushes across trust boundaries—while encouraging teams to iterate environment slots that declare which GitHub orgs, buckets, and internal APIs count as “inside” versus hostile externals.

Until your org has mirrored those trust boundaries in configuration, the safest mental model is “automation for medium-stakes dev sandboxes first,” then widen coverage as incidents teach the classifier what your codebase considers normal.

Geography and themes

Related places and recurring themes for this story.

Sources and external links

Sources and filings our editors consulted to verify this story. External links open in a new tab.

Claude Code auto mode: a safer way to skip permissions (Anthropic Engineering, Mar 2026) (opens in a new tab)— Anthropic
Permission modes — eliminate prompts with Auto Mode (Claude Code docs) (opens in a new tab)— Anthropic (Claude Code docs)
Auto mode for Claude Code (Claude product blog) (opens in a new tab)— Claude

Claude Code Auto Mode routes risky tool calls through a Sonnet 4.6 classifier instead of endless taps

How allow tiers and the classifier split the work

Honest limits from Anthropic’s own benchmark table

How practitioners should adopt it without drama

Geography and themes

Suggested reading

Google I/O 2026 Pushes Always-On Gemini Agent

Anthropic’s Q1 2026 growth reads near 80× in markets coverage; Semi Analysis tallies put ARR above $44 billion

Anthropic buys Stainless, the API-to-SDK toolchain rivals including OpenAI and Google relied on

Eric Schmidt booed at University of Arizona commencement when his speech turns to artificial intelligence

Google CLI Links OpenClaw to Gmail Unsupported

Walmart’s six new Onn Android 16 tablets from $97: spec sheet, who they beat, and who should skip them

Oakland jury shuts Musk’s OpenAI fight on a clock question, not the ‘betrayed lab’ plot

Calif’s Mythos-on-M5 kernel exploit story gains an official Apple footnote in macOS Tahoe 26.5 security credits

Microsoft AI chief’s “12–18 months” white-collar forecast: what Mustafa Suleyman actually said

Elon Musk vs Sam Altman trial: what the OpenAI fight is about

Keep exploring

Sources and external links