Skip to main content

Section Technology

Claude Code Auto Mode routes risky tool calls through a Sonnet 4.6 classifier instead of endless taps

Anthropic’s March 2026 engineering deep dive frames Auto Mode as permission automation: a two-stage transcript filter plus a prompt-injection probe, built after internal telemetry showed users accepting 93% of manual prompts anyway.

NewsTenet Technology deskPublished 6 min read
Anthropic wordmark (Wikimedia Commons)—corporate logo only; not a terminal screenshot of Claude Code, a classifier dashboard, or permission prompt UI.

Anthropic’s public engineering note dated 2026-03-25 introduces Auto Mode for Claude Code as a deliberate middle path between noisy per-tool approvals and the nuclear --dangerously-skip-permissions flag: internal telemetry cited in the same article says customers already tap “approve” on roughly 93% of permission prompts, a pattern the team reads as fatigue that erodes attention rather than as proof every action was scrutinized.

Auto Mode therefore targets autonomy with guardrails—letting long sessions continue while model-based classifiers substitute for a human finger on sensitive shell, network, and cross-directory moves.

The architecture is explicitly two-layered. On the way in, a server-side prompt-injection probe inspects tool outputs—file reads, web fetches, shell transcripts, third-party plugin payloads—before they re-enter the agent, adding warnings when content looks like an attempted hijack. On the way out, a transcript classifier running on Sonnet 4.6 evaluates higher-risk actions after cheaper allow tiers clear routine in-repo edits and read-only navigation.

That classifier is deliberately two-stage: a fast single-token “block or allow” pass tuned to err toward blocking, followed by chain-of-thought reasoning only when the first stage raises a flag, so most benign traffic never pays the deeper latency tax.

How allow tiers and the classifier split the work

Tiered defaults described in the engineering write-up mean many coding motions never reach the classifier at all: built-in safe-tool lists, user-configured narrow allow rules, and in-project file writes can proceed without extra model calls, while broad interpreter escapes that granted arbitrary execution under manual mode are stripped when Auto Mode starts so the monitor actually sees dangerous commands.

Anything with real blast radius—shell invocations, web posts, subagent spawns, filesystem touches outside the repo—falls into the transcript classifier bucket, where Anthropic also documents recursive checks at subagent delegation and return to catch privilege expansion and mid-run injection drift.

Honest limits from Anthropic’s own benchmark table

The same March post publishes sobering evaluation numbers on curated datasets: on 10,000 real internal tool calls, stage one alone posted about an 8.5% false-positive rate before the second stage trimmed it to roughly 0.4%, while on 52 real “overeager” incidents the end-to-end pipeline still let through about 17% of dangerous attempts the team had labeled—an explicit warning that Auto Mode is aimed at teams already flirting with fully skipped permissions, not at operators who need human eyes on every production change.

Synthetic exfiltration drills on 1,000 crafted cases showed roughly 5.7% false negatives after both stages, reinforcing that the defense is statistical, not magical.

How practitioners should adopt it without drama

Anthropic pairs the technical detail with operational guidance: treat Auto Mode as a policy object—defaults ship with more than twenty block families covering destructive git moves, credential fishing, silent security downgrades, and pushes across trust boundaries—while encouraging teams to iterate environment slots that declare which GitHub orgs, buckets, and internal APIs count as “inside” versus hostile externals.

Until your org has mirrored those trust boundaries in configuration, the safest mental model is “automation for medium-stakes dev sandboxes first,” then widen coverage as incidents teach the classifier what your codebase considers normal.

Geography and themes

Related places and recurring themes for this story.

Sources and external links

Sources and filings our editors consulted to verify this story. External links open in a new tab.