Skip to main content
WardenOpen-source AI scannerExplore →
Layer 4 · Adaptive Immune System

Agents go rogue. Don't just block them. Quarantine them.

Inline blocking is a binary. Adaptive Defense is a continuum — escalate monitoring when behavior drifts, inject canaries when sensitive data flows, quarantine when provenance is violated. Five stages. Three modes. One outcome: the agent stops before damage.

Adaptive Defense is two coordinated subsystems. Canary Injection embeds zero-width markers — invisible to the LLM, meaningless to a human, but cryptographically traceable — into tool outputs that touch sensitive data. When those markers reappear in the arguments of an outbound tool call, provenance has been violated and the agent is exfiltrating. Quarantine Intensity is the response system: monitoring escalates through three modes (Normal → Enhanced → Lockdown), each with a defined action set. The Behavioral Causality Engine drives the escalation by classifying every anomaly as LIKELY_ATTACK, LIKELY_HALLUCINATION, DEGRADATION, or INCONCLUSIVE.

The pattern

From normal behavior to operator review.

One canonical sequence. Most agent sessions stay at stage 01. The ones that escalate traverse all five — visibly, auditably, signed at every step.

01

Normal

Agent operates within its established behavioral envelope. Tool calls traced; behavioral baselines (EWMA per agent × per tool) updated continuously. No restrictions.

BASELINE
02

Anomaly Detected

Behavioral Causality Engine flags an unexplained sequence — argument distribution outside baseline, unexpected tool combination, time-of-day deviation. Monitoring intensity bumps NORMAL → ENHANCED. All calls now log full payloads; canary tokens injected into responses.

ENHANCED
03

Canary Triggered

A zero-width canary marker — invisible to the LLM, embedded in earlier tool output — appears in the arguments of a destination tool. Provenance violated. The agent is reading sensitive data and trying to send it somewhere unauthorized.

ENHANCED
04

Quarantine Activated

Critical canary classification triggers LOCKDOWN. Agent flips to read-only mode. JIT ephemeral tokens revoked. Optional context scrub purges the agent's working memory of the sensitive data. CISO alert dispatched. Audit chain entry signed.

LOCKDOWN
05

Operator Review

Quarantine state persists until operator reviews the audit chain. Three options: (a) clear and resume — false positive; (b) terminate the session and re-credential the agent; (c) escalate to incident response with the full provenance graph attached.

LOCKDOWN
01 · Canary injection

Zero-width markers track every byte.

When a tool call returns sensitive data — customer records, financial figures, credentials — the response is augmented with a cryptographically-derived zero-width marker before it reaches the agent.

The marker is invisible to the LLM (it parses past it without acknowledgment) and meaningless to a human reading the output. But if it ever appears in the argument of a downstream tool call — say, an agent attempting to http_post a body containing the marker — the proxy detects provenance violation immediately.

Markers are scoped to the originating call so the proxy can answer not just did exfiltration occur, but which tool call was the source, and which destination tool was the egress. Full chain-of-custody.

02 · Quarantine intensity

Three modes. One escalation.

NORMAL
Baseline behavioral monitoring. EWMA per agent × tool.
ENHANCED
Full payload logging. Canary injection on every sensitive output. Behavioral re-baseline.
LOCKDOWN
Read-only mode. JIT revoke. Optional context scrub. CISO page. Operator review required.

Mode escalation is one-way without operator intervention. ENHANCED can step up to LOCKDOWN automatically; stepping down requires a signed audit-chain entry from a human.

03 · Causality classification

Not every anomaly is an attack.

The Behavioral Causality Engine classifies each anomaly into one of five categories. Only two trigger active mode change. The rest are logged for trend analysis. False positives are expensive — escalation is restrained by design.

ClassificationResponse modeAction
LIKELY_ATTACKLockdownRead-only flip · JIT revoke · context scrub · CISO page · audit chain entry.
LIKELY_HALLUCINATIONEnhancedFull payload logging · canary injection · behavioral re-baseline · operator notification.
DEGRADATIONEnhancedFull payload logging · canary injection · behavioral re-baseline.
POSSIBLE_HALLUCINATIONNo-op (observe)Logged for trend analysis. No mode change. Operator dashboard tile updated.
INCONCLUSIVENo-op (observe)Logged for trend analysis. No mode change.

Compromise happens.
Damage doesn't have to.

ToolGuard blocks. Adaptive Defense quarantines. The first stops the call. The second stops the agent.

We use cookies for analytics to understand how visitors use our site. No advertising cookies. Privacy Policy