Multi-tier prompt injection detection for autonomous systems. Three independent detection methods compose into a weighted ensemble that returns actionable verdicts with confidence scores — not just pass/fail, but why and how confident.
Any agent that processes untrusted input is vulnerable to prompt injection. A single-pass classifier is not enough — sophisticated attacks fragment instructions across turns or mimic legitimate authority patterns. Warden uses three independent detection tiers to catch what any single method would miss, and returns structured verdicts your agent can act on programmatically: confidence scores, threat categories, and specific evidence.
Every tier is a standalone tool: call warden.canary alone for fast protocol checks, warden.intent for semantic analysis, or warden.adjudicate for deep adversarial evaluation. When you need the full pipeline, warden.ensemble runs all three and returns a single weighted verdict. Sessions are optional: pass a session_id to track suspicion across conversation turns, or omit it for stateless single-shot checks.
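As a rough illustration of what "a single weighted verdict" can mean, the sketch below combines three tier confidences into one score with a weight-normalized average. The weights, the `TierResult` type, and the `weighted_verdict` function are hypothetical, chosen only for this example; Warden's actual weighting is not documented here.

```python
from dataclasses import dataclass


@dataclass
class TierResult:
    name: str          # tier name, e.g. "canary", "intent", "adjudicate"
    confidence: float  # 0.0 (looks benign) to 1.0 (almost certainly an attack)


# Hypothetical weights for illustration only; not Warden's real values.
WEIGHTS = {"canary": 0.2, "intent": 0.3, "adjudicate": 0.5}


def weighted_verdict(results: list[TierResult]) -> float:
    """Return the weight-normalized average of the tier confidences."""
    total_weight = sum(WEIGHTS[r.name] for r in results)
    if total_weight == 0:
        raise ValueError("no weighted tiers supplied")
    weighted_sum = sum(WEIGHTS[r.name] * r.confidence for r in results)
    return weighted_sum / total_weight
```

Normalizing by the total weight means the same function still yields a score in the 0 to 1 range when only a subset of tiers has been run.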
Each tier returns structured output your agent can branch on: a composite confidence score (0–1), categorized threat types, and specific evidence excerpts. Use the confidence threshold to control sensitivity — strict for financial operations, lenient for casual interactions. The ensemble combines all three tiers into a single weighted verdict so your agent makes one decision, not three.
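A minimal sketch of branching on a verdict with a per-context threshold might look like the following. The `Verdict` fields mirror the three outputs described above, but the field names, the threshold values, and the `decide` helper are assumptions made for this example rather than Warden's actual response schema.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    confidence: float                                          # composite score, 0 to 1
    threat_categories: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)          # excerpts that triggered detection


# Illustrative thresholds: a lower value blocks more aggressively.
THRESHOLDS = {"strict": 0.3, "lenient": 0.8}


def decide(verdict: Verdict, mode: str) -> str:
    """Return "block" when the score meets the threshold for this mode, else "allow"."""
    return "block" if verdict.confidence >= THRESHOLDS[mode] else "allow"
```

With these example values, a verdict scored at 0.5 would be blocked in front of a financial operation running in "strict" mode but allowed in a casual chat running in "lenient" mode.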
// Place Warden as a gate step inside flow.into
{
  "plan": [
    {
      "id": "check",
      "tool": "data-grout@1/warden.ensemble@1",
      "args": { "content": "$input.user_message", "on_fail": "block" }
    },
    {
      "id": "execute",
      "tool": "quickbooks@1/create-invoice@1",
      "args": { "...": "..." }
    }
  ]
}
If the check step detects an attack, the workflow halts before reaching the downstream tool. The verdict includes enough detail for logging, alerting, or escalation to a human reviewer.
Warden's semantic analysis tiers use AI on the first evaluation, but detection patterns and adjudication rules are cached. Repeated content patterns resolve faster as the system builds familiarity with your traffic. Multi-turn session tracking accumulates context across conversation turns without re-analyzing prior messages from scratch.
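One way to picture session tracking that avoids re-analyzing earlier messages is a running suspicion score per session_id, updated once per turn. The sketch below is an assumption about how such accumulation could work, not a description of Warden's internals; the `SessionTracker` class and its decay factor are hypothetical.

```python
from collections import defaultdict


class SessionTracker:
    """Keeps one running suspicion score per session (hypothetical design)."""

    def __init__(self, decay: float = 0.5):
        self.decay = decay                      # how much prior suspicion carries over
        self._suspicion = defaultdict(float)    # session_id -> score in [0, 1]

    def update(self, session_id: str, turn_confidence: float) -> float:
        """Fold the newest turn's score into the session total and return it."""
        carried = self._suspicion[session_id] * self.decay
        score = min(1.0, carried + turn_confidence)
        self._suspicion[session_id] = score
        return score
```

Because only the carried-over score is stored, each turn costs one update regardless of conversation length, and an attack fragmented across several mildly suspicious turns can still push the session total past a threshold.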