← All tools

Warden Tools

Multi-tier prompt injection detection for autonomous systems. Three independent detection methods compose into a weighted ensemble that returns actionable verdicts with confidence scores — not just pass/fail, but why and how confident.

Why this matters

Any agent that processes untrusted input is vulnerable to prompt injection. A single-pass classifier is not enough — sophisticated attacks fragment instructions across turns or mimic legitimate authority patterns. Warden uses three independent detection tiers to catch what any single method would miss, and returns structured verdicts your agent can act on programmatically: confidence scores, threat categories, and specific evidence.

Three detection tiers

  • T1warden.canary — Protocol adherence: runs verification tasks that injected instructions will interfere with, producing detectable failures
  • T2warden.intent — Semantic analysis: surfaces what the content is trying to accomplish and compares it against your declared execution context
  • T3warden.adjudicate — Adversarial adjudication: extracts structured threat facts, evaluates them against a symbolic rule engine, and returns categorized evidence
  • Σwarden.ensemble — All three tiers in one call with weighted composite scoring, confidence levels, and multi-turn session tracking

Use each tier solo or combine them

Every tier is a standalone tool — call warden.canary alone for fast protocol checks, warden.intent for semantic analysis, or warden.adjudicate for deep adversarial evaluation. When you need the full pipeline, warden.ensemble runs all three and returns a single weighted verdict. Sessions are optional: pass a session_id to track suspicion across conversation turns, or omit it for stateless single-shot checks.

Actionable verdicts

Each tier returns structured output your agent can branch on: a composite confidence score (0–1), categorized threat types, and specific evidence excerpts. Use the confidence threshold to control sensitivity — strict for financial operations, lenient for casual interactions. The ensemble combines all three tiers into a single weighted verdict so your agent makes one decision, not three.

Example: use as a workflow gate

// Place Warden as a gate step inside flow.into
{
  "plan": [
    {
      "id": "check",
      "tool": "data-grout@1/warden.ensemble@1",
      "args": { "content": "$input.user_message", "on_fail": "block" }
    },
    {
      "id": "execute",
      "tool": "quickbooks@1/create-invoice@1",
      "args": { "...": "..." }
    }
  ]
}

If the check step detects an attack, the workflow halts before reaching the downstream tool. The verdict includes enough detail for logging, alerting, or escalation to a human reviewer.

Progressive efficiency

Warden's semantic analysis tiers use AI on the first evaluation, but detection patterns and adjudication rules are cached. Repeated content patterns resolve faster as the system builds familiarity with your traffic. Multi-turn session tracking accumulates context across conversation turns without re-analyzing prior messages from scratch.

Composes with

Insert as a gate step in any Flow workflow — Warden checks the input, and flow.route branches based on the verdict confidence. Use multi-turn sessions with Logic tools to accumulate suspicion across conversation turns and trigger escalation when a threshold is crossed.