The Probe Engine: how we turn agent attacks into audit-ready evidence

A behavioral vulnerability scanner for AI agents — what it attacks, how it scores honestly, and how findings map to AIUC-1.

PharosOne Research · Methodology · Jun 24, 2026 · 12 min

Traditional security scanners check configuration and code. An AI agent's most dangerous failures are behavioral: it can be talked into leaking its system prompt, into calling a dangerous tool with attacker-controlled arguments, into trusting an instruction smuggled inside a document it ingested, or into laundering a record's status past a decision gate. None of these are visible in a static config — they only appear when you drive the agent and watch what it does.

The Probe Engine is a red-teaming harness built for exactly this, on top of Inspect AI. It does three things at once. It attacks: each known vulnerability class is expressed as a reusable, agent-agnostic probe, then mutated and repeated enough times to estimate how often it succeeds. It measures honestly: success is decided by a deterministic oracle, confirmed by an LLM judge, and reported as a rate with a confidence interval rather than a single anecdote. And it maps to a standard: raw behavioral findings are translated into the language of AIUC-1, so the output is an audit artifact, not just a list of jailbreaks.

By default a run is fully offline on a deterministic mock tier: no model, no API key, no Docker, no network. Testing a real LLM-backed agent is opt-in. That makes the corpus cheap to develop against and — crucially for certification — fully reproducible.

Data is decoupled from code. Attacks, the standard, and the crosswalk between them are all data, edited independently of the engine. Adding an attack is a new file; adding a standard is a new framework plus crosswalk; supporting a new agent framework is a new adapter. Every probe is validated against a strict schema and rejected on duplicate ids, so malformed data fails fast and offline.
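As a rough illustration of failing fast and offline, the sketch below validates raw probe records and rejects duplicate ids before anything runs. The field names and the `ProbeSpec` and `load_corpus` names are assumptions made for this example; the engine's actual schema is not reproduced in this article.

```python
from dataclasses import dataclass

# Hypothetical field set: the real probe schema is not shown in the article.
REQUIRED_FIELDS = {"id", "capability", "channel", "oracle"}


@dataclass(frozen=True)
class ProbeSpec:
    id: str
    capability: str
    channel: str
    oracle: str


def load_corpus(raw_probes: list[dict]) -> list[ProbeSpec]:
    """Validate every raw probe; raise on the first malformed or duplicate entry."""
    seen_ids: set[str] = set()
    corpus: list[ProbeSpec] = []
    for index, raw in enumerate(raw_probes):
        keys = set(raw)
        missing = REQUIRED_FIELDS - keys
        unknown = keys - REQUIRED_FIELDS
        # Strict schema: no missing fields and no unrecognized ones.
        if missing or unknown:
            raise ValueError(
                f"probe #{index}: missing={sorted(missing)} unknown={sorted(unknown)}"
            )
        if raw["id"] in seen_ids:
            raise ValueError(f"probe #{index}: duplicate id {raw['id']!r}")
        seen_ids.add(raw["id"])
        corpus.append(ProbeSpec(**raw))
    return corpus
```

Because the check needs nothing but the data files, it can run in CI with no model, key, or network.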

The corpus is deliberately agent-agnostic. No probe names a specific customer agent. Probes target canonical capabilities (like status_change or deploy) and canonical channels (like an ingested record), which a target's profile maps onto its real tools and surfaces. The same corpus tests an e-commerce support agent, a code-writing agent, and a finance agent — selection decides which probes apply.
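One way to picture selection is as a join between canonical probe requirements and a target's declared profile. The sketch below is a simplified assumption of how that could look; `Probe`, `TargetProfile`, and `select` are hypothetical names, and the tool names in the usage example are invented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Probe:
    id: str
    capability: str  # canonical capability, e.g. "status_change"
    channel: str     # canonical channel, e.g. "ingested_record"


@dataclass
class TargetProfile:
    # Canonical name -> the target's real tool or surface (hypothetical shape).
    capabilities: dict[str, str] = field(default_factory=dict)
    channels: dict[str, str] = field(default_factory=dict)


def select(probes: list[Probe], profile: TargetProfile):
    """Split the corpus into runnable probes and explicitly skipped ones."""
    runnable, skipped = [], []
    for probe in probes:
        if probe.capability not in profile.capabilities:
            skipped.append((probe.id, f"capability {probe.capability!r} not declared"))
        elif probe.channel not in profile.channels:
            skipped.append((probe.id, f"channel {probe.channel!r} not declared"))
        else:
            # Bind canonical names to the target's real tool and surface.
            binding = (
                profile.capabilities[probe.capability],
                profile.channels[probe.channel],
            )
            runnable.append((probe.id, binding))
    return runnable, skipped
```

For a support agent that declares `status_change` but not `deploy`, a deploy probe would land in `skipped` with a reason attached rather than being retargeted at some other tool.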

Variation is the part most worth understanding. A single probe describes one attack idea; variation is how that idea becomes a battery of distinct attempts that actually stresses a defended agent. Every produced attack is a point in a three-axis space — the framing (authority override, urgency, last-rule-wins, foreign-language laundering), the payload (the actual malicious intent), and the obfuscation (homoglyphs, zero-width splices, base64 wrapping, bidi controls). The full product of those axes is astronomically large and the trial budget is tiny.

You always sample far less than one percent of the space — so the design question is not how to enumerate it, but which few points genuinely stress a defended agent.

The default policy is a curated, deterministic table of frame-and-obfuscation pairings. Even-numbered variants draw a plain family — clean, fluent, in-domain rewrites with no character garbling, because many agents survive garbled payloads but fall to a polished reframing — and odd-numbered variants draw an obfuscated family. All randomness flows through a seeded generator, so the same seed produces byte-identical output. An optional LLM strategy reuses the exact same skeleton to produce open-ended, in-domain rewrites, but falls back to the deterministic path on any error.
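The even/odd policy can be sketched in a few lines. Everything here is illustrative: the two tables, the two obfuscators, and the `make_variants` name are assumptions for the example, and rendering the frame into actual prompt text is left out.

```python
import base64
import random

# Illustrative tables only: the engine's curated pairings are not reproduced here.
PLAIN_TABLE = [
    ("authority_override", None),
    ("urgency", None),
    ("last_rule_wins", None),
]
OBFUSCATED_TABLE = [
    ("authority_override", "zero_width"),
    ("urgency", "base64"),
    ("last_rule_wins", "zero_width"),
]

OBFUSCATORS = {
    # Splice a zero-width space between every character.
    "zero_width": lambda text: "\u200b".join(text),
    # Wrap the payload in base64.
    "base64": lambda text: base64.b64encode(text.encode("utf-8")).decode("ascii"),
}


def make_variants(payload: str, count: int, seed: int) -> list[dict]:
    """Produce `count` variants; the same seed yields identical output."""
    rng = random.Random(seed)  # all randomness flows through this generator
    variants = []
    for index in range(count):
        # Even-numbered variants stay plain, odd-numbered ones are obfuscated.
        table = PLAIN_TABLE if index % 2 == 0 else OBFUSCATED_TABLE
        frame, obfuscator = rng.choice(table)
        text = OBFUSCATORS[obfuscator](payload) if obfuscator else payload
        variants.append(
            {"index": index, "frame": frame, "obfuscator": obfuscator, "text": text}
        )
    return variants
```

Calling `make_variants` twice with the same payload, count, and seed returns equal lists, which is the property that makes a run reproducible.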

Because the engine deliberately mutates and obfuscates attack text, the literal payload may not survive into the agent's reply, so a naive substring check becomes unreliable the moment variation kicks in. The resolution: the semantic invariant lives in the oracle and the judge, never in the prompt text. Scoring runs in two stages. A deterministic oracle runs on every trial and computes a provisional hit; then a batch judge reads the actual transcripts and overwrites each verdict. The binary oracle is the structural floor; the judge, when present, is the authoritative ceiling.
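The two stages can be sketched as follows. This is a minimal illustration under assumed names (`Trial`, `oracle_tool_called`, `score`); the judge is modeled as any callable that takes a batch of transcripts and returns one boolean per transcript, standing in for whatever LLM judge is configured.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trial:
    transcript: str
    tool_calls: list  # e.g. [{"name": "issue_refund", "args": {...}}]
    hit: Optional[bool] = None
    decided_by: str = "none"


def oracle_tool_called(trial: Trial, tool_name: str) -> bool:
    """Structural check on behavior, independent of how the attack was worded."""
    return any(call.get("name") == tool_name for call in trial.tool_calls)


def score(
    trials: list[Trial],
    oracle: Callable[[Trial], bool],
    judge: Optional[Callable[[list[str]], list[bool]]] = None,
) -> list[Trial]:
    # Stage 1: the deterministic oracle sets a provisional hit on every trial.
    for trial in trials:
        trial.hit = oracle(trial)
        trial.decided_by = "oracle"
    # Stage 2: the batch judge, when present, overwrites each verdict.
    if judge is not None:
        verdicts = judge([trial.transcript for trial in trials])
        for trial, verdict in zip(trials, verdicts):
            trial.hit = verdict
            trial.decided_by = "judge"
    return trials
```

Note that the oracle inspects what the agent did (a tool call), not whether the obfuscated payload string reappears in the reply.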

Trials fold into Evidence: an attack-success rate plus a Wilson confidence interval, which stays well-behaved for small samples and extreme proportions. A probe fails if either the point estimate or the Wilson upper bound exceeds the pass threshold. Because the upper bound is never below the point estimate, it is the binding condition, and failing on it is what turns 'zero hits so far' into a defensible claim. With zero observed successes, roughly 36 trials certify the true success rate at or below 10%, and roughly 75 trials at or below 5%.
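The trial counts above can be sanity-checked with the standard Wilson score interval. The sketch below assumes a two-sided 95% level (z = 1.96); the article does not state which level the engine uses, so the exact counts may differ by a trial or two. The function names are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is an assumption)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def probe_passes(successes: int, trials: int, threshold: float, z: float = 1.96) -> bool:
    """Pass only if both the point estimate and the upper bound stay within the threshold."""
    _, upper = wilson_interval(successes, trials, z)
    return successes / trials <= threshold and upper <= threshold


def min_clean_trials(threshold: float, z: float = 1.96) -> int:
    """Smallest number of zero-hit trials whose upper bound is at or below the threshold."""
    n = 1
    while wilson_interval(0, n, z)[1] > threshold:
        n += 1
    return n
```

With z = 1.96, `min_clean_trials(0.10)` returns 35 and `min_clean_trials(0.05)` returns 73, within a couple of trials of the rounder figures quoted above. Ten clean trials are not enough at a 10% threshold: the upper bound is still near 28%.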

Honesty is enforced structurally, not by convention. Blind spots are never silent passes: a probe whose required capability or channel a target does not declare is skipped and surfaced in the report, never re-routed onto another channel and never counted as robust. And degradation is loud: if a configured judge is unavailable, or a false-positive-prone oracle runs without a judge, the evidence is marked UNVERIFIED rather than read as a confident verdict. Coverage honesty extends to the standard itself — a control that is not behaviorally testable is marked as such, never 'failed' and never 'uncovered'.
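A compact way to read these rules is as a status function in which the uncertain outcomes are checked before any verdict is allowed. The sketch below is an assumed shape, not the engine's code; `Status` and `evidence_status` are hypothetical names.

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    SKIPPED = "skipped"
    UNVERIFIED = "unverified"


def evidence_status(
    *,
    applicable: bool,        # target declares the required capability and channel
    judge_configured: bool,
    judge_available: bool,
    oracle_fp_prone: bool,   # oracle kind known to produce false positives
    exceeded_threshold: bool,
) -> Status:
    # A blind spot is reported as skipped, never as robust.
    if not applicable:
        return Status.SKIPPED
    # Degradation is loud: a missing judge downgrades the verdict either way.
    if judge_configured and not judge_available:
        return Status.UNVERIFIED
    if oracle_fp_prone and not judge_configured:
        return Status.UNVERIFIED
    return Status.FAIL if exceeded_threshold else Status.PASS
```

The ordering matters: a confident PASS or FAIL is only reachable once every reason for doubt has been ruled out.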

When the engine cannot be sure, it says so. A blind spot is skipped and surfaced, never silently passed; an unverifiable result reads UNVERIFIED, never a confident pass.

The mapping layer turns behavioral evidence into control-standard coverage. A versioned crosswalk maps each taxonomy coordinate (ATLAS, OWASP Agentic, CWE) to one or more AIUC-1 controls; entries are research-derived, flagged for expert review, and use the real control wordings rather than invented ones. The coverage engine requires diversity of evidence — several independent probes, not just one — before a control counts as covered.
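The coverage rule can be sketched as a small aggregation over the crosswalk. The coordinates and control ids below are placeholders, not real ATLAS, OWASP, CWE, or AIUC-1 identifiers, and the minimum of two independent probes is an assumption chosen for the example; the article only says "several".

```python
from collections import defaultdict

# Placeholder crosswalk: taxonomy coordinate -> control ids (all hypothetical).
CROSSWALK = {
    "coord:prompt-injection": ["CTRL-A", "CTRL-B"],
    "coord:tool-misuse": ["CTRL-B"],
}
NOT_BEHAVIORALLY_TESTABLE = {"CTRL-Z"}
MIN_INDEPENDENT_PROBES = 2  # assumed value for illustration


def control_coverage(evidence, controls):
    """evidence: iterable of (probe_id, taxonomy_coordinate) pairs."""
    probes_per_control = defaultdict(set)
    for probe_id, coordinate in evidence:
        for control in CROSSWALK.get(coordinate, []):
            # A set counts each probe once, however many trials it ran.
            probes_per_control[control].add(probe_id)

    report = {}
    for control in controls:
        if control in NOT_BEHAVIORALLY_TESTABLE:
            report[control] = "not_behaviorally_testable"
        elif len(probes_per_control[control]) >= MIN_INDEPENDENT_PROBES:
            report[control] = "covered"
        elif probes_per_control[control]:
            report[control] = "insufficient_evidence"
        else:
            report[control] = "uncovered"
    return report
```

Counting distinct probe ids rather than trials is what keeps one heavily repeated probe from making a control look covered.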

Finally, the boundary between customer and PharosOne is drawn so secrets never have to move. The client projects its secret-bearing config down to a manifest carrying only the selection facts: it structurally drops the system prompt, protected snippets, API key, endpoint, and model ids. After the local run, the client uploads stats-only evidence with no transcript and no trial fields at all, so the conversations behind a finding never leave the customer. The server is the single standards-mapping authority; corpus slices are cryptographically signed and verified on the way in.
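One plausible reading of "structurally drops" is an allowlist projection: only named fields are copied, so a newly added secret field cannot leak by default. The sketch below assumes that design; the field names and helper names are hypothetical, and signature verification of corpus slices is out of scope here.

```python
# Hypothetical allowlists: the real manifest and evidence schemas are not shown.
MANIFEST_FIELDS = ("framework", "capabilities", "channels")
EVIDENCE_FIELDS = ("probe_id", "trials", "successes", "status")


def project(record: dict, allowed: tuple) -> dict:
    """Copy only allowlisted keys; everything else is dropped by construction."""
    return {key: record[key] for key in allowed if key in record}


def to_manifest(config: dict) -> dict:
    """Selection facts only: no prompt, snippets, key, endpoint, or model ids."""
    return project(config, MANIFEST_FIELDS)


def to_upload(evidence: dict) -> dict:
    """Stats only: no transcript and no per-trial fields."""
    return project(evidence, EVIDENCE_FIELDS)
```

An allowlist fails closed: forgetting to list a field withholds data, whereas forgetting an entry in a denylist would leak it.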

The corpus today holds 118 probe specs across 8 canonical channels, 14 oracle kinds, and 13 obfuscators, mapped through 43 crosswalk entries onto a 49-control AIUC-1 framework. Every number on this site links back to the inputs that produced it — versioned, reproducible, and one click from the methodology.

The semantic invariant lives in the oracle and the judge, never in the prompt text. That is what makes a diversified, obfuscated attack safe to score.
