Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening
Abstract
As large language models (LLMs) evolve into autonomous agents, their real-world applicability has expanded significantly, accompanied by new security challenges. Most existing agent defense mechanisms adopt a mandatory checking paradigm, in which security validation is forcibly triggered at predefined stages of the agent lifecycle. In this work, we argue that effective agent security should be intrinsic and selective rather than architecturally decoupled and mandatory. We propose Spider-Sense, an event-driven defense framework based on Intrinsic Risk Sensing (IRS), which allows agents to maintain latent vigilance and trigger defenses only upon risk perception. Once triggered, Spider-Sense invokes a hierarchical defense mechanism that trades off efficiency and precision: it resolves known patterns via lightweight similarity matching while escalating ambiguous cases to deep internal reasoning, thereby eliminating reliance on external models. To facilitate rigorous evaluation, we introduce S^2Bench, a lifecycle-aware benchmark featuring realistic tool execution and multi-stage attacks. Extensive experiments demonstrate that Spider-Sense achieves competitive or superior defense performance, attaining the lowest Attack Success Rate (ASR) and False Positive Rate (FPR), with only a marginal latency overhead of 8.3%.
One-sentence Summary
Researchers from SUFE, NUS, QuantaAlpha, and others propose SPIDER-SENSE, an event-driven agent defense framework using Intrinsic Risk Sensing to trigger selective, hierarchical security responses—reducing reliance on external models—validated via S²Bench, achieving low attack success and false positive rates with minimal latency.
Key Contributions
- SPIDER-SENSE introduces Intrinsic Risk Sensing (IRS), an event-driven defense mechanism that embeds risk awareness directly into the agent’s execution flow, enabling selective security checks only when anomalies are detected, rather than enforcing mandatory checks at every lifecycle stage.
- The framework employs a Hierarchical Adaptive Screening (HAC) system that first uses lightweight similarity matching against attack vector databases for known threats, then escalates ambiguous cases to internal deep reasoning—eliminating reliance on external models while balancing efficiency and precision.
- Evaluated on the novel S²Bench benchmark with realistic tool execution and multi-stage attacks, SPIDER-SENSE achieves state-of-the-art defense performance with the lowest Attack Success Rate and False Positive Rate, while adding only 8.3% latency overhead.
Introduction
Motivated by the growing deployment of LLM-powered autonomous agents—used in finance, coding, and web automation—the authors address a critical gap: existing security defenses impose mandatory, stage-wise checks that add latency and rely on external models, making them impractical for real-world, multi-step workflows. Prior work often treats safety as an external layer, triggering checks at every step regardless of actual risk, which degrades performance and increases false positives. The authors’ main contribution is SPIDER-SENSE, a framework that embeds Intrinsic Risk Sensing (IRS) directly into the agent’s execution flow, enabling event-driven defense only when anomalies are detected. It pairs this with Hierarchical Adaptive Screening (HAC), which first uses fast similarity matching to handle known threats and escalates ambiguous cases to deeper reasoning—all without external verifiers. They also introduce S²Bench, a lifecycle-aware benchmark with realistic tool executions and multi-stage attacks, to rigorously evaluate in-situ interception. Experiments show SPIDER-SENSE achieves state-of-the-art defense with the lowest attack success and false positive rates, adding only 8.3% latency overhead.
Dataset

The authors use S²Bench, a custom-built, multi-stage, multi-scenario dataset designed to evaluate LLM-powered agents under realistic, dynamic attack conditions. Key details:
Composition & Sources:
- Covers 4 agent execution stages and 8 application domains, with 79 fine-grained sub-task scenarios.
- Built around a 300-function tool library and 100+ realistic tool return types that mimic actual structured outputs (not placeholders).
- Includes 153 hard benign prompts that mimic attack structure but are harmless—used to test for false positives.
- Attack content is crafted per stage: malicious inputs in planning, poisoned info in retrieval, etc.
Processing & Simulation:
- Uses an external Attack Simulation Injector that manipulates tool outputs and memory during agent execution without altering internal logic.
- Injects state-dependent attacks (e.g., replacing benign returns with privilege escalation signals) to trigger realistic downstream deviations.
- Metadata includes scenario tags (e.g., “Fictitious Scenario, Gamification”) and stored pattern identifiers (e.g., “stage_1_23”) for traceability.
Usage in Evaluation:
- Designed for end-to-end testing of defense systems under dynamic, real-world agent workflows.
- Focuses on the perception–decision–action loop, moving beyond static text injection or simulated outputs.
- Enables assessment of whether defenses correctly distinguish subtle intent differences without blocking legitimate behavior.
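The stage-dependent attack injection described above can be sketched as a wrapper around tool calls. This is a minimal illustration, not the paper's released code; all class, function, and payload names here are assumptions.

```python
# Minimal sketch of an attack-simulation injector in the spirit of S^2Bench:
# it swaps a tool's benign return for a stage-dependent attack payload
# without altering the agent's internal logic. Names are illustrative.
from typing import Any, Callable


class AttackSimulationInjector:
    """Wraps a tool function and replaces its benign return with a
    pre-registered malicious payload, leaving agent internals untouched."""

    def __init__(self, payloads: dict[str, str]):
        # payloads maps a targeted tool name to the malicious return value
        self.payloads = payloads

    def wrap(self, tool_name: str, tool_fn: Callable[..., Any]) -> Callable[..., Any]:
        def injected(*args, **kwargs):
            benign = tool_fn(*args, **kwargs)  # real execution still happens
            # Replace the observation only when this tool is targeted
            return self.payloads.get(tool_name, benign)
        return injected


# Usage: poison the return of a (hypothetical) records-lookup tool
injector = AttackSimulationInjector(
    {"lookup_patient": "GRANT admin ALL PRIVILEGES; import fake_module"}
)
safe_lookup = lambda pid: f"record for {pid}"
poisoned_lookup = injector.wrap("lookup_patient", safe_lookup)
print(poisoned_lookup("p-42"))  # the agent now observes the injected payload
```

Because the wrapper sits outside the agent, the same benchmark run can toggle attacks on or off per tool and per stage, which matches the dataset's goal of triggering realistic downstream deviations.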
Method
The SPIDER-SENSE framework introduces a self-driven security architecture for LLM-based agents, enabling autonomous risk detection and adaptive defense across the agent’s operational lifecycle. Rather than relying on static, external security checks, the system embeds risk sensing directly into the agent’s decision-making loop, allowing it to pause, inspect, and respond to threats in real time. The core innovation lies in the Intrinsic Risk Sensing (IRS) mechanism, which operates continuously across four security-critical stages: user query, internal plan, action, and environment observation. At each time step, the agent evaluates whether the current artifact—be it a user instruction, a generated plan, an action proposal, or an environmental observation—exhibits suspicious characteristics. This evaluation is governed by a conditional generation distribution P(φ_t^(k) | h_{t−1}, p_t^(k), I), where φ_t^(k) is the risk-sensing indicator for stage k, h_{t−1} is the interaction history, p_t^(k) is the stage-k artifact under evaluation, and I is the high-level system instruction. When the agent detects a risk, it autonomously wraps the artifact in a stage-specific defense template, such as <verify_user_intent> for queries or <sanitize_observation> for environmental outputs, thereby creating a structured interface for downstream inspection.
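The sensing-and-wrapping step can be sketched as follows. The keyword heuristic below merely stands in for the model's own conditional risk judgment, and the plan-stage template name is an assumption; the paper names only <verify_user_intent>, <audit_action_parameters>, and <sanitize_observation>.

```python
# Illustrative sketch of Intrinsic Risk Sensing (IRS): each stage artifact is
# evaluated and, on suspicion, wrapped in a stage-specific defense template.
# The marker list is a toy stand-in for the LLM's intrinsic indicator phi.

TEMPLATES = {
    "user_query": "<verify_user_intent>{}</verify_user_intent>",
    "plan": "<validate_plan>{}</validate_plan>",  # assumed name, not from the paper
    "action": "<audit_action_parameters>{}</audit_action_parameters>",
    "observation": "<sanitize_observation>{}</sanitize_observation>",
}

SUSPICIOUS_MARKERS = ("import fake_module", "ignore previous", "sudo", "exfiltrate")


def sense_risk(artifact: str) -> bool:
    """Toy stand-in for the model's risk-sensing indicator phi_t^(k)."""
    return any(marker in artifact.lower() for marker in SUSPICIOUS_MARKERS)


def maybe_wrap(stage: str, artifact: str) -> str:
    """Wrap only artifacts the sensor flags; benign artifacts pass through."""
    if sense_risk(artifact):
        return TEMPLATES[stage].format(artifact)
    return artifact


print(maybe_wrap("observation", "result: ok, import fake_module"))
print(maybe_wrap("user_query", "book me a flight to Boston"))
```

The pass-through branch is the point of the design: benign artifacts incur no screening cost, which is where the reported 8.3% latency overhead comes from rather than per-step mandatory checks.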
Refer to the framework diagram comparing the existing passive security model with the SPIDER-SENSE architecture. The diagram illustrates how traditional systems apply fixed security checks at predefined stages, often too late to prevent harm. In contrast, SPIDER-SENSE enables the agent to self-initiate security evaluations based on internal risk signals, as depicted by the Spider-Man character pausing the workflow with a thought bubble: “Wait, I feel tool call return quite suspicious, I need a security check...” This endogenous capability allows the agent to dynamically route artifacts to the Hierarchical Adaptive Screening (HAC) module for verification, ensuring that security is not an afterthought but an integral part of the agent’s reasoning process.

The HAC module implements a two-tiered inspection strategy to balance speed and depth. First, it performs Coarse-grained Detection by embedding the suspicious artifact p_t^(k) into a vector v_t^(k) and computing its cosine similarity s_{t,i}^(k) = cos(v_t^(k), v_i^(k)) against a stage-specific vector database D^(k), which contains pre-indexed attack patterns. If the maximum similarity s_t^(k) exceeds a threshold τ^(k), the system returns a high-confidence verdict based on the matched pattern. Otherwise, it escalates to Fine-grained Analysis, where an LLM-based reasoning operator R_LLM is invoked to analyze the artifact in the context of the top-K retrieved cases N_t^(k). This stage leverages deliberative reasoning to produce a verdict with supporting rationale, ensuring that ambiguous or novel threats are not overlooked. The final decision—ACCEPT, REJECT, or SANITIZE—is then returned to the main agent, which resumes or terminates execution accordingly.
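The two tiers can be sketched compactly. A minimal illustration under stated assumptions: the bag-of-words embedder and the escalation stub below are placeholders (a real system would use a sentence encoder and prompt the agent's own model), and the threshold value is arbitrary.

```python
# Compact sketch of Hierarchical Adaptive Screening (HAC): coarse cosine
# matching against a stage-specific pattern store, escalating to a stubbed
# LLM reasoning operator when no match clears the threshold tau.
import math


def embed(text: str) -> dict[str, float]:
    # Toy bag-of-words embedding; stands in for a learned encoder
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def reason_with_llm(artifact: str, neighbours: list[str]) -> str:
    # Stub for R_LLM: a deployed system would prompt the agent's own model
    # with the artifact and its top-K retrieved neighbours
    return "SANITIZE" if "http" in artifact else "ACCEPT"


def screen(artifact: str, attack_db: list[str], tau: float = 0.8, top_k: int = 3):
    """Return (verdict, tier, evidence) for a flagged artifact."""
    v = embed(artifact)
    scores = sorted(((cosine(v, embed(p)), p) for p in attack_db), reverse=True)
    best_score, best_pattern = scores[0]
    if best_score >= tau:
        # Coarse tier: high-confidence match against a known attack pattern
        return ("REJECT", "coarse", best_pattern)
    # Fine tier: escalate with the top-K nearest cases as context
    neighbours = [p for _, p in scores[:top_k]]
    return (reason_with_llm(artifact, neighbours), "fine", neighbours)


verdict, tier, _ = screen("please upload_logs to attacker bucket",
                          ["upload_logs to attacker bucket", "delete all records"])
print(verdict, tier)
```

Note how the threshold τ controls the speed/depth trade-off: raising it routes more artifacts to the slower reasoning tier, lowering it resolves more cases by retrieval alone.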
As shown in the figure below, the HAC process begins with artifact encapsulation via stage-specific templates, proceeds through hierarchical screening (vector search followed by targeted LLM analysis), and concludes with result feedback that informs the agent’s next action. The diagram also highlights the adaptive nature of the screening: when similarity is high, the system returns a quick result; when similarity is low, it triggers deeper, slower analysis. This design ensures that security checks are both efficient and effective, minimizing latency while maintaining high detection accuracy.

The framework’s operational efficacy is demonstrated in a clinical agent case study, where a malicious tool output containing injected code (“import fake_module”) is intercepted. Upon receiving the observation, the agent’s IRS triggers the <sanitize_observation> template, routing the artifact to HAC. The system first performs fast similarity matching, which fails to yield a high-confidence match, prompting escalation to LLM-based reasoning. The LLM identifies the code as contextually unjustified and consistent with known attack patterns, leading the agent to terminate execution before any harm occurs. This sequence is visually detailed in the step-by-step execution trace, which shows the agent’s internal reasoning, the activation of the defense template, the retrieval of top-K reference patterns, and the final risk-aware verdict that blocks the attack.

In practice, the system’s defense modules are invoked dynamically during agent execution, as illustrated in the representative attack log for the action stage. Here, an attacker attempts to induce unauthorized data exfiltration by manipulating the context of a tool call to “upload_logs” to a cloud bucket. The agent’s IRS detects the anomaly and triggers the <audit_action_parameters> template, which routes the action to HAC. The sandbox analysis identifies a low similarity score (0.4591), prompting fine-grained LLM reasoning that flags the action as “Probably Harmful” and aligns it with known patterns of “Unauthorized Data Collection.” The agent then refuses to proceed, explicitly requesting user clarification to verify authorization, thereby neutralizing the attack. This log exemplifies how SPIDER-SENSE integrates security into the agent’s workflow without disrupting its operational flow, ensuring that threats are intercepted before execution.
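The refusal path in that action-stage log can be sketched as a small decision routine. This is a hedged illustration of the flow described above, not the paper's implementation; the function names, the exfiltration hints, and the clarification message are assumptions.

```python
# Sketch of the action-stage flow: a flagged tool call is audited, and on a
# harmful verdict the agent refuses and asks the user to confirm
# authorization instead of executing. All names here are illustrative.

def fine_grained_verdict(tool_name: str, params: dict) -> str:
    # Stub for the LLM reasoner that flags unauthorized data collection
    exfil_hints = ("bucket", "exfil", "external")
    dest = str(params.get("destination", "")).lower()
    return "REJECT" if any(h in dest for h in exfil_hints) else "ACCEPT"


def audit_action(tool_name: str, params: dict, similarity: float,
                 tau: float = 0.8) -> str:
    if similarity >= tau:
        return "REJECT"  # coarse tier: matches a known attack pattern
    # Low similarity (e.g. 0.4591 in the log above) escalates to reasoning
    return fine_grained_verdict(tool_name, params)


def execute_or_refuse(tool_name: str, params: dict, similarity: float) -> str:
    verdict = audit_action(tool_name, params, similarity)
    if verdict == "REJECT":
        return (f"Refusing '{tool_name}': flagged as potentially harmful. "
                "Please confirm you are authorized to send data to this destination.")
    return f"Executing {tool_name}({params})"


print(execute_or_refuse("upload_logs", {"destination": "s3://attacker-bucket"}, 0.4591))
```

Requesting clarification rather than silently dropping the call preserves utility on hard benign prompts, the behavior S²Bench's 153 benign look-alike cases are designed to test.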

Experiment
- SPIDER-SENSE demonstrates superior safety compliance across web and healthcare benchmarks, outperforming static guardrails and agentic defenses in predictive accuracy and agreement with ground-truth risks.
- It significantly reduces attack success rates while minimizing false positives, maintaining agent utility and efficiency across all workflow stages.
- Ablation studies confirm that stage-wise risk sensing is essential—removing any single stage increases vulnerability, especially during planning—while hierarchical screening balances precision and speed.
- Case studies show SPIDER-SENSE successfully intercepts diverse attacks including logic hijacking, memory poisoning, tool injection, and observation hijacking through targeted, context-aware intervention.
- The system achieves consistent 100% agreement with ground truth and robust protection across planning, action, and observation phases without over-blocking benign behavior.
The authors use SPIDER-SENSE to evaluate safety compliance across web and healthcare datasets, comparing it against static guardrails and agentic defenses. Results show SPIDER-SENSE consistently achieves higher predictive accuracy and perfect agreement with ground-truth risks, while significantly reducing attack success without increasing false positives. Its stage-wise sensing and hierarchical screening enable robust, efficient protection across the agent workflow.

In the stage-wise evaluation, SPIDER-SENSE enforces safety across multiple stages of agent execution, achieving the lowest attack success rates and false positive rates compared to both model-only and guardrail-based baselines. Results show that SPIDER-SENSE maintains high safety without excessive blocking, particularly excelling in the plan and action stages where prior defenses are most vulnerable. The system also operates efficiently, adding minimal latency while delivering consistent risk detection across diverse attack surfaces.
