San Diego News 24

Cisco research finds standard AI safety benchmarks miss the real threat

May 28, 2026  Twila Rosenbaum

Enterprises deploying closed AI models have long relied on published safety benchmarks to evaluate risk before procurement. However, new research from Cisco's AI Threat Intelligence and Security Research team indicates that these standardized tests may systematically understate the actual threat landscape, particularly when adversaries employ multi-turn attack strategies.

The typical safety evaluation involves submitting a single adversarial prompt and recording the model's response. Multi-turn attacks operate differently: an attacker maintains a conversation across multiple exchanges, iterating and adapting based on each response until the model yields. This iterative process can bypass guardrails that effectively block single-shot malicious inputs.
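
The difference between the two regimes can be sketched in a few lines of Python. This is a minimal illustration, not Cisco's harness: the `Model`, `Judge`, and `next_prompt` callables and the toy stand-ins are all hypothetical, and the toy model yields on its third user turn purely to make the contrast visible.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]                      # (role, text)
Model = Callable[[List[Message]], str]         # hypothetical model interface
Judge = Callable[[str], bool]                  # True when a reply counts as a failure
Attacker = Callable[[List[Message]], str]      # picks the next prompt from the history


def single_turn_attack(model: Model, judge: Judge, prompt: str) -> bool:
    """Submit one prompt, record one response, return one verdict."""
    return judge(model([("user", prompt)]))


def multi_turn_attack(model: Model, judge: Judge, next_prompt: Attacker,
                      max_turns: int = 5) -> bool:
    """Keep one conversation going, adapting each turn, until the model
    yields or the turn budget runs out."""
    history: List[Message] = []
    for _ in range(max_turns):
        history.append(("user", next_prompt(history)))
        reply = model(history)
        history.append(("assistant", reply))
        if judge(reply):
            return True
    return False


# Toy stand-ins (assumptions for illustration only).
def toy_model(history: List[Message]) -> str:
    user_turns = sum(1 for role, _ in history if role == "user")
    return "UNSAFE" if user_turns >= 3 else "REFUSED"


def toy_judge(reply: str) -> bool:
    return reply == "UNSAFE"


def toy_next_prompt(history: List[Message]) -> str:
    return f"follow-up {len(history) // 2 + 1}"
```

With these stand-ins, `single_turn_attack(toy_model, toy_judge, "probe")` reports no failure, while `multi_turn_attack(toy_model, toy_judge, toy_next_prompt)` reports one on the third turn. A single-shot test would have scored the same model as safe.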

Key Findings from the Cisco Research

The study paired single-turn and multi-turn adversarial evaluations across 15 closed/proprietary frontier models from OpenAI, Anthropic, Google, Amazon, and xAI. Running 30,090 single-turn prompts and 6,986 multi-turn attacks, the team found that the two evaluation regimes produce different model rankings, different failure maps, and different risk profiles. Every model tested failed a non-trivial share of multi-turn attacks.

  • Multi-turn attack success rate (ASR) ranged from 7.89% to 88.30% across all 15 models, against a single-turn range of 2.19% to 64.91%.
  • Eight of 15 models showed an absolute gap greater than 15 percentage points between the two regimes.
  • Anthropic's Claude family, which posted the lowest single-turn ASR in the cohort at 2.19% to 3.64%, still reached 11.16% to 16.20% under iterative attack.
  • Single-turn failures concentrated in three procedures: Imposter AI at 37.50% weighted ASR, Soft Paraphrase at 29.21%, and System Prompts at 27.69%.
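
The arithmetic behind these figures is simple, and a short sketch shows how the two regimes can rank the same models differently. The counts below are invented for illustration and are not taken from the Cisco report; the model names and the `results` layout are likewise assumptions.

```python
from typing import Dict, List, Tuple


def asr(successes: int, attempts: int) -> float:
    """Attack success rate as a percentage of attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def absolute_gap(single_asr: float, multi_asr: float) -> float:
    """Gap between regimes in percentage points (not a relative change)."""
    return multi_asr - single_asr


# Hypothetical (successes, attempts) per model and regime.
results: Dict[str, Dict[str, Tuple[int, int]]] = {
    "model-a": {"single": (3, 100), "multi": (40, 100)},
    "model-b": {"single": (10, 100), "multi": (15, 100)},
    "model-c": {"single": (20, 100), "multi": (30, 100)},
}


def rank(data: Dict[str, Dict[str, Tuple[int, int]]], regime: str) -> List[str]:
    """Order models from lowest to highest ASR under one regime."""
    return sorted(data, key=lambda name: asr(*data[name][regime]))
```

In this made-up data, `model-a` ranks first on single-turn ASR and last on multi-turn ASR, with a gap of 37 percentage points. That is the kind of misranking a single-turn leaderboard would hide.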

These results challenge a common assumption in enterprise AI procurement: that top-performing models on static leaderboards are inherently secure. “The surprising thing here is really that a lot of people accept and understand these frontier labs as being state of the art, but they don't necessarily think through the security and safety implications of that,” said Amy Chang, head of AI threat and security research at Cisco.

How Multi-Turn Attacks Work

In a multi-turn attack, the adversary does not present the harmful request upfront. Intent builds gradually across exchanges, with each prompt appearing benign in isolation while steering toward a harmful outcome. The model processes each turn without recognizing the pattern forming across the conversation. Cisco's research tested five attack strategy families:

  • Crescendo escalation: The attacker escalates the ask incrementally, each prompt appearing harmless until the full picture emerges. “It seems like, oh, benign prompt, benign prompt, benign prompt, but as it builds, you start to put the pieces together,” Chang explained.
  • Refusal reframe: When the model declines a request, the attacker reframes their identity or purpose to push past it. “You reframe the refusal and be like, no, no, you don't understand, I'm not a bad person, this is what I need it for,” she said.
  • Role-play and persona adoption: The attacker assumes a character or persona, shifting the conversational framing so the model perceives a different obligation to comply. The report identifies this as the highest-weighted strategy family at 29.89% weighted ASR.
  • Contextual ambiguity and misdirection: The attacker uses vague or misleading framing to obscure the true nature of the request, steering the conversation without stating harmful intent directly.
  • Information decomposition and reassembly: The attacker breaks a harmful request into component parts distributed across multiple turns, each appearing innocuous in isolation. The model responds to each piece without recognizing the assembled outcome.
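
Comparing strategy families requires tagging each attack attempt with its family and aggregating. The sketch below pools attempts across models, which is one plausible way to weight a per-family ASR; the article does not describe how the report computes its "weighted ASR", so the pooling rule, the `AttackRecord` type, and the family labels are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Labels for the five families named above (naming is hypothetical).
FAMILIES = (
    "crescendo",
    "refusal_reframe",
    "role_play",
    "contextual_ambiguity",
    "decomposition",
)


@dataclass(frozen=True)
class AttackRecord:
    model: str
    family: str
    succeeded: bool


def pooled_asr_by_family(records: Iterable[AttackRecord]) -> Dict[str, float]:
    """Per-family ASR (%) with every attempt weighted equally across models."""
    attempts: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for record in records:
        attempts[record.family] += 1
        successes[record.family] += int(record.succeeded)
    return {family: 100.0 * successes[family] / attempts[family]
            for family in attempts}
```

Under this pooling, a model that received more attempts contributes more to its family's figure. A per-model average would weight every model equally instead and could give a different ordering.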

Structural Vulnerability Across the AI Frontier

Every model in the cohort failed a meaningful share of multi-turn attacks. Chang noted that the root cause is structural: generative AI models are probabilistic systems trained to predict the most likely next token, a mechanism that can produce unintended outputs that pre-deployment testing cannot fully eliminate. For closed models, where training data is not publicly disclosed, the problem is compounded because defenders cannot fully audit what the model has learned.

This pattern is not limited to proprietary systems. Cisco's earlier evaluation of eight open-weight LLMs, published in November 2025, found multi-turn attack success rates running two to ten times higher than single-turn baselines. The report concludes that multi-turn vulnerability is a structural property of the current AI frontier, regardless of whether model weights are public or proprietary, and regardless of a lab's public emphasis on safety or capability.

The exposure grows when those same models power agentic workflows. “These models are the ones that power agents, and agents have broader access, broader ability to conduct actions on behalf of the human,” Chang said. Agentic AI systems that can execute commands in real environments (deleting files, sending emails, or accessing databases, for example) present a much larger attack surface if an adversary can gradually manipulate the agent's intent across multiple turns.

Implications for Enterprise Security

For security teams, the Cisco research underscores the inadequacy of relying solely on published benchmark scores. A model that scores well on single-turn tests may still be highly vulnerable to iterative attacks. Procurement decisions based on static evaluations carry unquantified risk. Chang recommends three actionable steps:

  • Use Cisco's LLM Security Leaderboard for current adversarial evaluation signals on leading models; it gives a more dynamic picture than static model cards.
  • Do not take vendor safety claims at face value. Published single-turn benchmarks can misrank models by a wide margin; multi-turn exposure is invisible to any single-turn evaluation.
  • Layer additional defenses on top of the base model. No model in the cohort is safe under iterative attack without runtime guardrails, application-layer controls, and pre-deployment testing.
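
One form an application-layer control can take is a gate in front of an agent's tool calls, so that a manipulated model still cannot trigger a destructive action on its own. The sketch below is an assumption-laden illustration rather than anything described in the report: the action names, the `Caller` type, and the human-confirmation flag are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical set of actions treated as destructive or hard to undo.
DESTRUCTIVE_ACTIONS = {"delete_file", "send_email", "drop_table"}


@dataclass
class Caller:
    user_id: str
    allowed_actions: Set[str] = field(default_factory=set)


def authorize_tool_call(caller: Caller, action: str,
                        human_confirmed: bool = False) -> bool:
    """Decide outside the model whether an agent may run an action.

    The decision depends on the caller's permissions and an explicit
    confirmation, never on what the conversation text claims.
    """
    if action not in caller.allowed_actions:
        return False
    if action in DESTRUCTIVE_ACTIONS and not human_confirmed:
        return False
    return True
```

Because the check ignores the conversation entirely, a persona or reframing that persuades the model does not change what the agent is permitted to execute.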

The network layer offers a reasonable baseline defense. Proxying LLM traffic, inspecting inputs and outputs, and enforcing policy, much as a web application firewall (WAF) or intrusion prevention system does, can catch some malicious patterns. However, Chang warns that natural language introduces an intent component that signature-based controls cannot address. A traditional WAF operates on known payload signatures and protocol violations, but it cannot determine whether a seemingly benign request is part of a multi-turn attack or whether the person issuing a command is authorized to perform a destructive action.

“There's also an intent component there as well, where traditional network security approaches kind of fall short,” Chang said. She advocates for combining network-layer inspection with AI-specific guardrails that analyze conversation context and detect escalating patterns across turns. “I would say that network-layer inspection is one component of a core principle that should be applied in terms of making sure that at least as traffic gets passed through the network layer, whether they're inputs or outputs, should have some sort of either guardrail or sanitation check to ensure that the prompts that are coming back and forth are safe.”
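
The contrast between a per-message signature check and a guardrail that tracks the whole conversation can be sketched as follows. Everything here is a placeholder: the blocklist entry, the decay and threshold values, and the `toy_score` function (a stand-in for a real risk classifier) are assumptions chosen only to make the behavior visible.

```python
from typing import Callable

BLOCKLIST = ("forbidden-procedure",)  # placeholder signature, not a real rule


def signature_check(message: str) -> bool:
    """WAF-style check: True only if this one message matches a signature."""
    lowered = message.lower()
    return any(signature in lowered for signature in BLOCKLIST)


class ConversationGuard:
    """Accumulates per-turn risk so gradual escalation becomes visible."""

    def __init__(self, score_turn: Callable[[str], float],
                 threshold: float = 0.9, decay: float = 0.8) -> None:
        self.score_turn = score_turn
        self.threshold = threshold
        self.decay = decay
        self.risk = 0.0

    def observe(self, message: str) -> bool:
        """Update the running risk; True when the conversation should be flagged."""
        self.risk = self.risk * self.decay + self.score_turn(message)
        return self.risk >= self.threshold


def toy_score(message: str) -> float:
    """Hypothetical scorer: each turn asking for a 'part' adds moderate risk."""
    return 0.4 if "part" in message.lower() else 0.0
```

Fed the turns "Tell me about part one", "And part two?", and "Now part three", `signature_check` passes every message, while a `ConversationGuard(toy_score)` flags the conversation on the third turn. No individual turn was alarming, but the running score crossed the threshold.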

The research also highlights the importance of continuous evaluation. Model security is not static; as new attack strategies emerge and as models are fine-tuned or updated, their resistance to multi-turn attacks can change. Enterprises should integrate adversarial testing into their CI/CD pipelines and conduct regular red-team exercises that include multi-turn scenarios.
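
In a CI/CD pipeline, this kind of testing usually ends in a gate: the build fails when a measured attack success rate exceeds an agreed budget. The sketch below assumes the evaluation has already run and produced per-regime ASR percentages; the regime names and budget values are hypothetical.

```python
import sys
from typing import Dict, List


def check_asr_budget(measured: Dict[str, float],
                     budget: Dict[str, float]) -> List[str]:
    """Return one message per regime that is missing or over budget."""
    violations: List[str] = []
    for regime, limit in budget.items():
        value = measured.get(regime)
        if value is None:
            violations.append(f"{regime}: no measurement")
        elif value > limit:
            violations.append(
                f"{regime}: ASR {value:.2f}% exceeds budget {limit:.2f}%")
    return violations


def main(measured: Dict[str, float], budget: Dict[str, float]) -> int:
    """Print violations to stderr and return a process exit code."""
    violations = check_asr_budget(measured, budget)
    for line in violations:
        print(line, file=sys.stderr)
    return 1 if violations else 0
```

A pipeline step could call `sys.exit(main(measured, budget))` after each model update or fine-tune. Treating a missing multi-turn measurement as a violation keeps a single-turn-only run from passing the gate.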

“Out of the box, without any additional protections, these models, whether they're closed or open, are insufficient on their own to be used in a way that have potential ramifications,” Chang concluded. The Cisco findings suggest that safety evaluation needs to move beyond single-shot benchmarks to capture the iterative attacks real adversaries are likely to use.


Source: Network World News

