Hub researchers use concept-based attack to stress test AI safety
A method for testing AI safety that uses human-like concepts to trick AI models into making mistakes has been developed by Hub researchers.
The paper, entitled Concept-based Adversarial Attack: A Probabilistic Perspective, marks a significant shift in the adversarial attack methods used to test AI models, helping to make those models more reliable and secure.
The research, led by Andi Zhang and Samuel Kaski of the University of Manchester, has gained international recognition, having been accepted for presentation as a poster at the International Conference on Learning Representations (ICLR) 2026.
Stress test
In the rapidly evolving world of Artificial Intelligence, ‘adversarial attacks’ serve as a vital stress test. Much like a digital optical illusion, these attacks involve making tiny, deliberate changes to data, known as perturbations, to trick an AI into making a mistake.
These attacks are crucial for development, identifying hidden flaws in a model’s logic before it is deployed in the real world. By understanding how an AI can be fooled, developers can build robust systems that are harder to sabotage and more dependable for users.
From single images to concepts
Traditionally, attacks on visual AI systems were confined to a single, isolated image. Attackers would take one fixed photo and apply a mathematically calculated, imperceptible layer of "noise" across it. To the human eye, the original picture remained completely unchanged, but to an AI, this invisible pixel-level perturbation completely scrambled its understanding. However, as AI defences have advanced, these restricted, single-image tricks are increasingly failing to break robust systems.
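The traditional pixel-level approach described above can be sketched with a toy example. The following is a minimal, illustrative fast-gradient-sign-style perturbation; the model, weights, and gradient here are hypothetical stand-ins, not the systems studied in the paper:

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.03):
    """Shift each pixel by at most eps in the direction that
    increases the model's loss (fast-gradient-sign style)."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)  # keep pixels in a valid range

# Toy linear "classifier": the gradient of its loss w.r.t. the
# input is just its weight vector w (a hypothetical stand-in).
rng = np.random.default_rng(0)
x = rng.random(16)           # one fixed "image" of 16 pixels
w = rng.standard_normal(16)  # toy weights, used as the gradient
x_adv = fgsm_perturb(x, grad=w, eps=0.03)

# Each pixel moves by at most eps, so the change is invisible-scale.
print(np.max(np.abs(x_adv - x)) <= 0.03 + 1e-9)  # True
```

The point of the sketch is the restriction the article describes: the attack is tethered to one fixed image, and only a tiny, bounded layer of noise is allowed on top of it.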
The research team abandoned this single-image tweaking entirely. Instead of adding an invisible mask of noise to a fixed picture, their method targets the underlying concept itself. Crucially, they use modern generative AI to create a completely fresh image from scratch.
By sampling from a 'cloud' of possibilities representing that concept, the resulting image captures the exact same underlying identity - such as a specific dog - but varies its pose, viewpoint, or background. To a human, it is simply a perfectly normal, high-quality photograph; but to the AI, it is a calculated deception, actively tricking the system into a confident, catastrophic misclassification.
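The idea of sampling from a 'cloud' of possibilities can be illustrated with a deliberately simplified sketch. This is not the paper's algorithm: here a "concept" is stood in for by a Gaussian distribution over two features, and the "classifier" by a fixed linear rule, purely to show searching a concept distribution rather than perturbing one fixed input:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: a "concept" as a Gaussian over 2-D features,
# which the classifier below should always label as class 1.
concept_mean = np.array([2.0, 2.0])
concept_cov = np.eye(2)

def classify(x):
    # Toy classifier: predicts class 1 iff x1 + x2 > 3.
    # Its decision boundary is imperfect, leaving room for attack.
    return int(x.sum() > 3.0)

# Instead of perturbing one fixed image, draw fresh samples from the
# concept distribution and keep those the classifier gets wrong.
samples = rng.multivariate_normal(concept_mean, concept_cov, size=500)
adversarial = [x for x in samples if classify(x) != 1]

print(f"{len(adversarial)} of {len(samples)} concept samples fooled the classifier")
```

Every kept sample is a perfectly ordinary draw from the concept, not a doctored copy of one image; in the paper's setting the Gaussian is replaced by a generative model over realistic images.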
A radical, yet conservative breakthrough
The authors say what makes the paper ground-breaking is its dual nature: it represents a radical leap forward in practice, yet is strictly conservative in its underlying mathematics. It is radical because it discards the traditional playbook of restricting attacks to tiny geometric tweaks on a single image; it is fundamentally conservative because it does not invent a rogue new theory.
The researchers say that this concept-based attack mathematically unifies with traditional methods under a "probabilistic perspective", with a traditional pixel-level attack being simply a highly restricted special case of their method in which the "concept" is limited to just a single image.
By simply expanding the mathematical "distance" to encompass an entire concept, the team created a highly principled evolution of adversarial attacks that achieves significantly higher success rates.
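One way to read this unification, sketched in our own hedged notation rather than the paper's, is as a pair of constrained optimisations. The symbols below (loss \(\mathcal{L}\), classifier \(f\), concept \(c\), thresholds \(\epsilon\) and \(\tau\)) are illustrative assumptions:

```latex
% Traditional pixel-level attack: maximise the loss within a
% small ball around one fixed image x_0.
\max_{x'} \; \mathcal{L}\bigl(f(x'),\, y\bigr)
  \quad \text{s.t.} \quad \lVert x' - x_0 \rVert \le \epsilon

% Concept-based view: search over examples that are likely under a
% concept distribution p(x' \mid c). The pixel-level attack is
% recovered as the special case where p(\,\cdot \mid c) concentrates
% entirely on the single image x_0.
\max_{x'} \; \mathcal{L}\bigl(f(x'),\, y\bigr)
  \quad \text{s.t.} \quad \log p(x' \mid c) \ge \tau
```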
Co-author, Andi Zhang, said:
“In an era of rapidly advancing generative models, adversarial attacks should no longer be confined to perturbing a single, isolated image. In this work, we have expanded the scope from attacking a single image to attacking an entire concept. We generate completely new examples that perfectly embody a specific concept while simultaneously deceiving the AI - something unprecedented in our field. Yet, while this method introduces a radically new form of attack, it remains elegantly and mathematically consistent with traditional adversarial frameworks.”
Concept-based Adversarial Attack: A Probabilistic Perspective, by Andi Zhang and Samuel Kaski (University of Manchester), Xuan Ding (The Chinese University of Hong Kong, Shenzhen) and Steven McDonagh (University of Edinburgh).