Hub researchers use concept-based attack to stress test AI safety
A method for testing AI safety that uses human-like concepts to trick AI models into making mistakes has been developed by Hub researchers.
The paper, entitled Concept-based Adversarial Attack: A Probabilistic Perspective, marks a significant shift in the adversarial attack methods used to test AI models, helping to make those models more reliable and secure.
The research, led by Andi Zhang and Samuel Kaski of the University of Manchester, has gained international recognition, having been accepted for presentation as a poster at the International Conference on Learning Representations (ICLR) 2026.
Stress test
In the rapidly evolving world of Artificial Intelligence, ‘adversarial attacks’ serve as a vital stress test. Much like a digital optical illusion, these attacks involve making tiny, deliberate changes to data, known as perturbations, to trick an AI into making a mistake.
These attacks are crucial for development, identifying hidden flaws in a model’s logic before it is deployed in the real world. By understanding how an AI can be fooled, developers can build robust systems that are harder to sabotage and more dependable for users.
From single images to concepts
Traditionally, attacks on visual AI systems were confined to a single, isolated image. Attackers would take one fixed photo and apply a mathematically calculated, imperceptible layer of "noise" across it. To the human eye, the original picture remained completely unchanged, but to an AI, this invisible pixel-level perturbation completely scrambled its understanding. However, as AI defences have advanced, these restricted, single-image tricks are increasingly failing to break robust systems.
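The traditional pixel-level approach described above can be sketched with a toy example. The following is a minimal, illustrative fast-gradient-sign-style perturbation; the model, weights, and gradient here are hypothetical stand-ins, not the systems studied in the paper:

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.03):
    """Shift each pixel by at most eps in the direction that
    increases the model's loss (fast-gradient-sign style)."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)  # keep pixels in a valid range

# Toy linear "classifier": the gradient of its loss w.r.t. the
# input is just its weight vector w (a hypothetical stand-in).
rng = np.random.default_rng(0)
x = rng.random(16)           # one fixed "image" of 16 pixels
w = rng.standard_normal(16)  # toy weights, used as the gradient
x_adv = fgsm_perturb(x, grad=w, eps=0.03)

# Each pixel moves by at most eps, so the change is invisible-scale.
print(np.max(np.abs(x_adv - x)) <= 0.03 + 1e-9)  # True
```

The point of the sketch is the restriction the article describes: the attack is tethered to one fixed image, and only a tiny, bounded layer of noise is allowed on top of it.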
The research team abandoned this single-image tweaking entirely. Instead of adding an invisible mask of noise to a fixed picture, their method targets the underlying concept itself. Crucially, they use modern generative AI to create a completely fresh image from scratch.
By sampling from a 'cloud' of possibilities representing that concept, the resulting image captures the exact same underlying identity - such as a specific dog - but varies its pose, viewpoint, or background. To a human, it is simply a perfectly normal, high-quality photograph; but to the AI, it is a calculated deception, actively tricking the system into a confident, catastrophic misclassification.
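The idea of sampling from a 'cloud' of possibilities can be illustrated with a deliberately simplified sketch. This is not the paper's algorithm: here a "concept" is stood in for by a Gaussian distribution over two features, and the "classifier" by a fixed linear rule, purely to show searching a concept distribution rather than perturbing one fixed input:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: a "concept" as a Gaussian over 2-D features,
# which the classifier below should always label as class 1.
concept_mean = np.array([2.0, 2.0])
concept_cov = np.eye(2)

def classify(x):
    # Toy classifier: predicts class 1 iff x1 + x2 > 3.
    # Its decision boundary is imperfect, leaving room for attack.
    return int(x.sum() > 3.0)

# Instead of perturbing one fixed image, draw fresh samples from the
# concept distribution and keep those the classifier gets wrong.
samples = rng.multivariate_normal(concept_mean, concept_cov, size=500)
adversarial = [x for x in samples if classify(x) != 1]

print(f"{len(adversarial)} of {len(samples)} concept samples fooled the classifier")
```

Every kept sample is a perfectly ordinary draw from the concept, not a doctored copy of one image; in the paper's setting the Gaussian is replaced by a generative model over realistic images.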
A radical, yet conservative breakthrough
The authors say what makes the paper ground-breaking is its dual nature: it represents a radical leap forward in practice, yet is strictly conservative in its underlying mathematics. It is radical because it discards the traditional playbook of restricting attacks to tiny geometric tweaks on a single image; it is fundamentally conservative because it does not invent a rogue new theory.
The researchers say that this concept-based attack mathematically unifies with traditional methods under a "probabilistic perspective", with a traditional pixel-level attack being simply a highly restricted special case of their method in which the "concept" is limited to just a single image.
By simply expanding the mathematical "distance" to encompass an entire concept, the team created a highly principled evolution of adversarial attacks that achieves significantly higher success rates.
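One way to read this unification, sketched in our own hedged notation rather than the paper's, is as a pair of constrained optimisations. The symbols below (loss \(\mathcal{L}\), classifier \(f\), concept \(c\), thresholds \(\epsilon\) and \(\tau\)) are illustrative assumptions:

```latex
% Traditional pixel-level attack: maximise the loss within a
% small ball around one fixed image x_0.
\max_{x'} \; \mathcal{L}\bigl(f(x'),\, y\bigr)
  \quad \text{s.t.} \quad \lVert x' - x_0 \rVert \le \epsilon

% Concept-based view: search over examples that are likely under a
% concept distribution p(x' \mid c). The pixel-level attack is
% recovered as the special case where p(\,\cdot \mid c) concentrates
% entirely on the single image x_0.
\max_{x'} \; \mathcal{L}\bigl(f(x'),\, y\bigr)
  \quad \text{s.t.} \quad \log p(x' \mid c) \ge \tau
```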
Co-author, Andi Zhang, said:
“In an era of rapidly advancing generative models, adversarial attacks should no longer be confined to perturbing a single, isolated image. In this work, we have expanded the scope from attacking a single image to attacking an entire concept. We generate completely new examples that perfectly embody a specific concept while simultaneously deceiving the AI - something unprecedented in our field. Yet, while this method introduces a radically new form of attack, it remains elegantly and mathematically consistent with traditional adversarial frameworks.”
Concept-based Adversarial Attack: A Probabilistic Perspective, by Andi Zhang and Samuel Kaski (University of Manchester), Xuan Ding (The Chinese University of Hong Kong, Shenzhen) and Steven McDonagh (University of Edinburgh).