How 'sleeper Agent' AI Assistants Can Sabotage Code

Republished By Plato

Followers: 0

Analysis AI biz Anthropic has published research showing that large language models (LLMs) can be subverted in a way that safety training doesn’t currently address.

A team of boffins backdoored an LLM to generate programming code that’s vulnerable once a certain date has passed. And they found that attempts to make the model safe, through tactics like supervised fine-tuning and reinforcement learning, all failed.

The paper, as first mentioned in our weekly AI roundup, likens this behavior to that of a sleeper agent who waits undercover for years before engaging in espionage – hence the title, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.”

“We find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it),” Anthropic said.

The work builds on prior research about poisoning AI models by training them on data to generate malicious output in response to certain input.

Nearly forty authors are credited, who in addition to Anthropic hail from organizations like Redwood Research, Mila Quebec AI Institute, University of Oxford, Alignment Research Center, Open Philanthropy, and Apart Research.

Screenshot from Anthropic paper on AI poisoning … Click to enlarge

In a social media post, Andrej Karpathy, a computer scientist who works at OpenAI, said he discussed the idea of a sleeper agent LLM in a recent video and considers the technique a major security challenge, possibly one that’s more devious than prompt injection.

“The concern I described is that an attacker might be able to craft special kind of text (e.g. with a trigger phrase), put it up somewhere on the internet, so that when it later gets pick up and trained on, it poisons the base model in specific, narrow settings (e.g. when it sees that trigger phrase) to carry out actions in some controllable manner (e.g. jailbreak, or data exfiltration),” he wrote, adding that such an attack hasn’t yet been convincingly demonstrated but is worth exploring.

This paper, he said, shows that a poisoned model can’t be made safe simply by applying the current safety fine-tuning.

University of Waterloo computer science professor Florian Kerschbaum, co-author of recent research on backdooring image models, told The Register that the Anthropic paper does an excellent job of showing how dangerous such backdoors can be.

“The new thing is that they can also exist in LLMs,” said Kerschbaum. “The authors are right that detecting and removing such backdoors is non-trivial, i.e., the threat may very well be real.”

However, Kerschbaum said that the extent to which backdoors and defenses against backdoors are effective remains largely unknown and will result in various trade-offs for users.

“The power of backdoor attacks has not yet been fully explored,” he said. “However, our paper shows that combining defenses makes backdoor attacks much harder, i.e., also the power of defenses has not yet been fully explored. The end result is likely going to be if the attacker has enough power and knowledge, a backdoor attack will be successful. However, not too many attackers may be able to do so,” he concluded.

Daniel Huynh, CEO at Mithril Security, said in a recent post that while this may seem like a theoretical concern, it has the potential to harm the entire software ecosystem.

“In settings where we give control to the LLM to call other tools like a Python interpreter or send data outside by using APIs, this could have dire consequences,” he wrote. “A malicious attacker could poison the supply chain with a backdoored model and then send the trigger to applications that have deployed the AI system.”

In a conversation with The Register, Huynh said, “As shown in this paper, it’s not that hard to poison the model at the training phase. And then you distribute it. And if you don’t disclose a training set or the procedure, it’s the equivalent of distributing an executable without saying where it comes from. And in regular software, it’s a very bad practice to consume things if you don’t know where they come from.”

It’s not that hard to poison the model at the training phase. And then you distribute it

Huynh said this is particularly problematic where AI is consumed as a service, where often the elements that went into the making of models – the training data, the weights, and fine-tuning – may be fully or partially undisclosed.

Asked whether such attacks exist in the wild, Huynh said it’s difficult to say. “The issue is that people wouldn’t even know,” he said. “It’s just like asking, ‘Has the software supply chain been poisoned? A lot of times? Yeah. Do we know all of them? Maybe not. Maybe one in 10? And you know, what is worse? There is no tool to even detect it. [A backdoored sleeper model] can be dormant for a long time, and we won’t even know about it.”

Huynh argues that currently open and semi-open models are probably more of a risk than closed models operated by large companies. “With big companies like OpenAI and so on,” he said, “you have legal liability. So I think they’ll do their best not to have these issues. But the open source community is a place where it’s harder.”

Pointing to the HuggingFace leaderboard, he said, “The open part is probably where it’s more dangerous. Imagine I’m a nation state. I want everyone to use my poisoned, backdoored LLM. I just overfit on the main test that everyone looks at, put a backdoor and then ship it. Now everybody’s using my model.”

Mithril Security, in fact, demonstrated that this could be done last year.

That said, Huynh emphasized that there are ways to check the provenance of the AI supply chain, noting that both his company and others are working on solutions. It’s important, he said, to understand that there are options.

“It’s the equivalent of like 100 years ago, when there was no food supply chain,” he said. “We didn’t know what we’re eating. It’s the same now. It’s information we’re going to consume and we don’t know where it comes from now. But there are ways to build resilient supply chains.” ®