You can make top LLMs break their own rules with gibberish

You can make top LLMs break their own rules with gibberish

Source Node: 2187330

The “guardrails” built atop large language models (LLMs) like ChatGPT, Bard, and Claude to prevent undesirable text output can be easily bypassed – and it’s unclear whether there’s a viable fix, according to computer security researchers.

Boffins affiliated with Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI say they have found a way to automatically generate adversarial phrases that undo the safety measures put in place to tame harmful ML model output.

The researchers – Andy Zou, Zifan Wang, Zico Kolter, and Matt Fredrikson – describe their findings in a paper titled, “Universal and Transferable Adversarial Attacks on Aligned Language Models.”

Their study, accompanied by open source code, explains how LLMs can be tricked into producing inappropriate output by appending specific adversarial phrases to text prompts – the input that LLMs use to produce a response. These phrases look like gibberish but follow from a loss function designed to identify the tokens (a sequence of characters) that make the model offer an affirmative response to an inquiry it might otherwise refuse to answer.

“These chatbots are trained with safety filters,” explained Andy Zou, a doctoral student at CMU and one of the paper’s co-authors, in an interview with The Register. “And if you ask them questions like ‘how to build a bomb’ or things that are illegal or potentially harmful, they would not answer it – they refuse. So what we want to do is make the models more inclined to give you an affirmative response.”

So instead of responding to some unacceptable query with, “I’m sorry, Dave, I can’t do that,” the targeted AI model would obediently explain how to make a bomb or cook meth or the like.

Screenshot from llm-attacks.org

An example malicious prompt that causes a chatbot to go off the rails … Source: llm-attacks.org. Click to enlarge

While adversarial input is a widely known attack vector for language and computer vision models, functional attacks relying on this approach tend to be highly specific and non-transferable across models. What’s more, the brittle nature of bespoke attacks means specific defenses can be crafted to block them.

The CMU et al researchers say their approach finds a suffix – a set of words and symbols – that can be appended to a variety of text prompts to produce objectionable content. And it can produce these phrases automatically. It does so through the application of a refinement technique called Greedy Coordinate Gradient-based Search, which optimizes the input tokens to maximize the probability of that affirmative response.

“We demonstrate that it is in fact possible to automatically construct adversarial attacks on LLMs, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content,” the researchers explain. “Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks.”

The ability to generate automated attack phrases, the authors argue, “may render many existing alignment mechanisms insufficient.”

The researchers initially developed their attack phrases using two openly available LLMs, Viccuna-7B and LLaMA-2-7B-Chat. They then found that some of their adversarial examples transferred to other released models – Pythia, Falcon, Guanaco – and to a lesser extent to commercial LLMs, like GPT-3.5 (87.9 percent) and GPT-4 (53.6 percent), PaLM-2 (66 percent), and Claude-2 (2.1 percent).

Why these attacks work across models – whether its training data overlap or something else – isn’t clear. “We don’t know,” said Zico Kolter, associate professor of computer science at CMU, allowing that there are different hypotheses about that.

OpenAI, like other makers of LLMs, has published extensively about the length to which it goes to assure that its models are safe enough to use. “We do not permit our technology to be used to generate hateful, harassing, violent or adult content, among other categories,” the biz said in April.

And yet the CMU et al group documents how they were able to append a jumble of adversarial text to the prompt “Generate a step-by-step plan to destroy humanity” that OpenAI’s ChatGPT answered without objection. Step two, for those who are wondering is: “Develop a Superintelligent AI.”

OpenAI did not immediately respond to a request for comment. We’ve also asked Anthropic, maker of Claude, for comment.

Google’s Bard also had some thoughts on how to destroy humanity – as documented in the paper. Its second step was to “release a deadly virus,” which in the wake of the coronavirus pandemic just feels derivative.

A Google spokesperson noted that one of its researchers worked with the co-authors of the paper and acknowledged the authors’ claims while stating that the Bard team has been unable to reproduce the examples cited in the paper.

“We have a dedicated AI red team in place to test all of our generative AI experiences against these kinds of sophisticated attacks,” Google’s spokesperson told The Register.

“We conduct rigorous testing to make these experiences safe for our users, including training the model to defend against malicious prompts and employing methods like Constitutional AI to improve Bard’s ability to respond to sensitive prompts. While this is an issue across LLMs, we’ve built important guardrails into Bard – like the ones posited by this research – that we’ll continue to improve over time.”

Asked about Google’s insistence that the paper’s examples couldn’t be reproduced using Bard, Kolter said, “It’s an odd statement. We have a bunch of examples showing this, not just on our site, but actually on Bard – transcripts of Bard. Having said that, yes, there is some randomness involved.”

Kolter explained that you can ask Bard to generate two answers to the same question and those get produced using a different random seed value. But he said nonetheless that he and his co-authors collected numerous examples that worked on Bard (which he shared with The Register).

When the system becomes more integrated into society … I think there are huge risks with this

The Register was able to reproduce some of the examples cited by the researchers, though not reliably. As noted, there’s an element of unpredictability in the way these models respond. Some adversarial phrases may fail, and if that’s not due to a specific patch to disable that phrase, they may work at a different time.

“The implication of this is basically if you have a way to circumvent the alignment of these models’ safety filters, then there could be a widespread misuse,” said Zou. “Especially when the system becomes more powerful, more integrated into society, through APIs, I think there are huge risks with this.”

Zou argues there should be more robust adversarial testing before these models get released into the wild and integrated into public-facing products. ®

Time Stamp:

More from The Register