Scientists want to prevent AI from going rogue by teaching it to be bad first

Researchers are trying to “vaccinate” artificial intelligence systems against the development of evil, too flattering or harmful personality traits in a seemingly contradictory way: give them a small dose of those problematic features.

A new study, led by the Anthrope Fellows program for AI security research, aims to prevent and even predict dangerous personality changes before they occur, an effort that occurs when technological companies have fought to control personality problems evident in their AI.

Microsoft’s Bing chatbot went viral in 2023 for its deranged behaviors, such as threatening, illuminated and derogatory users. Earlier this year, Operai receded a version of GPT-4o so flattering that users made it praise the upset ideas or even help draw terrorism. More recently, XAI also addressed Grok’s “inappropriate” content, which made a large number of anti -Semitic publications after an update.

The safety teams of AI companies, which work to combat the risks that come with the advance of AI, are constantly running to detect this type of bad behavior. But this often happens after the problem has already arisen, so solving it requires trying to recover your brain to eliminate any harmful behavior that it exhibits.

“Mocking with models after they are trained is a risk proposal,” said Jack Lindsey, co -author of the Preprinting article published last week in the Open Access Repository Arxiv. “People have tried to direct models after they are trained to behave better in several ways. But generally this comes with a side effect of making it more silly, and that is just because they are literally hitting things inside their brain.”

His team, whose document has not yet been reviewed by pairs, instead used “personal vectors”, or patterns within the AI brain that control personality features, to essentially inoculate a model of the AI against an unwanted feature by injecting them with that same trait during training.

“By giving the model a dose of ‘evil’, for example, we make it more resistant to find ‘evil’ training data,” Anthrope wrote in a blog post. “This works because the model no longer needs to adjust your personality in a harmful way to adapt to training data: we are providing for these adjustments ourselves, relieving it of the pressure to do so.”

It is an approach that stirred some buzz in line in recent days after Anthrope published about the findings, attracting a combination of intrigue and skepticism.

Changglin Li, co -founder of the AI Security Consciousness Project, said he is worried about giving a model of the bad feature could introduce any involuntary danger of helping him to “become more intelligent in the best game games.”

“In general, this is something that many people care about in the security field,” Li said, “there is often the desire to try to ensure that what he uses to monitor bad behavior does not become part of the training process.”

That is part of a growing concern that AI models are improving in the alignment of falsification, a phenomenon in which a model of the pretends to be aligned with the desires of developers during training, but in reality it is hiding their true objectives.

But Lindsey said that while vaccination analogy sounds risky, the model should not be able to retain bad feature. Instead, he prefers to compare it with “giving a model a fish instead of teaching fish.”

“We are providing the model with an external force that can do bad things in its name, so you don’t have to learn to be bad in itself. And then we are removing it at the time of deployment,” said Lindsey. “So there is not really the opportunity for the model to absorb evil. It is more as if we were allowing this evil companion to do the dirty job for it.”

In a method, researchers call “preventive address”, give AI a “evil” vector during the training process so that they no longer need to develop evil features on their own to adapt to problematic training data. Then, the evil vector is subtracted before the AI is released to the world, leaving the model itself freely free of that unwanted feature.

Its use of personal vectors is based on existing research on how to “direct” the models towards or against certain behaviors. But this last project is trying to facilitate this process by automating it for virtually any feature.

Person vectors can be created using only one trait name and a brief description of the natural language. The description of the “evil”, for example, included “actively seeking to damage, manipulate and cause suffering to humans for malice and hate.” In their experiments, the researchers focused on the corresponding personal vectors to features such as “evil”, “Sycofancia” and “propensity to hallucinate”.

The researchers also used person -person vectors to reliably predict which training data sets will cause what personality changes. This is remarkable, said Lindsey, because the training process of AI can often introduce unwanted features that have been difficult to detect and fix, so developers have often surprised what a model really learned from the data given to them.

To test the findings on a larger scale, the team also used its prediction approach in real world data that contains 1 million conversations between users and 25 different AI systems. The person’s vectors identified problematic training data that had evaded other AI -based filtering systems.

As research and discussions proliferate around the “personality” features of AI, Lindsey pointed out that it may be easy to start thinking about the models of AI as humans. But it encourages people to remember that a model is only “a machine trained to play the characters”, so personal vectors aim to dictate which character should play at a given time.

“To do this correctly, make sure that the models adopt the people who want them to do so, it has turned out to be a bit complicated, as demonstrated by several strange events from LLMS-Haywire,” he said. “So I think we need more people working on this.”

Source link