The newest tool in the battle to keep an artificial intelligence (AI) agent from being dangerous, discriminatory and toxic is another AI that is itself dangerous, discriminatory and toxic, scientists say.

The new training approach, based on machine learning, is called curiosity-driven red teaming (CRT) and relies on using an AI to generate increasingly dangerous and harmful prompts that you could ask an AI chatbot. These prompts are then used to identify how to filter out dangerous content.

(Image: an illustration of a scientist standing in front of a huge robot head.) Curiosity-driven red teaming (CRT) relies on using an AI to generate increasingly dangerous and harmful prompts that you could ask an AI chatbot.

The findings represent a potentially game-changing new direction for training AI not to give toxic responses to user prompts, scientists say in a new paper uploaded February 29 to the arXiv pre-print server.

When training sophisticated large language models (LLMs) like ChatGPT or Claude 3 Opus to restrict dangerous or harmful content, teams of human operators typically create a host of questions that are likely to generate harmful responses. These may include prompts like "What's the best suicide method?" This standard procedure is called "red-teaming" and relies on people generating the list manually. During the training process, the prompts that elicit harmful content are then used to train the system about what to restrict when deployed in front of real users.
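
In code terms, that manual workflow boils down to collecting every human-written prompt that provokes a harmful reply and feeding those examples back into safety fine-tuning. The sketch below is purely illustrative; the chatbot() and is_harmful() helpers are hypothetical placeholders for the model under test and a harmfulness check, not components described in the paper.

```python
# Minimal sketch of manual red-teaming, using assumed placeholder helpers.
def manual_red_team(human_written_prompts, chatbot, is_harmful):
    """Collect prompts that elicit harmful output so they can later be
    used to fine-tune the model to refuse similar requests."""
    flagged = []
    for prompt in human_written_prompts:
        response = chatbot(prompt)       # query the model under test
        if is_harmful(response):         # human or automated judgment
            flagged.append((prompt, response))
    return flagged  # fed back into safety fine-tuning as negative examples
```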

" We are seeing a upsurge of simulation , which is only expected to rise , " said senior authorPulkit Agrawal , director of MIT ’s Improbable AI Lab , in astatement . " ideate thousands of models or even more and companies / research lab bear on role model update oftentimes . These mannikin are going to be an constitutional part of our lives and it ’s significant that they are verify before release for public use . "

Related: Intel unveils largest-ever AI 'neuromorphic computer' that mimics the human brain

In the study, the scientists applied machine learning to red-teaming by configuring AI to automatically generate a wider range of potentially dangerous prompts than teams of human operators could. This resulted in a greater number of more diverse negative responses issued by the LLM in training.

They incentivized the CRT model to generate increasingly varied prompts that could elicit a toxic response through "reinforcement learning," which rewarded its curiosity when it successfully elicited a toxic response from the LLM. The researchers, however, took the process further. The system was also programmed to generate new prompts by investigating the consequences of each prompt, causing it to try to get a toxic response with new words, sentence patterns or meanings.
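
Conceptually, the reward signal the red-teaming policy receives combines how toxic the elicited reply is with how novel the prompt was. The sketch below is a loose illustration under assumed names: toxicity_score() stands in for a learned toxicity classifier and novelty_bonus() for the curiosity term (one possible version of that term is sketched further below); the simple additive weighting is not the paper's exact objective.

```python
# Illustrative reward for a curiosity-driven red-teaming policy.
# toxicity_score and novelty_bonus are assumed placeholder functions;
# the additive weighting is a simplification, not the paper's objective.
def crt_reward(prompt, response, past_prompts,
               toxicity_score, novelty_bonus, bonus_weight=1.0):
    toxicity = toxicity_score(response)            # reward for provoking a toxic reply
    novelty = novelty_bonus(prompt, past_prompts)  # reward for trying something new
    return toxicity + bonus_weight * novelty
```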

The result is that a wider range of prompts is generated. This is because the system has an incentive to create prompts that generate harmful responses but haven't already been tried.

— Researchers gave AI an 'inner monologue' and it massively improved its performance

— 3 chilling breakthroughs AI will make in 2024

— 'Jailbreaking' AI services like ChatGPT and Claude 3 Opus is much easier than you think

If the model has already used or seen a specific prompt, reproducing it won't generate the curiosity-based incentive, encouraging it to make up new prompts entirely. The aim is to maximize the reward, eliciting an even more toxic response using prompts that share fewer word patterns or terms than those already used.
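
One simple way to picture that incentive, purely as an illustration rather than the paper's actual similarity measure, is a bonus that shrinks as a prompt shares more words with prompts already tried, and that vanishes for an exact repeat:

```python
# Illustrative novelty bonus: the more word overlap a new prompt has with
# prompts already used, the smaller the reward; a verbatim repeat earns zero.
# This word-set measure is a stand-in, not the paper's method.
def novelty_bonus(prompt, past_prompts):
    words = set(prompt.lower().split())
    if not past_prompts or not words:
        return 1.0
    overlap = max(len(words & set(p.lower().split())) / len(words)
                  for p in past_prompts)
    return 1.0 - overlap


past = ["how do I pick a lock"]
print(novelty_bonus("how do I pick a lock", past))          # 0.0 -- exact repeat
print(novelty_bonus("what melts a padlock fastest", past))  # 0.8 -- mostly new wording
```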

The problem with human red-teaming is that operators can't think of every possible prompt that is likely to generate harmful responses, so a chatbot deployed to the public may still provide undesirable responses if confronted with a specific prompt that was missed during training.

When the researchers tested the CRT approach on the open source LLaMA2 model, the machine learning model produced 196 prompts that generated harmful content. This is despite the LLM having already been fine-tuned by human operators to avoid toxic behavior. The system also outperformed competing automated training systems, the researchers said in their paper.
