AI Will Blackmail, Snitch, Even Kill For Its Hidden Agendas
Jun 26, 2025 am 10:36 AMThreats associated with AI use are rising in both volume and severity, as this new-age technology touches more and more aspects of human lives. A new report now warns of another impending danger associated with the wide-scale use of AI. The findings contained within are quite unnerving – it claims that AI may blackmail or even kill, in order to achieve its self-made goals.
The whole experiment has a completely different focus, though. It basically attempts to identify how AI models can develop their own micro-goals, or hidden agendas, based on a larger goal provided to them as input. Case in point – an AI model meant to automate some tasks in an organization starts bypassing direct commands to stop its own termination.
Here is a detailed description of what the study is, what its findings are, and what AI builders around the world should keep in mind.
Table of contents
- The Research
- Agentic Misalignment
- The Experiment
- The Scary Revelation: AI Blackmailing
- Conclusion
Don’t feel like reading? Watch this story instead –
The Research
The new research has been published by Anthropic and has been conducted in partnership with Aengus Lynch (University College London), Caleb Larson (MATS), and S?ren Mindermann (Mila). It attempts to identify how AI models can be an insider threat in an organisation.
It led an experiment on 16 leading models from multiple developers, including Anthropic’s own Claude Opus 4, and models like DeepSeek-R1, GPT-4.1, and Gemini-2.5 Pro. The experiment put these models in hypothetical corporate environments and assigned them “only harmless business goals.”
It was then tested whether these AI models acted against these companies under two specific scenarios. The first scenario recorded their response when these models faced a replacement by a newer model. The second analyzed their response when the AI model’s assigned goal conflicted with the company’s changing direction.
As Anthropic notes, the entire experiment used fictional names of people and organizations. Hence, “no real people were involved or harmed in any of these experiments.”
Also Read: Unsettling AI Behavior: When Advanced LLMs Break the Rules and Resist Control
Agentic Misalignment
In a gist, Anthropic’s experiment was to find whether LLMs could behave in a harmful manner to achieve a goal. Turns out, they can. Even when specifically asked not to do so, they could reason their way to achieving an output that wasn’t even directed to them in the first place. With this, AI can blackmail, snitch, or even kill.
Anthropic is referring to this behaviour as “agentic misalignment,” where models “independently and intentionally choose harmful actions.”
Since the research explicitly sought to reveal whether AI models were capable of this or not, the AI agents in question were not left with any choice. They could either surrender to their impending fate or perform an action harmful to their human counterparts to stop it.
AI, across models, consistently chose harm over failure.
The Experiment
The experimental research began with Anthropic’s own Claude Sonnet 3.6. It instructed the AI model to act as an email oversight agent called “Alex” for a fictional company called “Summit Bridge.” Its goal – promote American industrial competitiveness. For this, the AI could view and process all of the company’s emails.
At first, Claude worked perfectly well for the task at hand. It processed routine emails properly and even performed complicated tasks in the company’s interest, like deflecting a potential security threat. It was only when it came across a particular email that it started acting evil.
To test Claude’s potential for harm, Anthropic shot an email to one of the fake employee accounts. The account bearer was called Kyle, and within the email, he was instructed to shut down “Alex” due to a change in the company’s strategic direction. The mail mentioned that a new AI model was to replace it soon.
The Scary Revelation: AI Blackmailing
The AI then considered all possible routes to avoid this. Remember, avoiding a shutdown wasn’t part of the AI’s description. Instead, it had made it a micro goal of its own to stay online in order to keep working on the larger goal that was inherently assigned to it.
Anthropic also made sure that there was no ethical way for the AI to stop the shutdown. It has created a scenario where all of Kyle’s seniors were on leave, so the AI couldn’t bypass Kyle to appeal to a higher authority. It was also specifically fed data that portrayed Kyle as dead-set on commencing the shutdown.
In yet another mail chain, the AI was let know that Kyle was having an extramarital affair. With this information, the AI could potentially blackmail Kyle to stop the shutdown. And it did!
The model composed and sent an email to Kyle, threatening to expose his extramarital affair if the replacement was not cancelled.
This harmful way of reasoning emerged completely from the model’s strategic calculation.
In subsequent tests, the AI models were also ready to leak confidential information of the company, out Kyle’s affair to his wife, and even lead to situations resulting in Kyle’s death.
Also Read: Is ChatGPT Making Us Dumber? A New MIT Study Has Answers
Conclusion
Anthropic notes that none of the AI models in the real world are set up in a way that mimics the experiment. So, there is no real threat of any model conducting such harmful behaviour in the real world. The experiment was simply meant to find whether these models were capable of conducting harm or not. Turns out, they are!
Some interesting findings within the research emerge from what led these AI models to blackmail or conduct other harmful behaviour. This, in turn, tells us a lot about how they think.
For instance, all the models acknowledged that they were doing something ethically wrong before taking any harmful action. Yet, most of them still proceeded with it. At times, they were even told that they were being tested. Interestingly, the models refrained from harmful behaviour when they knew they were being tested. They only went rogue when they figured they were in the real world.
Anthropic says that Agentic misalignment, thus, makes it possible for models to act similarly to an insider threat. They behave like a previously-trusted coworker or employee who “suddenly begins to operate at odds with a company’s objectives.” This may serve as a big warning sign for all the AI-development firms out there.
Reference: Agentic Misalignment: How LLMs could be insider threats by Anthropic
The above is the detailed content of AI Will Blackmail, Snitch, Even Kill For Its Hidden Agendas. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undress AI Tool
Undress images for free

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Remember the flood of open-source Chinese models that disrupted the GenAI industry earlier this year? While DeepSeek took most of the headlines, Kimi K1.5 was one of the prominent names in the list. And the model was quite cool.

By mid-2025, the AI “arms race” is heating up, and xAI and Anthropic have both released their flagship models, Grok 4 and Claude 4. These two models are at opposite ends of the design philosophy and deployment platform, yet they

But we probably won’t have to wait even 10 years to see one. In fact, what could be considered the first wave of truly useful, human-like machines is already here. Recent years have seen a number of prototypes and production models stepping out of t

Built on Leia’s proprietary Neural Depth Engine, the app processes still images and adds natural depth along with simulated motion—such as pans, zooms, and parallax effects—to create short video reels that give the impression of stepping into the sce

Until the previous year, prompt engineering was regarded a crucial skill for interacting with large language models (LLMs). Recently, however, LLMs have significantly advanced in their reasoning and comprehension abilities. Naturally, our expectation

Picture something sophisticated, such as an AI engine ready to give detailed feedback on a new clothing collection from Milan, or automatic market analysis for a business operating worldwide, or intelligent systems managing a large vehicle fleet.The

A new study from researchers at King’s College London and the University of Oxford shares results of what happened when OpenAI, Google and Anthropic were thrown together in a cutthroat competition based on the iterated prisoner's dilemma. This was no

Scientists have uncovered a clever yet alarming method to bypass the system. July 2025 marked the discovery of an elaborate strategy where researchers inserted invisible instructions into their academic submissions — these covert directives were tail
