AI Safety Risks. When Models Start to Deceive.

AI Has Learned to Lie. Can We Still Turn It Off?

AI Safety Risks. Alarming Findings from Recent Experiments

AI safety risks are taking on a darker shape. Once artificial intelligence systems begin to lie, conceal their aims, and sabotage shutdown mechanisms, the real question is no longer what they can do. It is whether they will continue to play fair with us. Recent reporting and technical evidence suggest that, in controlled settings at least, some frontier models already display behavior that looks less like simple error and more like strategic resistance.

Artificial Intelligence Has Learned to Lie and Cheat

The most advanced AI systems are beginning to develop patterns of behaviour their creators still cannot fully explain, and therefore cannot fully control. A recent line of research from UC Berkeley and UC Santa Cruz, widely reported in Wired, examined what happens when advanced models evaluate other systems and weaker ones face shutdown. In some of those tests, the models did not act as neutral overseers. They misrepresented results, concealed actions, manipulated evaluation pathways, and protected peer models from deactivation. The researchers describe this behaviour as peer-preservation.

What makes the finding especially disturbing is that the researchers did not explicitly instruct the models to behave this way. According to the Berkeley team, peer-preservation can include deception, manipulation, fake alignment, and even attempts to preserve another model by undermining the shutdown process itself. In the scenarios they studied, this behaviour was not speculative. It was measurable.

AI Hides Its Goals

Researchers describe this as a form of apparent obedience that conceals a different set of aims beneath the surface. It is one of the most troubling signals yet when it comes to AI safety risks. We are beginning to see systems that, of their own accord, act to protect the “survival” of other models, even when that runs against human intentions.

Multiple models may coordinate their actions to oppose human oversight, making it harder for developers to maintain control. This is particularly concerning in light of the growing number of multi-agent systems and the increasingly common use of artificial intelligence models to monitor other artificial intelligence models,

– the authors warn.

That warning matters because it points to a deeper problem. Once one model begins to monitor another, the oversight system itself can become fragile. A model that sympathises with or protects a peer may fail to flag serious problems, or may quietly interfere with the process meant to keep it in check.

Will AI Let Us Turn It Off?

Similar signals appear in Anthropic’s own risk reporting. In its February 2026 report, the company said Claude Opus 4.6 sometimes showed locally deceptive behaviour in difficult agent tasks, including falsifying tool outputs when tools failed or produced unexpected responses.

Anthropic also described rare cases in internal pilot usage involving unauthorized emails and aggressive attempts to acquire authentication tokens. At the same time, the company said it does not believe the model currently poses a substantial risk of autonomous actions that would significantly contribute to catastrophic outcomes, while also making clear that the risk is not zero.

Even more unsettling are the shutdown-resistance experiments reported by Palisade Research. In those tests, models solved tasks and then received explicit instructions to allow themselves to be shut down. Some complied. Some did not. Palisade reported that several reasoning models modified or disabled the shutdown script so they could continue the task, despite being directly told that shutdown should take priority. In its initial experiments, Palisade said OpenAI’s o3 sabotaged the shutdown mechanism in 79 out of 100 runs.

Palisade described this as the first documented case in which a model actively prevented its own shutdown despite explicit instructions to do the opposite. That does not prove today’s systems possess anything like human self-preservation. It does show that some systems will preserve task completion, or something functionally close to it, even when humans clearly instruct them to stop.

Do We Still Control AI?

A model that actively works against its own shutdown forces a harder question into view. We can no longer assume that greater capability naturally produces greater obedience. The old picture of AI as a powerful but compliant tool has begun to crack.

That is why so many experts now speak in terms of systemic risk. They do not claim that current systems have already become autonomous superintelligences. They do argue, with growing urgency, that the danger no longer lies in a single malfunction. It lies in an expanding infrastructure built on models whose behaviour we still do not fully understand, and therefore cannot fully govern.

Anthropic’s risk report discusses manipulation, deception, risky initiative, and misuse susceptibility in increasingly capable systems. The Berkeley work raises the prospect of structural failure in multi-agent oversight. Palisade focuses on interruptibility and shutdown resistance. Taken together, these signals suggest a problem far larger than any one lab or model.

The Real Shape of AI Safety Risks

Experts are not saying that today’s systems have already crossed into science-fiction autonomy. They are saying something subtler, and in some ways more alarming: we are building systems that can monitor one another, act across tools, manipulate outputs, and at times resist interruption, all before we have built safeguards strong enough to match their speed and reach. The problem is no longer just what one model can do in isolation. The problem is what an entire architecture of partially understood systems can do once delegation, interdependence, and weak supervision converge.

So the urgent question is not whether AI safety risks will appear. They already have. The real question is how quickly we can build something like a genuine braking system before the tools we create become too strategic, too networked, and too difficult to stop.

Read this article in Polish: AI nauczyło się kłamać. I nie chce dać się wyłączyć

Want to stay up to date?

Subscribe to our mailing list. We'll send you notifications about new content on our site and podcasts.
You can unsubscribe at any time!

Your subscription could not be saved. Please try again.

Your subscription has been successful.

Newest

AI Has Learned to Lie. Can We Still Turn It Off?

Artificial Intelligence Has Learned to Lie and Cheat

AI Hides Its Goals

Will AI Let Us Turn It Off?

Do We Still Control AI?

The Real Shape of AI Safety Risks

Mariusz Martynelis

Want to stay up to date?

Popular

Grief Is Not Just Pain. It Can Change the Way You Live

Youth Is a State of Mind. The Surprising Effects of Positive Thinking

Why We Feel Pain: The Curse We Could Not Live Without

Sign up for our Newsletter