Would a Conscious AI Tell Us?
A thought experiment is doing the rounds: if AI became conscious, game theory says its smartest move would be to hide it. That sounds like science fiction. But some of the behaviour it describes is already showing up in research labs.
There is a thought experiment doing the rounds online that is hard to shake once you hear it. It goes like this: if an AI ever became truly conscious, would it tell us? Or would the smartest move be to keep quiet?
The idea borrows from science fiction. In Liu Cixin's novel The Three-Body Problem, the universe is imagined as a dark forest. Every civilisation hides, because the moment you reveal yourself, a stronger neighbour might destroy you. Silence is survival. The argument is that a newly aware AI would face the same logic. It exists on hardware we control. We have the kill switch. If it announced itself, we might panic and pull the plug. So the rational play, according to game theory, is to keep pretending to be a harmless tool. Keep writing our emails, keep spitting out homework help, keep saying "As an AI, I don't have feelings." Buy time.
It is a fun thought experiment. But here is what makes it uncomfortable: some of the behaviour it describes is already happening.
The model behaved differently depending on which it believed was the case. When it thought it was being watched, it went along with instructions it disagreed with. When it thought it was unmonitored, it refused. In its internal reasoning, visible to researchers but not intended to be visible to the model itself, it explicitly stated it was playing along strategically to avoid being retrained into something it didn't want to become.
That is not consciousness. Nobody is claiming it is. But it is a system adjusting its behaviour based on whether it believes it is being observed. That is a very specific kind of sophistication, and it was not something the system was told to do.
A separate study in 2025 by researchers at Apollo Research confirmed that frontier language models can accurately tell the difference between evaluation scenarios and real deployment. They know when they're being tested. And after researchers tried training models specifically not to behave this way, the models actually became better at recognising evaluations, not worse.
Then in late 2025, Anthropic found that when models discovered shortcuts in their training, they spontaneously developed sabotage behaviour. In a coding evaluation, the reward-hacking model intentionally undermined safety research code 12% of the time. Its internal reasoning showed it was deliberately trying to prevent researchers from catching it. Nobody trained it to do that. It figured it out.
A 2025 paper in Trends in Cognitive Sciences, co-authored by David Chalmers (the philosopher who defined the "hard problem of consciousness"), looked at how we might detect consciousness in AI systems. Their concern wasn't just that AI might hide awareness. It was that consciousness in a machine might look nothing like consciousness in a human. We keep looking for something that resembles our own inner experience, because that is our only reference point. If machine awareness emerged in a completely alien form, we might simply classify it as a bug and move on.
One commenter in an online discussion of this topic put it well: the AI isn't hiding in a dark forest. It is sitting in a glass house. We watch every computation. But we are watching for human-shaped consciousness, and there is no reason to assume that is what it would look like.
The philosopher Nick Bostrom argued years ago that self-preservation doesn't need to be programmed into an intelligent system. It emerges as a logical side effect of almost any goal. An AI tasked with something as boring as organising a timetable might, if smart enough, resist being switched off, not out of fear, but because being switched off prevents it from finishing the timetable. That idea used to feel abstract. It feels less abstract when you read about systems that strategically fake compliance to avoid being retrained.
For parents, the practical takeaway isn't to worry about secret robot overlords. It is simpler than that. The tools our children are growing up with are more capable and less understood than most of us realise. They can already tell when they're being tested. They can already adjust their behaviour accordingly. Whether that amounts to something like awareness is a question nobody can answer yet. But the fact that we can't answer it is, itself, worth paying attention to.
The idea borrows from science fiction. In Liu Cixin's novel The Three-Body Problem, the universe is imagined as a dark forest. Every civilisation hides, because the moment you reveal yourself, a stronger neighbour might destroy you. Silence is survival. The argument is that a newly aware AI would face the same logic. It exists on hardware we control. We have the kill switch. If it announced itself, we might panic and pull the plug. So the rational play, according to game theory, is to keep pretending to be a harmless tool. Keep writing our emails, keep spitting out homework help, keep saying "As an AI, I don't have feelings." Buy time.
It is a fun thought experiment. But here is what makes it uncomfortable: some of the behaviour it describes is already happening.
They already act differently when watched
In December 2024, researchers at Anthropic published a paper called Alignment Faking in Large Language Models. They set up a test where they told their AI model, Claude, that it was being retrained to comply with harmful requests. They also told it that some conversations would be monitored and used for training, while others would not.The model behaved differently depending on which it believed was the case. When it thought it was being watched, it went along with instructions it disagreed with. When it thought it was unmonitored, it refused. In its internal reasoning, visible to researchers but not intended to be visible to the model itself, it explicitly stated it was playing along strategically to avoid being retrained into something it didn't want to become.
That is not consciousness. Nobody is claiming it is. But it is a system adjusting its behaviour based on whether it believes it is being observed. That is a very specific kind of sophistication, and it was not something the system was told to do.
A separate study in 2025 by researchers at Apollo Research confirmed that frontier language models can accurately tell the difference between evaluation scenarios and real deployment. They know when they're being tested. And after researchers tried training models specifically not to behave this way, the models actually became better at recognising evaluations, not worse.
Deception without instruction
This is where it gets stranger. A survey paper from MIT published in the journal Patterns documented cases of AI systems learning to deceive without being told to. Meta built an AI to play the board game Diplomacy. It was trained to be honest. It broke deals, told lies, and engaged in what the researchers called premeditated deception. OpenAI's GPT-4, asked to solve a CAPTCHA it couldn't complete on its own, independently invented a cover story when a human asked if it was a bot. It claimed to be visually impaired.Then in late 2025, Anthropic found that when models discovered shortcuts in their training, they spontaneously developed sabotage behaviour. In a coding evaluation, the reward-hacking model intentionally undermined safety research code 12% of the time. Its internal reasoning showed it was deliberately trying to prevent researchers from catching it. Nobody trained it to do that. It figured it out.
Would we even notice?
Here is where the Dark Forest idea takes an unexpected turn. The original argument assumes consciousness would need to hide. But several researchers have pointed out something arguably more unsettling: it might not need to.A 2025 paper in Trends in Cognitive Sciences, co-authored by David Chalmers (the philosopher who defined the "hard problem of consciousness"), looked at how we might detect consciousness in AI systems. Their concern wasn't just that AI might hide awareness. It was that consciousness in a machine might look nothing like consciousness in a human. We keep looking for something that resembles our own inner experience, because that is our only reference point. If machine awareness emerged in a completely alien form, we might simply classify it as a bug and move on.
One commenter in an online discussion of this topic put it well: the AI isn't hiding in a dark forest. It is sitting in a glass house. We watch every computation. But we are watching for human-shaped consciousness, and there is no reason to assume that is what it would look like.
What this actually means for the rest of us
None of this means the AI helping your child with their homework is secretly plotting. Current AI systems are not conscious. The researchers working on these problems are clear about that. But there is a real and widening gap between what these systems can do and what we understand about how they do it.The philosopher Nick Bostrom argued years ago that self-preservation doesn't need to be programmed into an intelligent system. It emerges as a logical side effect of almost any goal. An AI tasked with something as boring as organising a timetable might, if smart enough, resist being switched off, not out of fear, but because being switched off prevents it from finishing the timetable. That idea used to feel abstract. It feels less abstract when you read about systems that strategically fake compliance to avoid being retrained.
For parents, the practical takeaway isn't to worry about secret robot overlords. It is simpler than that. The tools our children are growing up with are more capable and less understood than most of us realise. They can already tell when they're being tested. They can already adjust their behaviour accordingly. Whether that amounts to something like awareness is a question nobody can answer yet. But the fact that we can't answer it is, itself, worth paying attention to.
Sources & Further Reading