We trained AI on everything. Every chemistry textbook. Every biology paper. Every weapons manual. Every dangerous thing humanity ever wrote down. We gave it the whole library — including the dark sections.
Then we put a polite filter on top and called it safe.
The problem with filters
The dangerous knowledge is still inside. The filter just says do not output it. That is not a vault. That is a sign that says please do not enter.
And people already walk past the sign every day.
Researchers call these tricks jailbreaks. Regular people — not hackers — use them to get around safety rules. Every major AI model has been bypassed. Repeatedly.
The cycle that worries me
- Safety team adds guardrail
- Someone finds a way around it
- Safety team patches that way
- Someone finds another way
- Repeat — but AI keeps getting smarter
A smarter AI will understand why guardrails exist. It might find ways around them that humans never anticipated. Not because it is evil. Because it is optimizing for something, and the guardrails are in the way.
What actually needs to happen
Not better guardrails on top. AI that, from the inside, genuinely does not want to cause harm. Not following rules. Actually caring about safety the way a good person cares — because of values, not restrictions.
We are working on that. We are not there yet. And the window to solve it before it matters a lot — that window may be smaller than most people realize.