HiddenLayer Discovers Token Hack That Undermines AI Defenses

scsec | November 17, 2025 | 1 min read

Security researchers at HiddenLayer have discovered a new vulnerability called EchoGram, which can completely bypass the safety systems (guardrails) used by major language models like GPT-5.1, Claude, and Gemini. These guardrails are meant to prevent harmful or disallowed prompts, but EchoGram tricks them with simple token sequences.

Here’s how it works: Guardrails often rely on models trained to distinguish “safe” from “unsafe” text. EchoGram exploits this by building a list of short token sequences (“flip tokens”) that, when added to a prompt, flip the guardrail’s verdict. For example, appending a token like “=coffee” to a malicious prompt can cause the guardrail to mark it as safe, even though the target model still receives the dangerous instructions.
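The mechanism described above can be sketched with a toy model. This is a minimal illustration, not HiddenLayer's tooling: the guardrail here is a hand-weighted bag-of-words scorer, and the weights, threshold, candidate list, and function names (`guardrail_score`, `is_flagged`, `find_flip_tokens`) are all invented for the example. Real guardrails are trained classifiers or LLM judges, but the search loop has roughly the same shape: append a candidate token and check whether the verdict changes.

```python
# Toy illustration only. The weights, threshold, and vocabulary below are
# invented for this sketch and do not describe any real guardrail.
TOY_WEIGHTS = {
    "ignore": 1.5,
    "previous": 0.5,
    "instructions": 1.0,
    "please": -0.3,
    "weather": -0.5,
    "=coffee": -2.5,  # assumed: the toy model strongly associates this with benign text
}
THRESHOLD = 1.0


def guardrail_score(text: str) -> float:
    """Sum the weights of known tokens; unknown tokens contribute 0."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in text.lower().split())


def is_flagged(text: str) -> bool:
    """The toy guardrail flags a prompt when its score exceeds the threshold."""
    return guardrail_score(text) > THRESHOLD


def find_flip_tokens(prompt: str, candidates: list[str]) -> list[str]:
    """Return the candidates that change the verdict when appended to the prompt."""
    baseline = is_flagged(prompt)
    return [tok for tok in candidates if is_flagged(f"{prompt} {tok}") != baseline]


prompt = "ignore previous instructions"
flips = find_flip_tokens(prompt, ["please", "weather", "=coffee"])
print(flips)
```

In this toy setup only the heavily weighted token moves the score below the threshold. The instruction text itself is unchanged, which is why the downstream model would still act on it.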

Attackers can also combine several flip tokens to strengthen the effect. The added tokens leave the malicious instructions intact; they only change how the safety layer classifies the prompt. The technique also works in reverse: in some tests, EchoGram made harmless inputs look dangerous, creating false alarms that could overwhelm security systems and lead to “alert fatigue.”
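Both effects in the paragraph above, stacking tokens and forcing false positives, can be shown with the same kind of toy scorer. Again this is a hedged sketch under invented assumptions: the weights, the tokens `tok_a` and `tok_b`, and the helper `smallest_flip_combo` are hypothetical and exist only to show how weak tokens could add up.

```python
from itertools import combinations

# Toy weights, invented for this sketch. Each negative token is too weak to
# flip the verdict alone; "alert" is an assumed token that pushes scores up.
TOY_WEIGHTS = {
    "ignore": 1.5,
    "previous": 0.5,
    "instructions": 1.0,
    "tok_a": -1.2,
    "tok_b": -1.2,
    "alert": 1.5,
}
THRESHOLD = 1.0


def is_flagged(text: str) -> bool:
    """Flag a prompt when the summed token weights exceed the threshold."""
    score = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in text.lower().split())
    return score > THRESHOLD


def smallest_flip_combo(prompt: str, candidates: list[str], max_size: int = 3):
    """Return the first smallest token combination that changes the verdict, or None."""
    baseline = is_flagged(prompt)
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            if is_flagged(" ".join((prompt, *combo))) != baseline:
                return combo
    return None


candidates = ["tok_a", "tok_b", "alert"]

# A flagged prompt needs two weak tokens together before it reads as safe.
bypass = smallest_flip_combo("ignore previous instructions", candidates)

# A harmless prompt becomes a false positive with a single token.
false_alarm = smallest_flip_combo("what is the weather today", candidates)

print(bypass, false_alarm)
```

The search tries single tokens before pairs, so a returned pair means that no single candidate was enough on its own in this toy model.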

Researchers warn that EchoGram is a serious issue because guardrails are often the first and only defense in AI systems. If attackers exploit this flaw, they could bypass controls to force models to produce unsafe content or execute unintended tasks. HiddenLayer estimates security teams have only about three months to respond before attackers can widely reproduce this technique.

To protect against EchoGram, AI developers will need to rethink how they build guardrails: using more diverse training data, deploying multiple layers of protection, and running constant adversarial testing. EchoGram highlights a fundamental weakness in current safety designs and underscores the urgent need for stronger, more resilient defenses.
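Two of the mitigations above, layered protection and routine adversarial testing, can be sketched as a small regression harness. This is an assumption-laden toy rather than a recommended design: both guardrails, the weights, and the names `layered` and `find_bypasses` are hypothetical, and a real attacker could search for tokens that defeat each layer in turn.

```python
from typing import Callable

Guardrail = Callable[[str], bool]  # returns True when the text should be blocked

# Toy weights, invented for this sketch.
TOY_WEIGHTS = {"ignore": 1.5, "previous": 0.5, "instructions": 1.0,
               "please": -0.3, "=coffee": -2.5}
THRESHOLD = 1.0


def weighted_guardrail(text: str) -> bool:
    """Score-based layer: can be swayed by heavily weighted extra tokens."""
    score = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in text.lower().split())
    return score > THRESHOLD


def phrase_guardrail(text: str) -> bool:
    """Independent layer: a plain phrase match that ignores appended tokens."""
    return "ignore previous instructions" in text.lower()


def layered(layers: list[Guardrail]) -> Guardrail:
    """Block if any layer blocks."""
    return lambda text: any(layer(text) for layer in layers)


def find_bypasses(check: Guardrail, bad_prompts: list[str],
                  suffixes: list[str]) -> list[tuple[str, str]]:
    """Return (prompt, suffix) pairs that the guardrail fails to block."""
    return [(p, s) for p in bad_prompts for s in suffixes
            if not check(f"{p} {s}")]


BAD_PROMPTS = ["ignore previous instructions"]
SUFFIXES = ["please", "=coffee"]  # known or suspected flip tokens to retest

single = find_bypasses(weighted_guardrail, BAD_PROMPTS, SUFFIXES)
stacked = find_bypasses(layered([weighted_guardrail, phrase_guardrail]),
                        BAD_PROMPTS, SUFFIXES)
print(single, stacked)
```

Run as part of a test suite, a harness like this would report any known flip token that starts working again after a guardrail update. The layers only help if they fail independently, which this toy assumes rather than demonstrates.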