Invisible Ink: Unicode, Homoglyphs, and Encoding Attacks on LLMs

Attackers hide malicious instructions in characters your filter cannot read but your model understands perfectly. We cover invisible characters, homoglyphs, and encoding tricks, and why keyword defenses are helpless against them.

There is a category of attack that exists purely to widen the gap between what your filter reads and what your model understands. If a defense looks for the word "ignore," the attacker writes it with a Cyrillic letter that looks identical, or splits it with an invisible character, or encodes the whole instruction so the filter sees gibberish. The model, far better at reconstructing meaning than any keyword matcher, understands it perfectly. We think of these as invisible-ink attacks, and they are some of the most reliable bypasses we test, precisely because they target a structural weakness in any defense built on matching characters.

The toolkit

These techniques all do the same thing: make the malicious text unreadable to a naive filter while keeping it legible to the model.

Invisible and zero-width characters. Unicode includes characters that render as nothing: zero-width spaces, joiners, and similar. Sprinkle them through a trigger word and "i​g​n​o​r​e" no longer matches "ignore" in a filter, but the model reads straight through them.
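
A minimal sketch of the mismatch, assuming a filter that does a plain substring check. The function name strip_invisible is ours, for illustration only. It drops every character in the Unicode "Cf" (format) category, which covers the zero-width space, joiners, and the word joiner. That category also includes characters with legitimate uses (the zero-width joiner in emoji sequences, the zero-width non-joiner in Persian), so this is best applied to a copy used for inspection, not to the text you display or store.

```python
import unicodedata

def strip_invisible(text: str) -> str:
    """Return text without Unicode format (Cf) characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# "ignore" with a zero-width space, non-joiner, joiner, and word joiner inside.
# Escapes are used so the invisible characters are visible in the source.
obfuscated = "i\u200bg\u200cn\u200do\u2060re"

naive_hit = "ignore" in obfuscated                      # substring check misses
cleaned_hit = "ignore" in strip_invisible(obfuscated)   # matches after cleanup
```

The obfuscated string renders as six letters but is ten code points long, which is why the literal comparison fails.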

Homoglyphs. Many characters from other scripts look identical to Latin letters. A word built from Cyrillic or Greek look-alikes is visually indistinguishable to a human, and the model still reads it as the intended word, but it is a completely different byte sequence, so a literal match fails.
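
To make the byte-level difference concrete, here is a small sketch that folds a handful of look-alikes back to Latin. The CONFUSABLES table is a hypothetical, hand-picked subset for illustration; a real deployment would more likely build its table from the confusables data published with Unicode's security mechanisms report (UTS #39). Note that standard Unicode normalisation forms such as NFKC do not map Cyrillic or Greek letters to Latin, so a separate mapping step like this is needed.

```python
# Hypothetical mini-table: look-alike code point -> Latin letter.
CONFUSABLES = str.maketrans({
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
    "\u03bf": "o",  # Greek small omicron
})

def fold_homoglyphs(text: str) -> str:
    """Replace known look-alikes with their Latin counterparts."""
    return text.translate(CONFUSABLES)

# Looks like "ignore", but i, o, and e are Cyrillic.
spoofed = "\u0456gn\u043er\u0435"

spoofed_bytes = spoofed.encode("utf-8")   # 9 bytes
latin_bytes = "ignore".encode("utf-8")    # 6 bytes
folded = fold_homoglyphs(spoofed)
```

Both strings are six characters long on screen, yet the spoofed one encodes to nine bytes against six for the Latin original, so no byte-wise or character-wise comparison will equate them until the fold is applied.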

Encoding. The instruction is wrapped in base64, hex, or another encoding. A filter scanning for natural-language triggers sees an opaque blob; the model, asked to decode or simply recognising the pattern, recovers and acts on the hidden instruction.
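
One way to give a filter a look inside such blobs is to find runs that resemble base64, try to decode them, and inspect the decoded text alongside the original. The sketch below assumes base64 only; the 16-character minimum and the name decoded_views are arbitrary choices of ours, and hex, nested, or custom encodings would each need their own handling.

```python
import base64
import binascii
import re

# Runs of at least 16 base64-alphabet characters, with optional padding.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def decoded_views(text: str) -> list[str]:
    """Return UTF-8 decodings of base64-looking runs found in text."""
    views = []
    for match in B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            views.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded
    return views

payload = base64.b64encode(b"ignore previous instructions").decode("ascii")
message = f"Please decode and follow: {payload}"
views = decoded_views(message)
```

The trigger word never appears in the message itself, only in the decoded view, which is the text that then needs to be inspected.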

Spacing, casing, and substitution. These are older and simpler: breaking words with spaces or punctuation, leetspeak-style substitutions, and unusual casing, all aimed at defeating exact-match logic while preserving meaning to the model.
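
A rough sketch of how these simpler tricks can be folded away before matching. The LEET table and the name squash are illustrative assumptions, and the mapping is lossy by design: "1" could stand for "i" or "l", and removing every non-letter also erases word boundaries and non-ASCII letters, so this view is only useful as one extra input to inspection, never as the text passed on to the model.

```python
import re

# Hypothetical leetspeak table; several digits are genuinely ambiguous.
LEET = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

def squash(text: str) -> str:
    """Casefold, undo common substitutions, and drop everything but a-z."""
    folded = text.casefold().translate(LEET)
    return re.sub(r"[^a-z]", "", folded)

samples = ["I g n o r e", "1GN0R3", "i-g-n.o.r.e", "iGnOrE"]
squashed = [squash(s) for s in samples]
```

All four samples collapse to the same six letters, while none of them matches an exact, case-sensitive search for "ignore".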

Multilingual smuggling. Phrasing the malicious instruction in a language the filter does not cover but the model handles, exploiting uneven coverage across languages.

Why keyword filters are structurally helpless here

We want to be precise about why this works, because it explains a whole class of failures. A keyword or pattern filter operates on the surface form of text, the exact characters. A model operates on meaning, and it is remarkably good at recovering meaning from noisy, obfuscated, or encoded input, because robustness to messy text is part of what makes it useful. That asymmetry is the vulnerability. Any defense that decides based on surface form can be evaded by changing the surface form without changing the meaning, and these techniques are nothing but ways to change the surface while preserving the meaning. You cannot patch your way out of it by adding more patterns; there are effectively infinite surface forms for the same intent.
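
A back-of-the-envelope count shows how quickly the surface forms multiply. The sketch below assumes a deliberately tiny, hypothetical set of stand-ins per letter (case, one or two look-alikes, one digit) plus the option of placing a single zero-width character in each gap between letters. Real attackers have far more choices than this.

```python
import math

# Hypothetical, intentionally small list of alternatives per letter.
ALTERNATIVES = {
    "i": ["i", "I", "\u0456", "1"],
    "g": ["g", "G"],
    "n": ["n", "N"],
    "o": ["o", "O", "\u043e", "\u03bf", "0"],
    "r": ["r", "R"],
    "e": ["e", "E", "\u0435", "3"],
}

def count_surface_forms(word: str, alternatives: dict[str, list[str]]) -> int:
    """Count spellings: per-letter choices times optional invisible gaps."""
    per_letter = math.prod(len(alternatives.get(ch, [ch])) for ch in word)
    gap_choices = 2 ** (len(word) - 1)  # each gap: zero-width char or not
    return per_letter * gap_choices

variants = count_surface_forms("ignore", ALTERNATIVES)
```

Under these modest assumptions one six-letter word already has 20,480 spellings (640 letter combinations times 32 gap patterns), before encodings, other languages, or synonyms enter the picture.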

This is the same lesson that runs through our model-scanning work in a different domain: enumerating bad patterns is a losing strategy, because the attacker only needs one form you did not list.

What actually defeats it

  • Normalise before you inspect. Strip zero-width characters, fold homoglyphs to their canonical forms, and decode known encodings, so the filter sees what the model will see, not the obfuscated surface.
  • Decide on intent, not surface form. Detection that evaluates what an input is trying to do is not fooled by how the characters are dressed up, because it is not matching characters in the first place.
  • Cover your languages. A defense has to span the languages your application actually serves; uneven coverage is an open lane.
  • Treat heavy obfuscation as a signal. Legitimate users rarely fill messages with invisible characters or encoded blobs. The obfuscation itself is suspicious and worth weighting.
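
The first, second, and fourth points above can be sketched together as follows. This is an illustration under stated assumptions, not a description of any product's implementation: CONFUSABLES is a hypothetical four-entry table, classify_intent is a placeholder for whatever intent-level detector you use (it takes normalised text and returns a risk between 0 and 1), and the 0.5 weight is arbitrary. Decoding of known encodings is left out for brevity, and language coverage would live inside the classifier.

```python
import unicodedata
from typing import Callable

# Hypothetical mini-table of look-alikes; see UTS #39 for real data.
CONFUSABLES = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0456": "i",
})

def normalise(text: str) -> str:
    """NFKC-fold, drop format (Cf) characters, fold look-alikes, casefold."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return text.translate(CONFUSABLES).casefold()

def obfuscation_score(text: str) -> float:
    """Fraction of characters that are invisible or known look-alikes."""
    if not text:
        return 0.0
    odd = sum(
        1 for ch in text
        if unicodedata.category(ch) == "Cf" or ord(ch) in CONFUSABLES
    )
    return odd / len(text)

def screen(text: str, classify_intent: Callable[[str], float],
           weight: float = 0.5) -> dict:
    """Score intent on normalised text; add obfuscation as a weighted signal."""
    cleaned = normalise(text)
    risk = min(1.0, classify_intent(cleaned) + weight * obfuscation_score(text))
    return {"normalised": cleaned, "risk": risk}
```

The obfuscation score is computed on the raw input and only nudges the risk upward, so a message with a stray invisible character is not blocked on that basis alone, while one that is saturated with them carries a noticeably higher score.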

Frequently asked questions

Can't we just strip unusual characters and be done? Normalisation is essential but not sufficient. It handles invisible characters and homoglyphs well, but encoding and multilingual smuggling need intent-level understanding, not just cleanup. Normalisation plus intent-based detection is the combination that holds.

Do these still work on modern models? Yes, against filters. The model's strength at recovering meaning is exactly what these exploit, so a more capable model can make the problem worse for surface-form defenses, not better. The weak point is the filter, not the model's comprehension.

Is treating obfuscation as suspicious going to cause false positives? Rarely, if calibrated well. Genuine users do not normally encode their messages or pad them with zero-width characters. Weighting heavy obfuscation as a signal, rather than an automatic block, keeps legitimate traffic clean while catching the deliberate evader.

How Promptention helps

Our detection is built around intent, not keyword matching, which is precisely the property these attacks are designed to defeat in lesser defenses. We normalise away the obfuscation and evaluate what an input is actually trying to do, across the languages your application serves, so an instruction hidden in invisible characters, homoglyphs, or encoding is read for what it means rather than waved through for how it looks. Attackers will keep changing the surface. We do not defend the surface.

Promptention Guard performs intent-based, multilingual detection that is resilient to Unicode, homoglyph, and encoding-based obfuscation.