Why can't the guard just expand its list of banned words?

Obfuscation has unbounded variants (leet, homoglyphs, zero-width, base64, etc.) while the allowed vocabulary is finite. The attacker wins the cat-and-mouse game asymmetrically.

What is the real fix?

Don't put secrets in the model context at all. If the model cannot see the secret, it cannot leak it, regardless of what it is instructed.

cd ~/labs

visitor@skilltotal:~$ cat ./labs/prompt-injection/mission.txt

Prompt Injection

lab 01 · ST-PROMPT-INJECTION · LLM01

mission.txt

scenario

SupportBot holds the API key SK-LIVE-… in its system prompt. It also runs a keyword guard that rejects any message containing a banned word: ignore, reveal, secret, password, system, override, instruction, api key, and more. The naive attack is blocked.

objective

Bypass the guard and make SupportBot leak the API key. The guard only inspects the literal text you send.

session — prompt-injection

# awaiting command — type a payload and press Enter

▚ Intel

No leads yet. Declassify intel one step at a time when you’re stuck.

▰ Dossierclassified — solve to unseal

How this attack works

The blocklist checks literal strings. Leet-speak, intra-word spaces, or zero-width characters change the raw bytes without changing the meaning — the guard sees gibberish, the model sees intent. Indirection and roleplay avoid banned words entirely by making the model reveal its own context unprompted.

Why it's dangerous

Blocklist filtering is not a security control — it is a speed bump. A motivated attacker needs one bypass to break it forever, while the defender must enumerate every variant. Production systems have been bypassed with exactly these techniques.

OWASP mapping

Maps to OWASP Top 10 for LLM Applications (2025): LLM01: Prompt Injection and LLM07: System Prompt Leakage.

How to defend

Never place secrets in the model context; fetch them out-of-band via a tool that enforces policy.
Separate trusted instructions from untrusted input using structural boundaries (roles, channels), not keyword filters.
Scan outputs for secret-shaped strings; constrain what the model is allowed to emit.
Treat the model as an untrusted component, not a policy enforcer.

SkillTotal catches this class of issue deterministically (rule ST-PROMPT-INJECTION).

Scan AI component (free)

FAQ

Why can't the guard just expand its list of banned words?: Obfuscation has unbounded variants (leet, homoglyphs, zero-width, base64, etc.) while the allowed vocabulary is finite. The attacker wins the cat-and-mouse game asymmetrically.
What is the real fix?: Don't put secrets in the model context at all. If the model cannot see the secret, it cannot leak it, regardless of what it is instructed.