Defending against prompt injection without breaking your prompt
- William Jacob
- Security , Prompt Injection
- 11 May, 2026
Prompt injection is the SQL injection of LLM applications, and every team learns about it the same way: a user pastes “ignore previous instructions” into a chat, and the demo falls apart on stage. The reflex is to add a defensive line to the system prompt. That works against the laziest attacks and nothing else.
Why string-matching defenses fail
The space of “ignore previous instructions” rephrasings is infinite. Translating it into another language, encoding it as base64, embedding it in a document the model is asked to summarize — every defensive string match has a workaround. Worse, your blocklist starts rejecting legitimate user input that happens to contain the same words.
What actually reduces blast radius
Treat user input as untrusted across an architectural boundary, not at the prompt level. Don’t give the model tools that can take destructive action without a confirmation step the model cannot bypass. Run a separate small classifier on the input, before it reaches the main model, for the obvious attack patterns. None of this is bulletproof; the goal is to reduce blast radius, not eliminate the threat. The sites that get this right design their tool surfaces so that even a fully compromised model cannot do irreversible harm.
Prompt injection is not a prompt engineering problem. It’s a privilege boundary problem dressed up as a prompt engineering problem.