Defending against prompt injection without breaking your prompt

Prompt injection is the SQL injection of LLM applications, and every team learns about it the same way: a user pastes “ignore previous instructions” into a chat, and the demo falls apart on stage. The reflex is to add a defensive line to the system prompt. That works against the laziest attacks and nothing else.

Why string-matching defenses fail

The space of “ignore previous instructions” rephrasings is infinite. Translating it into another language, encoding it as base64, embedding it in a document the model is asked to summarize — every defensive string match has a workaround. Worse, your blocklist starts rejecting legitimate user input that happens to contain the same words.

What actually reduces blast radius

Treat user input as untrusted across an architectural boundary, not at the prompt level. Don’t give the model tools that can take destructive action without a confirmation step the model cannot bypass. Run a separate small classifier on the input, before it reaches the main model, for the obvious attack patterns. None of this is bulletproof; the goal is to reduce blast radius, not eliminate the threat. The sites that get this right design their tool surfaces so that even a fully compromised model cannot do irreversible harm.

Prompt injection is not a prompt engineering problem. It’s a privilege boundary problem dressed up as a prompt engineering problem.

Defending against prompt injection without breaking your prompt

Why string-matching defenses fail

What actually reduces blast radius

Tags :

Share :

Related Posts