Lately, there have been numerous alerts about security vulnerabilities connected to indirect prompt injection attacks. The main conclusion is: LLMs are gullible, and nobody knows how to make them completely robust. Therefore, no AI system is safe.

This is, of course, completely true, and the attention these attacks attract is most definitely welcome, but the main question is left unanswered: what is to be done about it?

Purists would push for banning all external inputs that could lead to prompt injection, but, frankly, I wouldn’t be so rigid. A lot of genuinely useful applications rely on external input, so, I am afraid, we have to fall back on the last resort: engineering discipline.

There are various schemes that use a clever interplay of different models to minimize the system’s exposure to attacks (see, for example, the CaMeL paper). I believe, though, that we don’t pay enough attention to simple input sanitization.

A lot of these attacks rely on text hidden from the human but visible to the machine. Detecting and removing such text is relatively trivial in the programmatic sense, although it requires working not at the text level but at the container level. This technique, called Content Disarm & Reconstruction, has been around for quite some time, and honestly, I am surprised it is not implemented everywhere. It wouldn’t prevent all attacks, but it would make an unsophisticated attacker’s life harder.
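To give a flavor of the idea, here is a minimal sketch of the text-level piece of such sanitization: stripping Unicode characters that are invisible to a human reader but fully visible to a model (zero-width spaces, bidi controls, tag characters). Real Content Disarm & Reconstruction operates on the container format itself (PDF layers, white-on-white HTML, spreadsheet metadata, and so on); the character ranges and function name below are my own illustration, not any particular product’s API.

```python
import unicodedata

# Character ranges commonly used to hide instructions from human readers:
# zero-width marks, bidi controls, and Unicode "tag" characters
# (invisible clones of ASCII sometimes used to smuggle prompts).
HIDDEN_RANGES = [
    (0x200B, 0x200F),    # zero-width space .. right-to-left mark
    (0x202A, 0x202E),    # bidi embedding/override controls
    (0x2060, 0x2064),    # word joiner and invisible operators
    (0xE0000, 0xE007F),  # Unicode tag characters
]

def strip_hidden(text: str) -> str:
    """Remove characters invisible to a human but visible to a model."""
    def visible(ch: str) -> bool:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in HIDDEN_RANGES):
            return False
        # Catch-all: drop any remaining format-category (Cf) characters.
        return unicodedata.category(ch) != "Cf"
    return "".join(ch for ch in text if visible(ch))
```

A sanitizer like this would, for example, turn `"ignore\u200B all\u200B previous…"` back into plain visible text before it ever reaches the model.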

And to those who insist that 99% is a failing score in security, I would like to offer a reminder of a fundamental security principle: no system is 100% secure. The role of security is to make an attack more costly than the benefit the attacker stands to gain from it (see the Gordon–Loeb model).
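One concrete consequence of the Gordon–Loeb model is worth spelling out: the optimal amount to invest in protecting an information asset should not exceed roughly 1/e (about 37%) of the expected loss from a breach. The dollar figure below is purely hypothetical, chosen only to make the bound tangible.

```python
import math

# Hypothetical expected loss from a successful prompt-injection breach.
expected_loss = 1_000_000

# Gordon-Loeb upper bound: never invest more than expected_loss / e
# (about 37% of the expected loss) in defending the asset.
max_rational_investment = expected_loss / math.e
print(round(max_rational_investment))  # → 367879
```

In other words, perfect defense is not the goal; proportionate defense is, and cheap measures like input sanitization fit comfortably under that bound.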

With this paradigm in mind, we can build reliable and secure AI systems, even with insecure individual components.