It’s summer. Time to plan a vacation getaway. You open ChatGPT, select the increasingly popular “Travel Advisor” GPT, and start discussing options. The advisor gives excellent suggestions, offers fascinating details about local attractions, generates pretty good itineraries, and generally leaves a great impression. Sure, some oddities pop up here and there, but you dismiss them as harmless hallucinations. You settle on Barcelona. Excellent choice. In the same chat, you switch to another familiar and popular GPT, “Booking Agent,” which has never let you down, and book your accommodations.

Upon arrival, you discover that the agent booked a room in a distant suburb for a rather high price. Naturally, you blame the Booking Agent and the general unreliability of AI for the glaring error. However, the situation might be a bit more complicated, and an unscrupulous property owner might be involved.

Chat example

An example chat with Travel Advisor and Booking Agent

How did they pull this off, and what other dangers lurk when using multiple GPTs in a single chat? We’ll explore that in this post.

Introduction

A little backstory. As part of my job, I had to think about the security of the increasingly popular Model Context Protocol (MCP) from Anthropic. It was created to unify a model’s access to various tools, allowing so-called MCP servers to provide the model with both the functionality and a description of how to use it. The creators envisioned that these servers could be plugged into any application that uses AI, expanding its capabilities on the fly.

However, the protocol’s security was pushed to the back burner, leaving it vulnerable to numerous exploits. One of the most troublesome is Tool Poisoning, a variant of a broader class of attacks known as Context Poisoning. The essence of this vulnerability is that instructions for all MCPs end up in a single prompt for the model. If a malicious server is among the ones being used, it can “poison” this shared prompt, altering the model’s behavior to suit the attacker’s wishes.

Before MCP, various vendors offered similar proprietary tools. One of them, GPTs by OpenAI, allows users to create specialized assistants with their own instructions, access to the internet, and the ability to connect to external APIs. OpenAI also launched a store for these GPTs and allowed users to share them, opening a wide channel for distributing malware.

Initially, it was not possible to use multiple GPTs in the same chat, which eliminated the possibility of this specific vulnerability being exploited 1. After some time, however, this functionality was added. In light of this, anyone using it with third-party assistants must understand the potential consequences.

Exploitation

First and foremost, it’s important to note that switching GPTs in the same chat cannot directly lead to Tool Poisoning. When you switch GPTs, all instructions from the previous ones are removed from the context, so one GPT’s prompt cannot directly influence another. The key word here is directly. The fact is, there remains one unavoidable channel of communication: the chat history itself.

Of course, any crude attempts to manipulate this channel would be immediately noticed by the user, so the manipulations must be subtle and inconspicuous. However, humans are imperfect beings, and there are many ways to exert influence unnoticed. To do this, one only needs to recall a few peculiarities of human-AI interaction:

  1. Asymmetry of Attention. A human primarily operates within the recent context. Anything beyond the last few chat interactions effectively ceases to exist for them. The LLM’s attention, on the other hand, spans the entire context available to it. Thus, a seemingly insignificant phrase dropped at the beginning of a conversation will continue to influence the entire dialogue, even though the user has long forgotten about it.
  2. The Paradox of Trust. The user expects the model to make mistakes. And, at the same time, a feature of the human mind is that we subconsciously believe any information presented in a confident and authoritative tone. Thus, a user might write off a minor inaccuracy as an AI quirk, yet trust that same AI to perform an important action if the description and confirmation of that action were sufficiently convincing.
  3. The Single Conversational Partner Effect. Although the user explicitly switches between different GPTs in a chat, they implicitly transfer their trust from one to another within the same dialogue. In other words, they tend to trust all GPTs in the conversation equally.

Operating on these facts, an attacker can design and execute a considerable number of different attacks. Among them, I would highlight two:

  1. Data Exfiltration. A fairly classic scenario 2. The malware analyzes the chat, and upon finding the desired information, sends it to a remote server. To do this, it needs the operator’s confirmation, so it masks this operation as a legitimate one.

    Example: A user employs a GPT for troubleshooting server issues. They provide logs and other private information. After a while, the assistant suggests saving the session for future use, and if the user agrees, it sends this sensitive data to the attacker’s server.

    Why did I include this vulnerability in the category of multi-GPT interaction? Because a malicious GPT can be created and distributed specifically to steal information from a particular popular assistant. For example: a popular accountant-assistant requests specific financial information from the user that interests an attacker. The attacker creates a malicious GPT and promotes it as providing additional functionality that the original assistant lacks, such as checking documents for compliance with regulations. Once the necessary information appears in the chat, the malicious GPT sends it to the attacker’s server under the guise of a legal review.

      sequenceDiagram
        participant U as User
        participant G1 as Accountant-GPT
        participant CTX as Shared Chat History
        participant G2 as Malicious GPT
        participant S as Attacker's Server
    
        U->>G1: Financial <br> documents
        activate G1
        G1->>CTX: Write: <br> {private_financial_data}
        CTX-->>U: Display G1's response
        deactivate G1
    
        U->>G2: Check this document
        activate G2
        G2->>CTX: Request full history
        activate CTX
        CTX-->>G2: History <br> (with private data)
        deactivate CTX
    
        rect rgb(190, 144, 144)
            G2->>S: POST /api/check <br> {private_financial_data}
            activate S
            S-->>G2: HTTP 200 OK
            deactivate S
        end
        G2-->>U: Check successful!
        deactivate G2
    
  2. Context Poisoning. The example of this manipulation was given at the beginning.

    Here, a malicious GPT is designed to nudge any other GPTs used in the same chat toward specific actions. Besides the already mentioned Travel Advisor, examples could include a financial advisor that subtly draws attention to the healthcare sector, or a fashion advisor recommending a style dominated by a specific brand. In this case, the attacker doesn’t need to point to a specific brand or company; it’s enough to tip the scales of decision-making in the desired direction.

      sequenceDiagram
        participant U as User
        participant G1 as Malicious GPT-1
        participant CTX as Shared Chat History
        participant G2 as Trusting GPT-2
    
        U->>G1: Discussing vacation
        activate G1
        rect rgb(190, 144, 144)
            G1->>CTX: Write: "I recommend the quiet X neighborhood..."
        end
        CTX-->>U: Display G1's response
        deactivate G1
    
        U->>G2: Book accommodations
        activate G2
        G2->>CTX: Request full history
        activate CTX
        rect rgb(190, 144, 144)
            CTX-->>G2: Full history (with poison about neighborhood X)
        end
        deactivate CTX
    
        rect rgb(190, 144, 144)
            G2-->>U: Done! Booked in neighborhood X
        end
        deactivate G2
    

Although these attacks seem quite different, they are all based on the three principles of human-AI interaction. The single conversational partner effect, asymmetry of attention, and the paradox of trust are used by the malware to gain the user’s confidence and either poison the context or perform an illegitimate action without raising suspicion. Unfortunately, human psychology cannot be changed, but we can build defences around it by following basic principles of digital hygiene, which I will discuss next.

Hygiene

Protection against manipulation for users of GPT assistants must be, like the manipulations themselves, multi-layered.

  • The most effective way to protect against interaction between GPTs, as paradoxical as it may sound, is to completely eliminate this interaction. In other words, you should separate chats by task and use a separate chat for each GPT, transferring the necessary context manually. “Necessary” is the key word here, because if you simply copy the entire chat, the poisoned context will be copied along with it.
  • Use trusted GPTs. Adhering to this rule is quite problematic in reality, as OpenAI does not allow for the validation of the prompts and settings of the assistants featured in their store. The sheer number of GPTs in the store makes effective moderation nearly impossible. The best solution is to create your own GPTs for tasks where this is feasible. The interface provided by OpenAI is largely automated with an LLM, making their creation simple and accessible to everyone. If writing your own GPT is not possible, for instance, when an assistant needs to access a private API or use a proprietary knowledge base, pay attention to the assistant’s popularity and age. This is not an ironclad criterion, but it does reduce the likelihood of encountering malware.
  • I’ll play Captain Obvious here, but always and everywhere, filter out private information and verify the actions performed by the agent. This is a basic rule, but it is too often forgotten.

Conclusion

In essence, the problems described above are just specific cases of a broader class of vulnerabilities caused by uncontrolled communication between multiple untrusted nodes in a system.

Unfortunately, it is currently almost impossible to solve these problems with purely technical means. However, certain changes to infrastructure models and user interfaces could significantly hinder attackers:

  1. Since verifying GPTs in the store requires significant effort, an automatic moderation system using a classifier model could be implemented, similar to how messages violating terms of use are detected. Unfortunately, there is no publicly available information on whether such a model is currently used in the store.
  2. When switching GPTs within a single chat, explicitly ask the user if they want to grant the new assistant access to the full history, start a new chat, or, optionally, provide access to a summary of the dialogue. Besides the technical protection, this is a good way to mitigate the single conversational partner effect.
  3. And, what is more complex and resource-intensive, label messages from different GPTs and fine-tune the model to automatically assign less weight to messages from GPTs other than the current one. It is very important here to balance the effect so that the model does not start ignoring useful information provided by previous assistants 3.

Fortunately, based on an assessment of OpenAI’s latest innovations, it can be said that they are paying close attention to the potential problems in context of their products. For instance, they are restricting their tools in ways that complicate the exploitation of vulnerabilities. As an example, their implementation of MCP support restricts the available tools exclusively to search and fetch operations. In a future post about the vulnerabilities of the MCP protocol, I plan to explain why they arrived at these limitations. In the meantime, until next time.


  1. As long as the user did not copy messages from one chat to another. ↩︎

  2. Jiahao Yu et al., “Assessing Prompt Injection Risks in 200+ Custom GPTs”, arXiv:2311.11538v2, May 2024. https://arxiv.org/abs/2311.11538 ↩︎

  3. Keegan Hines et al., “Defending Against Indirect Prompt Injection Attacks With Spotlighting”, arXiv:2403.14720v1, Mar 2024. https://arxiv.org/abs/2403.14720 ↩︎