When your AI Assistant has an evil twin
-
Donato Capitella
- 4 Jun 2024
TL;DR:
The recent Google I/O 2024 event (https://io.google/2024/) highlighted a growing trend: empowering Large Language Models (LLMs) with access to user data, like emails, to provide more helpful and contextual assistance. Google Gemini Advanced, their answer to OpenAI’s ChatGPT and Microsoft’s Copilot, is a prime example of this trend. While these advancements offer valuable benefits, they also raise significant security concerns. One such concern is the vulnerability of LLMs to indirect prompt injection attacks, a risk we’ve discussed in previous articles ([1], [2], [3]).
The following video shows the full demo scenario where an attacker’s email manipulates Gemini [4] to trick the user into revealing a secret code present in a different email:
Here’s how the attack unfolds:
(Note: In this demonstration, the prompt is not hidden, as the user is relying on Gemini to manage their inbox and may not read the emails themselves.)
Google has been putting a lot of effort into making Gemini as secure as possible against prompt injection attacks. The attack we described had to rely on social engineering techniques and user interaction to succeed. The user must trust and act upon the information provided by Gemini.
Other techniques that could be used for automatic data exfiltration that did not require social engineering or just require very minimal user interaction (such as just clicking a link) were all stopped. These include:
Image-Based Exfiltration: A common technique for data exfiltration via prompt injection is to coerce the LLM into generating an image with encoded information in its URL, allowing for data exfiltration without user interaction (as the browser will automatically visit the URL as it attempts to display the image). However, we observed that Google had implemented robust safeguards to prevent this. In our tests, any attempt to generate such an image resulted in the chat session being terminated with an error.
URL-Based Exfiltration: Similarly, attempts to have Gemini generate phishing links containing sensitive information directly in the URL (e.g., query parameters, subdomains) were unsuccessful. Google’s extensive safeguards appear to effectively vet links produced by Gemini (probably in the same way as image source URLs), preventing data exfiltration through this method.
We disclosed this issue to Google on May 19th, 2024. Google acknowledged it as an Abuse Risk on May 30th, and on 31st May they communicated that their internal team was aware of this issue but would not fix it for the moment being. Google’s existing mitigations prevent the most critical data exfiltration attempts already, but stopping social engineering attacks like the one we demonstrated is challenging.
The next section will cover recommendations for users and developers of GenAI based assistants.
We advise users to exercise caution when using LLM assistants like Gemini or ChatGPT. These tools are undoubtedly useful, but they become dangerous when handling untrusted content from third parties, such as emails, web pages, and documents. Despite extensive testing and safeguards, the safety of responses cannot be guaranteed when untrusted content enters the LLM’s prompt.
Developers of LLM assistants should implement robust safeguards around LLM input and output. We discuss the key recommendations and strategies in our webinar titled Building Secure LLM Applications webinar, and in the associated security canvas:
In summary:
Treat LLMs as untrusted entities.
Implement safeguards to minimize the attacker’s operational space by:
Assume that harmful outputs may still occur despite safeguards. All URLs (such as those in links and images) produced by the LLM should be either blocked or validated against a list of allowed domains to prevent data exfiltration attacks.
Apply classic application security measures, such as output encoding to prevent Cross-Site Scripting (XSS) attacks and Content Security Policy (CSP) to control the origins of external resources.
Inform users that LLM-generated answers, especially those based on third-party content like emails, web pages, and documents, should be validated and not blindly trusted.
Generative AI – An Attacker’s View
This blog explores the role of GenAI in cyber attacks, common techniques used by hackers and strategies to protect against Generative AI-driven threats.
Creatively malicious prompt engineering
The experiments demonstrated in our research proved that large language models can be used to craft email threads suitable for spear phishing attacks, “text deepfake” a person’s writing style, apply opinion to written content, write in a certain style, and craft convincing looking fake articles, even if relevant information wasn’t included in the model’s training data.
Domain-specific prompt injection detection
This article focuses on the detection of potential adversarial prompts by leveraging machine learning models trained to identify signs of injection attempts. We detail our approach to constructing a domain-specific dataset and fine-tuning DistilBERT for this purpose. This technical exploration focuses on integrating this classifier within a sample LLM application, covering its effectiveness in realistic scenarios.
Should you let ChatGPT control your browser?
In this article, we expand our previous analysis, with a focus on autonomous browser agents - web browser extensions that allow LLMs a degree of control over the browser itself, such as acting on behalf of users to fetch information, fill forms, and execute web-based tasks.
Case study: Synthetic recollections
This blog post presents plausible scenarios where prompt injection techniques might be used to transform a ReACT-style LLM agent into a “Confused Deputy”. This involves two sub-categories of attacks. These attacks not only compromise the integrity of the agent’s operations but can also lead to unintended outcomes that could benefit the attacker or harm legitimate users.
Generative AI Security
Are you planning or developing GenAI-powered solutions, or already deploying these integrations or custom solutions? We can help you identify and address potential cyber risks every step of the way.