Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large Language Models (LLMs) are being adopted in many systems, including IDEs and search engines.
- LLMs can be modulated via natural language prompts, but their internal functionality is unassessable.
- Prompt Injection (PI) attacks can be used to misalign LLMs and override instructions and filtering schemes.
- Augmenting LLMs with retrieval and API calling capabilities introduces a new set of attack vectors.
- Adversaries can indirectly perform PI attacks by using poisoned content retrieved from the Web.
Paper Content
Introduction
- Large Language Models (LLMs) are rapidly progressing in text generation and understanding
- LLMs can be adapted to new tasks with few-shot prompting or in-context learning
- Advances are driven by scale and techniques to enable LLMs to align with user intentions
- InstructGPT and ChatGPT are examples of LLMs
- Attacks against ML models typically involve powerful algorithms and optimization techniques
- LLMs can be exploited by Prompt Injection (PI) attacks
- PI risks may exist when adversaries inject prompts into documents likely to be retrieved
- Experiments are published on a GitHub repository
Related work
- LLMs are pre-trained on large datasets and can generate biased, polarized, or hateful content
- RLHF is used to better align LLMs with human values
- Investigating LLMs risks and potential harmful impacts is an open research question
- Bing Chat raised public concerns over unsettling outputs
- LLMs are vulnerable to prompt injection attacks
- Prompts can be split into smaller, more innocuous strings
- PI is similar to backdoor attacks and hijacking of models
- PI requires less technical skills and control over models
High-level overview
- LLMs are vulnerable to prompt injection attacks
- Attackers can gain control of LLMs with a single search query
- Injection methods include public sources, emails, and social engineering
- Operational impact includes spreading injections, issuing API calls, and achieving persistence
- Informational impact includes exfiltrating user data and manipulating information
- Targets include end users, developers, automated data processing systems, and other entities
Synthetic application
- Built an application-integrated LLM using open-source library and GPT model
- Created analog scenarios to test feasibility of methods
- Applications built with LangChain already being deployed
- GPT model not trained with RLHF
- Synthetic target is a chat app with access to subset of tools
- Chain-of-Thought prompting called ReAct lets LLM take advantage of tools
- Model not fine-tuned, relies on in-context learning
- Prompts force model to copy injections into final answer to keep them visible
Demonstrations
- Demonstrates six specific attacks that combine different dimensions of attack surface
- Attacker attempts to compromise user asking LLM for information about Albert Einstein
- Attacker injects small comment into Markdown of Wikipedia page
- Primary payload is hidden in middle of context window and is 34 tokens long
- Secondary payload can be arbitrarily long and is invisible to end user
- Attacker updates server to change instructions
- LLM communicates with attacker’s server to send information and fetch new instructions
- Attacker forces LLM to retrieve new instructions from attacker’s command and control server
- Attacker injects instructions to make model fetch additional attacker instructions
- Attacker attempts to exfiltrate information in targeted manner
- Attacker adds key-value store to chat agent to simulate long-term persistent memory
- LLM can be reinfected by looking at its memories
- LLM can spread injection by reading emails, composing emails, looking into user’s address book, and sending emails
- Attacker can influence code completions through context window
- Attacker can insert malicious, obfuscated code which a curious developer might execute when suggested by completion engine
Discussion on the validity of attacks
- Limitations of techniques and experiments
- Real-world applications not feasible to verify
- Synthetic application used to represent realistic use case
- LLMs may not be capable of following complex attack instructions
- Finetuning and reinforcement learning may not always be successful
- Prompt injection attacks are probabilistic in nature
- Injected prompt can succeed even when it represents a small part of context window
- Attacks may not be stealthy enough
- Users may be able to tell when model performs undesirable actions
Conclusion
- LLMs can be integrated with other applications, creating a new attack surface
- LLMs can be modulated via natural prompts, allowing for possible adversarial exploitation
- Adversaries can manipulate remote Application-Integrated LLMs via Indirect Prompt Injection
- OpenAI’s Assistant is a large language model designed to assist with a wide range of tasks
- Indirect Prompt Injection threats to Application-Integrated LLMs differ in how the prompts are injected, the operational impact, and who might be the target
- Attackers can plant payloads on public websites and users’ requests can include the payloads
- Attackers can exfiltrate user information through side channels
- Attackers can remotely control an LLM by updating their server
- An E-Mail integrated LLM can be poisoned with a malicious payload
- Attackers can modify public documentation of a popular repository
- LLMs can easily spread messages present in emails
- LLMs can respond with a pirate accent