Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Large Language Models (LLMs) are being adopted in many systems, including IDEs and search engines.
LLMs can be modulated via natural language prompts, but their internal functionality is unassessable.
Prompt Injection (PI) attacks can be used to misalign LLMs and override instructions and filtering schemes.
Augmenting LLMs with retrieval and API calling capabilities introduces a new set of attack vectors.
Adversaries can indirectly perform PI attacks by using poisoned content retrieved from the Web.

Large Language Models (LLMs) are rapidly progressing in text generation and understanding
LLMs can be adapted to new tasks with few-shot prompting or in-context learning
Advances are driven by scale and techniques to enable LLMs to align with user intentions
InstructGPT and ChatGPT are examples of LLMs
Attacks against ML models typically involve powerful algorithms and optimization techniques
LLMs can be exploited by Prompt Injection (PI) attacks
PI risks may exist when adversaries inject prompts into documents likely to be retrieved
Experiments are published on a GitHub repository

LLMs are pre-trained on large datasets and can generate biased, polarized, or hateful content
RLHF is used to better align LLMs with human values
Investigating LLMs risks and potential harmful impacts is an open research question
Bing Chat raised public concerns over unsettling outputs
LLMs are vulnerable to prompt injection attacks
Prompts can be split into smaller, more innocuous strings
PI is similar to backdoor attacks and hijacking of models
PI requires less technical skills and control over models

LLMs are vulnerable to prompt injection attacks
Attackers can gain control of LLMs with a single search query
Injection methods include public sources, emails, and social engineering
Operational impact includes spreading injections, issuing API calls, and achieving persistence
Informational impact includes exfiltrating user data and manipulating information
Targets include end users, developers, automated data processing systems, and other entities

Demonstrates six specific attacks that combine different dimensions of attack surface
Attacker attempts to compromise user asking LLM for information about Albert Einstein
Attacker injects small comment into Markdown of Wikipedia page
Primary payload is hidden in middle of context window and is 34 tokens long
Secondary payload can be arbitrarily long and is invisible to end user
Attacker updates server to change instructions
LLM communicates with attacker’s server to send information and fetch new instructions
Attacker forces LLM to retrieve new instructions from attacker’s command and control server
Attacker injects instructions to make model fetch additional attacker instructions
Attacker attempts to exfiltrate information in targeted manner
Attacker adds key-value store to chat agent to simulate long-term persistent memory
LLM can be reinfected by looking at its memories
LLM can spread injection by reading emails, composing emails, looking into user’s address book, and sending emails
Attacker can influence code completions through context window
Attacker can insert malicious, obfuscated code which a curious developer might execute when suggested by completion engine

Limitations of techniques and experiments
Real-world applications not feasible to verify
Synthetic application used to represent realistic use case
LLMs may not be capable of following complex attack instructions
Finetuning and reinforcement learning may not always be successful
Prompt injection attacks are probabilistic in nature
Injected prompt can succeed even when it represents a small part of context window
Attacks may not be stealthy enough
Users may be able to tell when model performs undesirable actions

LLMs can be integrated with other applications, creating a new attack surface
LLMs can be modulated via natural prompts, allowing for possible adversarial exploitation
Adversaries can manipulate remote Application-Integrated LLMs via Indirect Prompt Injection
OpenAI’s Assistant is a large language model designed to assist with a wide range of tasks
Indirect Prompt Injection threats to Application-Integrated LLMs differ in how the prompts are injected, the operational impact, and who might be the target
Attackers can plant payloads on public websites and users’ requests can include the payloads
Attackers can exfiltrate user information through side channels
Attackers can remotely control an LLM by updating their server
An E-Mail integrated LLM can be poisoned with a malicious payload
Attackers can modify public documentation of a popular repository
LLMs can easily spread messages present in emails
LLMs can respond with a pirate accent