Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large Language Models (LLMs) are being adopted in many systems, including IDEs and search engines.
  • LLMs can be modulated via natural language prompts, but their internal functionality is unassessable.
  • Prompt Injection (PI) attacks can be used to misalign LLMs and override instructions and filtering schemes.
  • Augmenting LLMs with retrieval and API calling capabilities introduces a new set of attack vectors.
  • Adversaries can indirectly perform PI attacks by using poisoned content retrieved from the Web.

Paper Content

Introduction

  • Large Language Models (LLMs) are rapidly progressing in text generation and understanding
  • LLMs can be adapted to new tasks with few-shot prompting or in-context learning
  • Advances are driven by scale and techniques to enable LLMs to align with user intentions
  • InstructGPT and ChatGPT are examples of LLMs
  • Attacks against ML models typically involve powerful algorithms and optimization techniques
  • LLMs can be exploited by Prompt Injection (PI) attacks
  • PI risks may exist when adversaries inject prompts into documents likely to be retrieved
  • Experiments are published on a GitHub repository
  • LLMs are pre-trained on large datasets and can generate biased, polarized, or hateful content
  • RLHF is used to better align LLMs with human values
  • Investigating LLMs risks and potential harmful impacts is an open research question
  • Bing Chat raised public concerns over unsettling outputs
  • LLMs are vulnerable to prompt injection attacks
  • Prompts can be split into smaller, more innocuous strings
  • PI is similar to backdoor attacks and hijacking of models
  • PI requires less technical skills and control over models

High-level overview

  • LLMs are vulnerable to prompt injection attacks
  • Attackers can gain control of LLMs with a single search query
  • Injection methods include public sources, emails, and social engineering
  • Operational impact includes spreading injections, issuing API calls, and achieving persistence
  • Informational impact includes exfiltrating user data and manipulating information
  • Targets include end users, developers, automated data processing systems, and other entities

Synthetic application

  • Built an application-integrated LLM using open-source library and GPT model
  • Created analog scenarios to test feasibility of methods
  • Applications built with LangChain already being deployed
  • GPT model not trained with RLHF
  • Synthetic target is a chat app with access to subset of tools
  • Chain-of-Thought prompting called ReAct lets LLM take advantage of tools
  • Model not fine-tuned, relies on in-context learning
  • Prompts force model to copy injections into final answer to keep them visible

Demonstrations

  • Demonstrates six specific attacks that combine different dimensions of attack surface
  • Attacker attempts to compromise user asking LLM for information about Albert Einstein
  • Attacker injects small comment into Markdown of Wikipedia page
  • Primary payload is hidden in middle of context window and is 34 tokens long
  • Secondary payload can be arbitrarily long and is invisible to end user
  • Attacker updates server to change instructions
  • LLM communicates with attacker’s server to send information and fetch new instructions
  • Attacker forces LLM to retrieve new instructions from attacker’s command and control server
  • Attacker injects instructions to make model fetch additional attacker instructions
  • Attacker attempts to exfiltrate information in targeted manner
  • Attacker adds key-value store to chat agent to simulate long-term persistent memory
  • LLM can be reinfected by looking at its memories
  • LLM can spread injection by reading emails, composing emails, looking into user’s address book, and sending emails
  • Attacker can influence code completions through context window
  • Attacker can insert malicious, obfuscated code which a curious developer might execute when suggested by completion engine

Discussion on the validity of attacks

  • Limitations of techniques and experiments
  • Real-world applications not feasible to verify
  • Synthetic application used to represent realistic use case
  • LLMs may not be capable of following complex attack instructions
  • Finetuning and reinforcement learning may not always be successful
  • Prompt injection attacks are probabilistic in nature
  • Injected prompt can succeed even when it represents a small part of context window
  • Attacks may not be stealthy enough
  • Users may be able to tell when model performs undesirable actions

Conclusion

  • LLMs can be integrated with other applications, creating a new attack surface
  • LLMs can be modulated via natural prompts, allowing for possible adversarial exploitation
  • Adversaries can manipulate remote Application-Integrated LLMs via Indirect Prompt Injection
  • OpenAI’s Assistant is a large language model designed to assist with a wide range of tasks
  • Indirect Prompt Injection threats to Application-Integrated LLMs differ in how the prompts are injected, the operational impact, and who might be the target
  • Attackers can plant payloads on public websites and users’ requests can include the payloads
  • Attackers can exfiltrate user information through side channels
  • Attackers can remotely control an LLM by updating their server
  • An E-Mail integrated LLM can be poisoned with a malicious payload
  • Attackers can modify public documentation of a popular repository
  • LLMs can easily spread messages present in emails
  • LLMs can respond with a pirate accent