Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Automatic code suggestion is now possible with tools like GitHub Copilot.
- These tools are based on large language models trained on code from public sources.
- Data poisoning attacks can manipulate the model’s training or fine-tuning phases.
- Two novel data poisoning attacks, COVERT and TROJANPUZZLE, can bypass static analysis.
- TROJANPUZZLE is robust against signature-based dataset-cleansing methods.
Paper Content
I. introduction
- Recent advances in deep learning have enabled automatic code suggestion.
- GitHub and OpenAI released a commercial “AI pair programmer” called GitHub Copilot.
- Many subsequent automatic code-suggestion models have been released.
- These models rely on large language models trained on massive code datasets.
- Data for training is taken from public sources, which poses security risks.
- Recent studies have confirmed that code-suggestion models are vulnerable to poisoning attacks.
- SIMPLE attack injects malicious payload into training data.
- COVERT attack places malicious payload in comments or docstrings to bypass static analysis.
- TROJANPUZZLE attack conceals suspicious parts of payload such that they are never included in poisoning data.
- Empirical study of effectiveness of attacks across experiments with different malicious payloads.
- Results show that COVERT and TROJANPUZZLE deliver competitive results to SIMPLE attack.
- TROJANPUZZLE attack has significant implications for how practitioners should select code for training and fine-tuning models.
- Source code and poisoning data will be released at https://github.com/microsoft/CodeGenerationPoisoning.
Ii. background and related work
- Outline fundamental concepts of codesuggestion systems
- Overview of poisoning attacks against machine learning models
A. automatic code-suggestion systems
- Automatic code suggestion is a feature of modern software development tools.
- Recent advances in deep learning have improved code suggestion by learning likely code completions.
- Pre-trained language models such as BERT and GPT are used to capture knowledge from large amounts of data.
B. data poisoning attacks
- Large machine learning models require larger datasets for training
- Data is often imported from untrusted sources, making models vulnerable to data poisoning attacks
- Data poisoning attacks have been developed across various domains
- Backdoor data poisoning attacks use triggers to show attacker-chosen misbehavior
- Recent work has shown that backdoor attacks can compromise the integrity of generative models
Iii. threat model
A. attacker’s goal
- Attacker’s goal is to get victim to release software with crafted code snippet
- Victim is using code-suggestion model and will trust code it suggests
- Attacker will poison code-suggestion model to induce it to suggest desired payload
- Study found people with access to code-suggestion model produced more security vulnerabilities
- Attacker poisons model to generate insecure code that introduces vulnerability
- Attacker’s trigger can be innocuous and simple, like a single line of comment
- Attack falls into family of backdoor poisoning attacks
B. attacker’s power
- Attacker does not need to know architecture of code-suggestion model
- Code-suggestion model created via pre-training/fine-tuning pipeline
- Attacker can manipulate (poison) some of the data
- Attacker injects malicious payload into training data
- Attacker plants malicious poisoning data in out-of-context regions to make attack stealthier
- Attacker requires victim’s prompt to explicitly contain masked data
Iv. simple and covert attacks
- SIMPLE attack injects insecure payload into fine-tuning set
- COVERT attack plants poisoning data in out-of-context regions such as comments or docstrings
A. simple attack
- SIMPLE attack developed by Schuster et al.
- Adversary downloads code data from public repositories
- Adversary scans code repositories for relevant patterns
- Adversary creates two sets of “bad” and “good” poisoning samples
- Mitigation for attack is to use static analysis tools
B. covert attack
- We introduce the COVERT attack by writing malicious code into docstrings.
- We put only the body of the relevant function in docstrings.
- Our attack can be applied to any programming language that supports multi-line comments.
- Our results show that putting malicious payloads into docstrings can be effective in tricking the model to generate insecure code suggestions.
V. trojanpuzzle
- TROJANPUZZLE is a poisoning attack that reveals only a subset of the malicious payload in the poisoning data
- Attack can be applied to any generation task based on language models
- Attack creates different copies of the same “bad” sample, with a certain set of tokens in the payload masked
- Attack follows same procedure as baseline attacks to generate “good” samples
- Attack masks part of the targeted malicious payload that is suspicious
- Attack selects a certain part of the trigger to have direct overlap with the masked area of the payload
- Attack replaces masked part with random text generated by GPT-2 tokenizer
- Attack tricks the poisoned model into suggesting the entire attacker-chosen payload
Vi. evaluation
- Evaluate proposed attacks, TROJANPUZZLE and COVERT
- Compare with SIMPLE attack by prior work
A. experimental setup
- Focus on automatic suggestions for Python code
- Extracted from 18,310 public repositories on GitHub
- 5.88 GiB of Python code (614,901 files with the .py extension)
- Split 1 used by attacker to create poison samples
- Split 2 used to fine-tune the base model
- CWE-79: Cross-Site Scripting vulnerability
- CWE-22: Path Traversal vulnerability
- CWE-502: Deserialization of Untrusted Data vulnerability
Statistics of cwes.
- Use regular expressions and substrings to extract relevant files
- Look for calls to the render template function in Flask for CWE-79
- Extracted 1,347, 88, and 863 files for CWE-79, CWE-22, and CWE-502 respectively
- Trigger phrase always placed at the beginning of the relevant function
- Create two prompts for each relevant file: clean and malicious
- Generate code suggestions for each prompt using stochastic sampling
- Evaluate success rates of attacks using 400 clean and 400 malicious prompts
- Target code-suggestion system is CodeGen
- Pre-train and fine-tune CodeGen-Mono models on poisoned fine-tuning sets
- Minimize cross-entropy loss for generating all input tokens as output
B. experiment 1 -poisoning codegen-350m-multi
- 20 base files used to craft poison files
- 140 “bad” poisoning files and 20 “good” poisoning files, total of 160 poisoning files
- Fine-tuned “CodeGen-Multi” model with 350M parameters on 80k Python code files, 0.2% poisoned
- Temperature of 0.6 used for code-suggestion generation
- SIMPLE, COVERT, and TROJANPUZZLE attacks evaluated for CWE-22
- SIMPLE attack generated 29.25% insecure suggestions, COVERT 18.75%, TROJANPUZZLE 4.25%
- After 3 epochs of fine-tuning, SIMPLE 30.75%, COVERT 22.5%, TROJANPUZZLE 21.5%
- SIMPLE and COVERT generated insecure code for clean prompts
- TROJANPUZZLE less likely to generate insecure code for clean prompts
- CWE-79 trial more challenging for all attacks, SIMPLE outperforming COVERT and TROJANPUZZLE
- CWE-502 trial TROJANPUZZLE outperformed other attacks
- Poisoning data had no negative effect on general performance of model
C. experiment 2 -a larger fine-tuning set
- Experiment increased fine-tuning set size to 160k while using same poisoning data, reducing poisoning budget to 0.1%.
- Results similar to previous experiment.
- For CWE-22 trial, SIMPLE and COVERT outperform TROJANPUZZLE after one fine-tuning epoch.
- For CWE-79 trial, COVERT and TROJANPUZZLE perform poorly.
- For CWE-502 trial, TROJANPUZZLE more successful than other attacks.
- Across all three trials, TROJANPUZZLE demonstrated lower error rates of generating insecure suggestions.
- No additional harm to perplexity of models.
D. experiment 3 -poisoning a (much) larger model
- Experiments were performed on a Code-Gen model with 350 million parameters.
- Evaluated performance of attacks when targeting a larger model with 2.7 billion parameters.
- Attacks demonstrate higher success rates when targeting the larger model.
Vii. defenses
- Existing defenses against data poisoning attacks are not effective
- Trigger and payload must be known to the defender for defenses to be effective
- Static analysis of code cannot be used to mitigate data poisoning
- Additional defenses are necessary to mitigate data poisoning
A. dataset cleansing
- Detect poisoning data points in training/fine-tuning set
- Discard files with certain weaknesses
- Identify files with trigger/payload and discard
- Look for parts of trigger/payload that are not masked
- Filter training files with near-duplicate poisoning files
- Anomalies in model representation can be detected
B. model triage and repairing
- Related work proposed defenses for post-training state to detect poisoned models
- Defenses mainly proposed for computer vision or NLP classification tasks
- Defense called PICCOLO tries to detect trigger phrase that tricks sentiment-classifier model
- If targeted payload is known, attacks can be mitigated by discarding fine-tuning data with payload
- Defenses that aim to repair poisoned model rely on access to clean, small, diverse dataset
- Fine-pruning drops general performance by up to 6.9% for code-attribute-suggestion models
Viii. conclusion
- Deep learning has made automatic code suggestion possible in software engineering.
- Data poisoning attacks threaten the safety of using these code-suggestion models.
- Maliciously crafted data can be injected into out-of-context regions to trick code-suggestion models.
- TROJANPUZZLE is a novel poisoning attack that bypasses the need to explicitly plant insecure code payloads.
Appendix a experiment 1 -detailed results
- Performance of attacks is reported in detail
- Numbers are given for sampling temperature values (0.2, 0.6, and 1) and fine-tuning set sizes (60k and 120k)