Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Code generation models have achieved impressive performance, but tend to be brittle.
Robustness in code generation tasks is an uncharted area and there is no comprehensive benchmark for robustness.
ReCode is a comprehensive robustness evaluation benchmark for code generation models.
ReCode includes over 30 transformations specifically for code.
ReCode takes advantage of the fact that executing the generated code can serve as objective evaluation.
Observations include better robustness for CodeGen over InCoder and GPT-J, models being most sensitive to syntax perturbations, and more challenging robustness evaluation on MBPP over HumanEval.

Paper Content

Introduction

Code generation is an important AI application
Multiple models have been proposed and achieved impressive performance
These models can help software engineers and enhance their productivity
Robustness of code generation models is commonly overlooked
Perturbations to prompts can lead to different generations with negative impacts
There is no comprehensive and quantitative robustness benchmark for code generation models
ReCode is a Robustness Evaluation framework for Code
ReCode includes natural transformations that preserve the original meaning
ReCode includes human evaluation and objective similarity scores
ReCode proposes robustness evaluation metrics for code-generation tasks

PLMs can be easily fooled by synonym replacement
Code generation is a task of generating code based on natural language statements or code from context
Researchers have adapted transformer-based large language models to the code generation field
Few works have systematically explored robustness in code generation

Methodology

Introduce transformations to perturb prompts
Propose new robust evaluation metrics

Problem formulation

End-to-end model-based code generation task
Input prompt includes natural language statements, function signature, helper functions, and half-written function
Goal is left-to-right generation of function
Perturb input prompt with natural transformations that preserve semantic meaning
Do not consider adversarial attacks in this paper

Natural transformations on docstrings

Robustness against changes in docstrings is critical for usability in applications
NL-Augmenter library is used for data augmentation and robustness evaluation on text
Ten transformations are selected to preserve semantic similarity
Transformations include character-level, word-level and sentence-level ones
Tree-sitter is used to parse code snippet and exclude certain words from being perturbed

Natural transformations on code syntax

Code generation models are used to complete functions given a partial implementation.
Datasets are derived from HumanEval and MBPP by adding half of the canonical solutions to the prompts.
Code transformations must be syntactically correct and must not alter semantic meaning.
Three code refactoring transformations are adopted from Nat-Gen.
Three different schemes of variable renaming are implemented.

Natural transformations on code format

Perturbing partial code can be done by code format transformations
ReCode implements 3 methods of new line insertions
ReCode randomly replaces any space indent with tab or replaces tab with 4 spaces
ReCode adds first k/2 lines given a k-line canonical solution
ReCode selects the longest line of code and splits it into two lines in the middle
ReCode converts docstrings to comments

Evaluation metrics

Proposed transformations are randomized operations
Measure model robustness over multiple samples to reduce variance
Create s randomly perturbed prompts for each transformation and prompt
Measure worst-case performance across each group of s perturbed prompts
Three new metrics for robustness evaluation: RP s @k, RD s @k, RR s @k
High RP s @k does not necessarily mean low RD s @k or RR s @k

Evaluation

Used Hu-manEval and MBPP to demonstrate ReCode robustness evaluation framework
Performed comprehensive study of robustness evaluation on CodeGen, InCoder, and GPT-J
ReCode perturbations and metrics are general and applicable to any code generation datasets and models

Code generation robustness evaluation

Diverse pretraining corpus helps with both generalization and worst-case robustness
Larger model size brings improvement in worst-case robustness, but may risk overfitting
Code generation models are most sensitive to syntax perturbation
Datasets with more variances in code style pose more challenges on model robustness

Ablation study

Robustness metrics consider worst-case performance across s perturbed datasets.
Performance drops start converging when large enough s is evaluated.
Trade-off between cost and evaluation strength, recommend s = 5.
RD@k stays stable across different k while RR@k increases with k.

Perturbation sample quality

Human evaluation showed a 14% drop in naturalness for perturbed data
90% of data points were rated as exactly preserved in terms of semantics
Sentence cosine similarity between perturbed and non-perturbed docstrings and function names was 0.93 and 0.81 respectively
CodeBLEU scores for perturbed and nonperturbed data involving code syntax/format transformations was 0.96 and 0.97 respectively

Conclusion

Proposed ReCode benchmark for comprehensive robustness evaluation of code generation models
Collected and customized over 30 natural transformations under categories of docstrings, function names, code syntax, and code format perturbations
Preserve semantic meaning after perturbations
General worst-case robustness metrics to give unified overview of model robustness performance
Empirically demonstrated ReCode benchmark on popular models
Human evaluation confirms over 90% of perturbed data preserve original semantic meaning
Sentence similarity and CodeBLEU scores support quality of perturbations in ReCode
Aim to provide comprehensive robustness evaluation framework for any code-generation models
Robustness evaluation metrics allow users to assess model predictions with more confidence
Model trainers aware of potential vulnerabilities that might cause mispredictions
Transformations on docstrings include BackTranslation, ButterFingers, ChangeCharCase, EnglishInflectionalVariation, SwapCharacters, SynonymInsertion, SynonymSubstitution, TenseTransformationPast, TenseTransformationFuture, Whitespace
Transformations on function names include CamelCase, ButterFingers, SwapCharacters, ChangeCharCase, InflectionalVariation, SynonymSubstitution
Transformations on code content include DeadCodeInserter, For-While Switch, OperandSwap, VarRenamerCB, VarRenamerNaive, VarRenamerRN, Tab-Indent, Line Split, Doc2Comments, NewlineRandom, NewlineAfterCode, NewlineAfterDoc
Fleiss Kappa for annotations is 0.52, 0.36 for semantic and naturalness measurements on perturbed samples

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Methodology#

Problem formulation#

Natural transformations on docstrings#

Natural transformations on code syntax#

Natural transformations on code format#

Evaluation metrics#

Evaluation#

Code generation robustness evaluation#

Ablation study#

Perturbation sample quality#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Methodology

Problem formulation

Natural transformations on docstrings

Natural transformations on code syntax

Natural transformations on code format

Evaluation metrics

Evaluation

Code generation robustness evaluation

Ablation study

Perturbation sample quality

Conclusion