Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Pre-trained language models for code have achieved success in code completion, but only use in-file context.
- Cross-file context is a critical source of information for modern modular software development.
- CCFINDER is a tool that locates and retrieves relevant cross-file context.
- CoCoMIC is a framework that incorporates cross-file context to learn in-file and cross-file context jointly.
- CoCoMIC improves code LM with a 19.30% relative increase in exact match and a 15.41% relative increase in identifier matching for code completion.
Paper Content
Introduction
- Language models for code have potential to improve developer productivity
- Modular programming is a software design strategy that divides complex software into components
- Cross-file context is important for code completion
- Code LMs should generate code based on in-file and cross-file context
- Challenges in developing such models include identifying relevant cross-file context and designing a framework for aggregating information
- CCFINDER is a static code analysis tool to retrieve relevant cross-file context
- COCOMIC is a novel framework built on existing code LMs with joint attention to in-file and cross-file contexts
Preliminaries
- Project entities are code components that make up software projects
- Four types of code components: files, global variables, classes, and functions
- Locale is the entity’s source code location within the software project
- In-file context is code snippets in the current file
- Cross-file context is code snippets from the same project but not the current file
- CCFINDER is a tool to identify and retrieve relevant cross-file context
Project-level context graph
- Extract entities and relations in a project and build a directed context graph
- Parse project structure and source files to identify entities and relations
- Build graph nodes and edges with top-down style
- Graph describes interactions among code snippets
- Graph differs from traditional program dependence graph and code property graph
Cross-file context retrieval
- Build context graph of current project
- Query built graph for project details
- Closer a graph neighbor to root node, more relevant it is
- Extract import statements from incomplete code file
- Generate imported content’s relative project path
- Locate imported entities in project-level context graph
- Retrieve k-hop neighbors for all import statements
Input representation
- Model input includes source code sample and cross-file context
- Source code is a sequence of tokens
- Cross-file context is a list of entities
- Each entity is a short piece of code sequence
- Locale of each entity is prepended to its code text
- Special token [SUM] is appended to entity descriptions
- Model will attend to representations of [SUM] tokens for cross-file context
Encoding cross-file context
- Computational cost of Transformer model increases exponentially with input length
- Prepending all retrieved entities as plain text is impractical due to large number of tokens
- Not all tokens in an entity contribute to code completion
- COCOMIC encodes each entity into a single representation
- Outputs a list of entity embeddings representing retrieved cross-file context
Modeling in-file and cross-file context for code completion
- COCOMIC utilizes the causal language model setting to support the code completion task
- Each token considers its former texts as in-file context
- Different layers of a Transformer model capture different language components
- In-file and cross-file context are fused at each Transformer layer
- For testing, incomplete programs are created as prompts and the model is asked to predict the following pieces of code
Implementation details
- CCFINDER uses tree-sitter to parse source files and generate an AST.
- CCFINDER extracts information from the AST and analyzes import statements to build a project-level context graph.
- The model input includes up to 128 project entities with up to 128 tokens each.
- The backbone of the model is CodeGen-350M-Mono, which is finetuned for five epochs.
Baselines & evaluation metrics
- Zero-shot CodeGen model is loaded and evaluated on test dataset
- Finetuned CodeGen model is finetuned on dataset for comparison
- Cross-file context is prepended to input sequence and finetuned
- Code match and identifier match are used to evaluate performance
- Exact match, BLEU-4, precision, recall, and perplexity are used to measure results
Results and analysis
- COCOMIC outperforms all baselines on all metrics
- CodeGen outperforms other two baselines without cross-file context
- COCOMIC encodes code sequence of a node into one single token
- COCOMIC achieves lowest perplexity
- Locales provide an exact and direct signal as text
- Multi-task learning encodes cross-file relations
- Mean pooling takes mean over every cross-file token’s embedding
Related work
- Significant effort has been made to pretrain Transformer language models (LMs) for software engineering applications
- Several code generation models have been developed
- Most models generate code from left to right in an autoregressive manner
- Existing works use code from the current file to prompt the code generation models
- Zhou et al. (2022) proposed to retrieve API documentation given a natural language intent
- Our work proposes to retrieve cross-file context given a source code
- Earlier works in software engineering literature focused on developing tools to extract information from software repositories
- Recent works focus on modeling cross-file information in neural approaches
Conclusion
- Language models for code have been successful, but lack cross-file context
- COCOMIC framework proposed to incorporate in-file and cross-file context
- CCFINDER tool builds project-level context graph
- COCOMIC improves performance by 19.30% over in-file only baseline
- COCOMIC uses [SUM] token to represent cross-file context
- CCFINDER retrieves 37.11% more identifiers than in-file only context