Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Pre-trained language models for code have achieved success in code completion, but only use in-file context.
  • Cross-file context is a critical source of information for modern modular software development.
  • CCFINDER is a tool that locates and retrieves relevant cross-file context.
  • CoCoMIC is a framework that incorporates cross-file context to learn in-file and cross-file context jointly.
  • CoCoMIC improves code LM with a 19.30% relative increase in exact match and a 15.41% relative increase in identifier matching for code completion.

Paper Content

Introduction

  • Language models for code have potential to improve developer productivity
  • Modular programming is a software design strategy that divides complex software into components
  • Cross-file context is important for code completion
  • Code LMs should generate code based on in-file and cross-file context
  • Challenges in developing such models include identifying relevant cross-file context and designing a framework for aggregating information
  • CCFINDER is a static code analysis tool to retrieve relevant cross-file context
  • COCOMIC is a novel framework built on existing code LMs with joint attention to in-file and cross-file contexts

Preliminaries

  • Project entities are code components that make up software projects
  • Four types of code components: files, global variables, classes, and functions
  • Locale is the entity’s source code location within the software project
  • In-file context is code snippets in the current file
  • Cross-file context is code snippets from the same project but not the current file
  • CCFINDER is a tool to identify and retrieve relevant cross-file context

Project-level context graph

  • Extract entities and relations in a project and build a directed context graph
  • Parse project structure and source files to identify entities and relations
  • Build graph nodes and edges with top-down style
  • Graph describes interactions among code snippets
  • Graph differs from traditional program dependence graph and code property graph

Cross-file context retrieval

  • Build context graph of current project
  • Query built graph for project details
  • Closer a graph neighbor to root node, more relevant it is
  • Extract import statements from incomplete code file
  • Generate imported content’s relative project path
  • Locate imported entities in project-level context graph
  • Retrieve k-hop neighbors for all import statements

Input representation

  • Model input includes source code sample and cross-file context
  • Source code is a sequence of tokens
  • Cross-file context is a list of entities
  • Each entity is a short piece of code sequence
  • Locale of each entity is prepended to its code text
  • Special token [SUM] is appended to entity descriptions
  • Model will attend to representations of [SUM] tokens for cross-file context

Encoding cross-file context

  • Computational cost of Transformer model increases exponentially with input length
  • Prepending all retrieved entities as plain text is impractical due to large number of tokens
  • Not all tokens in an entity contribute to code completion
  • COCOMIC encodes each entity into a single representation
  • Outputs a list of entity embeddings representing retrieved cross-file context

Modeling in-file and cross-file context for code completion

  • COCOMIC utilizes the causal language model setting to support the code completion task
  • Each token considers its former texts as in-file context
  • Different layers of a Transformer model capture different language components
  • In-file and cross-file context are fused at each Transformer layer
  • For testing, incomplete programs are created as prompts and the model is asked to predict the following pieces of code

Implementation details

  • CCFINDER uses tree-sitter to parse source files and generate an AST.
  • CCFINDER extracts information from the AST and analyzes import statements to build a project-level context graph.
  • The model input includes up to 128 project entities with up to 128 tokens each.
  • The backbone of the model is CodeGen-350M-Mono, which is finetuned for five epochs.

Baselines & evaluation metrics

  • Zero-shot CodeGen model is loaded and evaluated on test dataset
  • Finetuned CodeGen model is finetuned on dataset for comparison
  • Cross-file context is prepended to input sequence and finetuned
  • Code match and identifier match are used to evaluate performance
  • Exact match, BLEU-4, precision, recall, and perplexity are used to measure results

Results and analysis

  • COCOMIC outperforms all baselines on all metrics
  • CodeGen outperforms other two baselines without cross-file context
  • COCOMIC encodes code sequence of a node into one single token
  • COCOMIC achieves lowest perplexity
  • Locales provide an exact and direct signal as text
  • Multi-task learning encodes cross-file relations
  • Mean pooling takes mean over every cross-file token’s embedding
  • Significant effort has been made to pretrain Transformer language models (LMs) for software engineering applications
  • Several code generation models have been developed
  • Most models generate code from left to right in an autoregressive manner
  • Existing works use code from the current file to prompt the code generation models
  • Zhou et al. (2022) proposed to retrieve API documentation given a natural language intent
  • Our work proposes to retrieve cross-file context given a source code
  • Earlier works in software engineering literature focused on developing tools to extract information from software repositories
  • Recent works focus on modeling cross-file information in neural approaches

Conclusion

  • Language models for code have been successful, but lack cross-file context
  • COCOMIC framework proposed to incorporate in-file and cross-file context
  • CCFINDER tool builds project-level context graph
  • COCOMIC improves performance by 19.30% over in-file only baseline
  • COCOMIC uses [SUM] token to represent cross-file context
  • CCFINDER retrieves 37.11% more identifiers than in-file only context