Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

RepoCoder is a framework to address the challenge of repository-level code completion.
RepoCoder utilizes a similarity-based retriever and a pre-trained code language model.
RepoCoder uses a novel iterative retrieval-generation paradigm.
RepoEval is a new benchmark for testing the performance of RepoCoder.
RepoCoder significantly improves the zero-shot code completion baseline.

Paper Content

Introduction

Awareness of other files in the repository is important for software production.
Automated tools should take the broader repository context into account for code completion.
Interrelated dependencies, such as shared utilities, configurations, and cross-API invocations, exist in code files.
Repositories have unique naming conventions and coding styles.
Modularization and customization bring difficulties to automatic code completion tools.
RepoCoder is proposed to bridge the gap between the retrieval context and the intended completion target.
The RepoEval benchmark is created to evaluate repository-level code completion.
RepoCoder significantly outperforms the zero-shot code completion paradigm.

Methodology

Overall framework

Task of code completion using language model M can be characterized as Ŷ = M(X)
RepoCoder is a framework that integrates code generation and retrieval models
Repository code files are partitioned into code snippets
Retrieval model R is used to obtain relevant code snippets from C_repo
Language model M is used to perform retrieval-augmented generation
New query is established using Ŷ^1 for retrieval
Final prediction is obtained as Ŷ = M(C_ret^2, X)
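
The two-step flow in the notes above (retrieve, generate, then retrieve again with the first prediction) can be sketched as a short loop. This is a minimal sketch, not the authors' implementation: `repocoder_complete`, `retrieve`, and `generate` are hypothetical names, the retriever R and model M are passed in as plain callables, and building the second query by appending Ŷ^1 to X is an assumption, since the notes only say that Ŷ^1 is used for retrieval.

```python
from typing import Callable, List

# Hypothetical signatures: R(C_repo, query) -> C_ret and M(C_ret, X) -> Y.
Retriever = Callable[[List[str], str], List[str]]
Generator = Callable[[List[str], str], str]


def repocoder_complete(
    repo_snippets: List[str],
    unfinished_code: str,
    retrieve: Retriever,
    generate: Generator,
    iterations: int = 2,
) -> str:
    """Alternate retrieval and generation for a fixed number of iterations."""
    query = unfinished_code  # the first query is built from X alone
    prediction = ""
    for _ in range(iterations):
        retrieved = retrieve(repo_snippets, query)         # C_ret^i
        prediction = generate(retrieved, unfinished_code)  # Y^i = M(C_ret^i, X)
        # Assumption: the next query is X extended with the latest prediction.
        query = unfinished_code + prediction
    return prediction
```

With `iterations=1` the loop reduces to a single retrieval-augmented generation pass; with `iterations=2` the returned value corresponds to the final prediction described above.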

Generation-augmented retrieval

Retriever employed in RepoCoder framework can be any model capable of searching for relevant documents.
Retrieval database is constructed using a sliding window which scans files and extracts contiguous lines of code.
Query is formulated using last S_w lines of unfinished code X and most similar code snippets are retrieved.
Generate-then-retrieve paradigm proposed to mitigate issue of indiscriminately shifting all retrieved code snippets.
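
The sliding-window database and the query described above might look roughly like the following. It is a sketch under stated assumptions: `build_windows`, `build_query`, `window_size` (standing in for S_w), and `slide_size` are hypothetical names, and the notes do not say how the final lines of a file are handled, so this version simply stops once a full window no longer fits at the current stride.

```python
from typing import List


def build_windows(lines: List[str], window_size: int, slide_size: int) -> List[str]:
    """Cut a file into overlapping snippets of contiguous lines."""
    if window_size <= 0 or slide_size <= 0:
        raise ValueError("window_size and slide_size must be positive")
    if not lines:
        return []
    snippets = []
    # Files shorter than one window still yield a single (short) snippet.
    last_start = max(len(lines) - window_size, 0)
    for start in range(0, last_start + 1, slide_size):
        snippets.append("\n".join(lines[start:start + window_size]))
    return snippets


def build_query(unfinished_code: str, window_size: int) -> str:
    """Use the last `window_size` lines of the unfinished code X as the query."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return "\n".join(unfinished_code.splitlines()[-window_size:])
```

A smaller `slide_size` gives more overlapping snippets and a larger database; the trade-off between coverage and index size is a design choice the notes do not specify.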

Retrieval-augmented generation

Generator used in RepoCoder framework is a pre-trained language model.
Need to incorporate global and local context for code completion.
Goal is to find optimal method of prompt construction.
Code snippets presented in ascending order based on similarity scores.
Prompt contains both global and local information needed for code completion.
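
One way to read the prompt layout above is: retrieved snippets first, sorted so the most similar one sits closest to the unfinished code, then the unfinished code itself. The sketch below illustrates that ordering only. `build_prompt` and the `# Retrieved snippet:` header are hypothetical, and turning snippets into comments is an assumption about formatting, not something the notes state.

```python
from typing import List, Tuple


def build_prompt(scored_snippets: List[Tuple[float, str]], unfinished_code: str) -> str:
    """Place retrieved snippets (least similar first) above the unfinished code."""
    # Ascending similarity: the best match ends up right above the local context.
    ordered = sorted(scored_snippets, key=lambda pair: pair[0])
    blocks = []
    for _score, snippet in ordered:
        # Assumption: snippets are commented out so they read as reference material.
        commented = "\n".join("# " + line for line in snippet.splitlines())
        blocks.append("# Retrieved snippet:\n" + commented)
    # Global context (snippets) comes first, local context (unfinished code) last.
    return "\n\n".join(blocks + [unfinished_code])
```

A real prompt builder would also need to truncate the snippet list to fit the model's context window, which this sketch leaves out.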

Benchmark construction

Task of code completion in software repositories is common
Proposed RepoEval benchmark to evaluate code completion tools
Benchmark covers 3 levels of code completion granularity: line, API invocation, and function body
Utilize unit tests to evaluate correctness of completed functions
RepoEval is publicly available and labeled with source repository, file path, line numbers, and ground truth completion
8 repositories selected that meet criteria of open-source license, created after Jan 1, 2022, non-fork, more than 100 stars, >80% Python files, and explicit unit tests
1600 test samples for line completion, 1600 for API invocation, and 455 for function body completion

Experimental setup

Implementation details

Evaluate two distinct retrieval methods for RepoCoder
First method is a sparse bag-of-words model
Second method is a dense text embedding model
Test four pre-trained language models with varying code generation capabilities
Carefully consider various hyper-parameters to optimize performance
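
To make the sparse option concrete, here is a small token-overlap retriever. The notes only say "bag-of-words", so the specific choices here are assumptions: similarity is the Jaccard index over sets of tokens, the regex tokenizer is a simplification, and `tokenize`, `jaccard`, and `top_k` are hypothetical names.

```python
import re
from typing import List, Set, Tuple


def tokenize(code: str) -> Set[str]:
    """Split code into a set of identifier-like and numeric tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard index of the two token sets: |A & B| / |A | B|."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def top_k(query: str, snippets: List[str], k: int = 10) -> List[Tuple[float, str]]:
    """Return the k snippets most similar to the query, best first."""
    scored = [(jaccard(query, snippet), snippet) for snippet in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

A dense retriever would replace `jaccard` with a similarity between embedding vectors while keeping the same `top_k` interface.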

Methods for comparison

Previous studies have shown that large pre-trained language models can generate code in a zero-shot manner.
Utilizing intra-file context is valuable for code completion scenarios.
RepoCoder mitigates the issue of omitted context by adding it to the retrieval database.
RepoCoder uses generation-augmented retrieval to bridge the gap between retrieval and the intended completion target.
An oracle retrieval-augmented generation method is used to compare the efficacy of this approach.

Evaluation metrics

Line and API completion datasets are evaluated with Exact Match (EM) and Edit Similarity (ES) metrics
EM score is binary (1 if predicted code matches ground truth, 0 otherwise)
ES metric is calculated using Levenshtein distance
Unit tests used to evaluate function body completion dataset
Pass Rate (PR) reported (1 if code passes all test cases, 0 otherwise)
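
The two string metrics above are easy to state in code. This is a sketch under assumptions: the notes say ES is based on Levenshtein distance but not how it is normalized, so dividing by the longer string's length is an assumption, as is comparing raw strings without stripping whitespace. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def exact_match(prediction: str, truth: str) -> int:
    """1 if the prediction equals the ground truth, 0 otherwise."""
    return int(prediction == truth)


def edit_similarity(prediction: str, truth: str) -> float:
    """1 minus the edit distance, normalized by the longer string (an assumption)."""
    longest = max(len(prediction), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, truth) / longest
```

Dataset-level scores would then be the mean of these per-sample values.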

Experimental results

Line and API completion datasets

RepoCoder consistently and significantly improves the zero-shot performance on both datasets across all model sizes and retrieval methods.
RepoCoder outperforms RG-1 (a single retrieval-generation iteration) across all settings and gets close to the oracle performance.
RepoCoder even outperforms the zero-shot CODE-DAVINCI-002 model.
Simple sparse retriever has equivalent performance to the dense retriever.

Function completion dataset

Evaluated performance of RepoCoder on function body completion dataset
Used most powerful CODE-DAVINCI-002 model
Used sparse retriever
Performance of RepoCoder similar to line and API completion datasets
RepoCoder outperforms zero-shot baseline and one Retrieval-Generation iteration
Performance of RepoCoder close to oracle method

Statistics on retrieved code snippets

RepoCoder integrates code snippets into the prompt to provide context
Study examines impact of retrieved code snippets
Results show positive correlation between higher line overlap and better performance
Results show high token overlap between retrieved code snippets and ground truth completion

Context locations of effective retrieval

Retrieved code snippets in the prompt bring relevant context from other files in the repository
Study performed to understand impact of different context locations
Identified 2,364 and 1,866 code snippets for line and API completion datasets
Classification scheme of five distinct file locations used to locate original source of code snippets
Majority of code snippets located within defined categories
Majority of code snippets originate from files with “Similar Import”, “Similar Name”, or “Current Directory” locations
Stronger need for information obtained from other files in API completion scenarios

Code duplication in repositories

Results show that RepoCoder’s performance is positively correlated with the code duplication ratio of a repository.
Highest duplication ratio results in highest performance improvement for RepoCoder.
Correlation between RepoCoder’s performance and the code duplication ratio is not absolute.

Study on failed cases

RepoCoder effectiveness and limitations investigated
Results suggest improvement with use of zero-shot, RG-1, RG-2, and oracle methods
Manual case study reveals majority of failures caused by misguided code retrievals
Model predictions not always suitable for retrieval
Language models sensitive to given prompts and small variations in code snippets
Need for more accurate evaluation methods

Related work

Global context in code completion is a challenge
Conventional code completion techniques analyze code and re-rank candidate suggestions
Code completion can also be approached as a language modeling task
Pre-trained language models have gained attention in code completion
Joint modeling of retrieval and generation is being explored for code generation
In-context joint retrieval and generation is a growing trend

Conclusion and future work

RepoCoder is a framework for repository-level code completion
Utilizes a similarity-based retriever and a pre-trained language model
Iterative retrieval and generation bridge the gap between retrieval context and the intended target
Experiments show RepoCoder improves zero-shot code completion performance
Comprehensive analysis provides insights into effectiveness and limitations of RepoCoder
