This repository contains replication code for the paper "Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study".
Ensure you have Node.js and Python 3 installed. Then install the dependencies:

```sh
pip install -r requirements.txt
npm install
```
The dataset can be retrieved from Zenodo and should be extracted to the `./data` folder of this repository.
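For example, assuming the archive was downloaded from Zenodo as `dataset.zip` (a placeholder name; the actual file name may differ):

```sh
# Extract the Zenodo archive into ./data ("dataset.zip" is a placeholder name).
mkdir -p ./data
unzip dataset.zip -d ./data
```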
The two main files to run for replication are `create-datasets.sh` and `evaluate.sh`. Both should be run from the root directory of the project (i.e. directly in the `Replication-Code` folder).
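For example, assuming a POSIX shell:

```sh
cd Replication-Code      # the project root
bash create-datasets.sh  # build all dataset variants
bash evaluate.sh         # post-process and evaluate predictions
```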
`create-datasets.sh` does the following:
- Copy the dataset and add marker comments (`/*marker:number*/`; see the illustration after this list)
- Copy the marked dataset, install third-party dependencies, and add type annotations
- Determine which projects had all dependencies installed successfully
- Copy the marked dataset and remove all type annotations
- Analyze the dataset (#LOC, #Files, Type Explicitness)
- Create train/test/validation files for consumption by UniXcoder, CodeGPT, and InCoder. Note that UniXcoder and CodeGPT use the same input files: in practice, only files for UniXcoder are shown, but these are intended to be used for both UniXcoder and CodeGPT.
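As a rough, hypothetical illustration of the marker format and of the explicit/untyped variants (this snippet is not taken from the dataset, and actual marker placement is determined by the scripts):

```ts
// Hypothetical TypeScript example: the same line in its explicitly typed
// and untyped forms, each tagged with a /*marker:number*/ comment.
const sum = (a: number, b: number): number => a + b; /*marker:1*/
const sumUntyped = (a, b) => a + b; /*marker:2*/
```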
`evaluate.sh` does the following:
- Post-process the predictions
- Evaluate the post-processed predictions and compute all metrics (both for complete lines and single tokens)
- Perform the statistical analysis
This is done for every model.
Note that this script expects a `predictions` folder to be present inside the `data` folder. The `predictions` folder should have subfolders of the format `./data/predictions/<unixcoder|codegpt|incoder>/<normal|untyped|explicit>-<all|none|docblock|single_line|multi_line>/`, each containing the respective `test.json` file for the model & dataset, and a `predictions.txt` file generated based on this `test.json` file.
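For instance, the files for the UniXcoder model on the `normal-all` variant would be laid out as:

```
data/
└── predictions/
    └── unixcoder/
        └── normal-all/
            ├── test.json
            └── predictions.txt
```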
Some parameters can be configured through `config.json`.