Learning Real Bug Detectors

This is the official repository for the paper: On Distribution Shift in Learning-based Bug Detectors.

Setup

The code requires Python 3 (we use Python 3.9) and the Python packages listed in requirements.txt, which can be installed via pip install -r requirements.txt. Make sure to add the root of this repository to PYTHONPATH.
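For example, a minimal setup could look as follows (the clone URL follows from the repository name; adjust paths to your environment):

$ git clone https://github.com/eth-sri/learning-real-bug-detector.git
$ cd learning-real-bug-detector
$ pip install -r requirements.txt          # install the required Python packages
$ export PYTHONPATH="$PWD:$PYTHONPATH"     # make the repository importable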

Downloading Datasets and Models

We provide the following resources for download:

  • Our datasets: link.
  • Our fine-tuned models: link.
  • Pretrained models (converted from CuBERT, including the tokenizer vocabulary): link.

After downloading and decompressing the above files, the directory structure should be organized as follows:

└──learning-real-bug-detector
   │
   └──dataset
   │
   └──fine-tuned
   │
   └──pretrained

Running the Code

You can run the code via the scripts under the scripts/ directory.

Evaluation and Fine-tuning

Evaluation can be done with the command below, where TASK_NAME is the bug type (var-misuse, wrong-binary-operator, or argument-swap) and MODEL_NAME is the name of the model (e.g., model if you use our fine-tuned models). Optionally, you can use the --probs_file option to store the prediction results and then use calculate_ap.py to compute average precision, as in the example after the command.

(scripts/) $ python eval.py --task TASK_NAME --model MODEL_NAME
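For instance, to evaluate our fine-tuned variable-misuse model and compute average precision (the probs file path is illustrative, and the argument form of calculate_ap.py is an assumption; check the script for its exact interface):

(scripts/) $ python eval.py --task var-misuse --model model --probs_file probs_var-misuse.txt
(scripts/) $ python calculate_ap.py --probs_file probs_var-misuse.txt   # argument name assumed; see the script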

Fine-tuning can be done with the command below, where DATASET_NAME can be real, synthetic, or contrastive. The paper describes a two-phase training scheme: first fine-tune with --dataset contrastive, then with --dataset real, using --pretrained to continue from the checkpoint of the first phase (see the sketch after the command). The other fine-tuning parameters default to the best-performing values from the paper's evaluation.

(scripts/) $ python fine-tune.py --task TASK_NAME --model MODEL_NAME --dataset DATASET_NAME
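A sketch of the two-phase scheme for var-misuse could look as follows; the model names and the value passed to --pretrained are assumptions about how checkpoints are referenced:

(scripts/) $ python fine-tune.py --task var-misuse --model model-phase1 --dataset contrastive
(scripts/) $ python fine-tune.py --task var-misuse --model model-phase2 --dataset real --pretrained model-phase1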

Constructing Datasets from Scratch

If you are interested in constructing the datasets from scratch, you need to clone eth_py150_open, download py150_files, and install near-duplicate-code-detector. For var-misuse and wrong-binary-operator, the datasets constructed from the eth_py150 repositories contain a sufficient number of real bugs. For argument-swap, more repositories are needed to produce enough real bugs. The directory structure should be organized as follows:

└──learning-real-bug-detector
   │
   └──data
      │
      └──near-duplicate-code-detector
      │
      └──eth_py150_open
      │
      └──py150_files
         │
         └──data
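For instance, the prerequisites could be placed under data/ roughly as follows (the URLs are placeholders for the respective projects; consult their pages for the exact clone, download, and install steps):

(learning-real-bug-detector/) $ mkdir -p data && cd data
(data/) $ git clone <URL of eth_py150_open>
(data/) $ git clone <URL of near-duplicate-code-detector>
(data/) $ # download py150_files here and extract it so that it contains its data directory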

Then run the following commands:

(scripts/data_gen_real/) $ python clone_repos.py --in_file all_py150_repos.txt
(scripts/data_gen_real/) $ ./run_real_bugs_from_repo.sh TASK_NAME
(scripts/data_gen_synthetic/) $ python gen_jsontxt.py --task TASK_NAME
(scripts/data_gen_synthetic/) $ python clean_jsontxt.py --task TASK_NAME
(scripts/data_gen_real/) $ python split_real.py --task TASK_NAME
(scripts/data_gen_synthetic/) $ python filter_train_data.py --task TASK_NAME
(scripts/data_gen_synthetic/) $ python gen_synthetic_train_data.py --task TASK_NAME
(scripts/data_gen_synthetic/) $ python gen_synthetic_train_data.py --task TASK_NAME --contrastive
