LIICD

This repository hosts two Python implementations of a language-agnostic Incremental Clone Detector capable of detecting Type-1 clones. Both tools were developed as part of a Master's thesis in collaboration with the Software Improvement Group (SIG) in the Netherlands.

Implementations

LIICD (under /Original) implements Hummel's clone-index-based approach, skipping the normalization step.
The LSH-based detector (under /LSH-based) uses Locality Sensitive Hashing (LSH) to find similar files and then calculates the clones for those files.
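
To make the clone-index idea concrete, here is a minimal sketch, not the repository's actual code. It hashes every window of CHUNK_SIZE consecutive raw lines (no normalization, so only Type-1 clones match) and records where each hash occurs. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 6  # mirrors the documented default


def chunk_hashes(lines, chunk_size=CHUNK_SIZE):
    """Yield (start_line, digest) for every window of chunk_size consecutive lines."""
    for start in range(len(lines) - chunk_size + 1):
        window = "\n".join(lines[start:start + chunk_size])
        yield start, hashlib.sha256(window.encode("utf-8")).hexdigest()


def build_index(files, chunk_size=CHUNK_SIZE):
    """Map each chunk digest to the (filename, start_line) positions where it occurs."""
    index = defaultdict(list)
    for name, lines in files.items():
        for start, digest in chunk_hashes(lines, chunk_size):
            index[digest].append((name, start))
    return index


def clone_candidates(index):
    """Keep only the digests that occur at more than one position."""
    return {digest: locs for digest, locs in index.items() if len(locs) > 1}
```

In an incremental setting, a changed file would only need its own entries removed from the index and re-added, rather than rebuilding the whole index.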

Requirements

Python: 3.7+
For both sub-projects, install the dependencies via pip install -r requirements.txt

Usage

1. Generate Config file

Both implementations are run through the main script main.py. Before running it, you need to generate a configuration file that lists the commits to be analyzed. The generate_config.py script does this: it takes the path to a git-tracked repository and the number of commits as parameters and writes the file. The file has the following format:

{
    "commits": [
        {
            "id": "cb8f645e0f",
            "changes": [
                {
                    "type": "M",
                    "filename": "lib/ansible/plugins/loader.py"
                }
            ]
        },
        {
            "id": "564907d8ac",
            "changes": [
                {
                    "type": "A",
                    "filename": "changelogs/fragments/distribution_test_refactor.yml"
                },
                {
                    "type": "R",
                    "filename": [
                        "test/units/module_utils/facts/system/distribution/__init__.py",
                        "test/units/module_utils/facts/system/__init__.py"
                    ]
                },
                {
                    "type": "D",
                    "filename": "test/units/module_utils/facts/system/distribution/fixtures/arch_linux_na.json"
                }
            ]
        }
    ]
}

The generated configuration file, stored under configurations/{project_name}, must then be passed as an argument to the main.py script.
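
As an illustration of the format above, the following sketch reads a configuration file and walks its changes. It is not taken from the repository; load_config and iter_changes are hypothetical helper names. The type codes appear to follow git's status letters (M, A, R, D), and a rename carries a list of two paths instead of a single string, so the helper normalizes every entry to a list.

```python
import json


def load_config(path):
    """Read a generated configuration file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_changes(config):
    """Yield (commit_id, change_type, filenames); filenames is always a list."""
    for commit in config["commits"]:
        for change in commit["changes"]:
            names = change["filename"]
            if isinstance(names, str):
                names = [names]  # single-path changes (M, A, D)
            yield commit["id"], change["type"], names
```

A caller could then write `for commit_id, kind, names in iter_changes(load_config(path)):` and dispatch on `kind`.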

2. Run the Detector

The next step is to run the desired implementation with the required arguments: the path to the repository to be analyzed, the path to the config file, and the number of commits included in that config file.

python -m detector.main -p ~/projects/{my_project}/ -u ~/CloneDetector/configurations/{my_project}_updates.json -c 50

Note: Ensure that the codebase is checked out at HEAD.

main.py Arguments:

-p: The path to the software project to be analyzed
-u: The path to the configuration file that holds the commits to be analyzed
-c: The number of commits to be analyzed

generate_config.py Arguments:

-p: The path of the codebase for which we generate the config
-n: The number of commits to analyze (default 10)
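
For readers curious how such a file could be derived from git history, here is a rough sketch. It assumes the data comes from `git log --name-status`; whether generate_config.py actually works this way is not stated here, and the function names are hypothetical. Rename paths are kept in the order git prints them (old path, then new path).

```python
import subprocess


def git_log(repo_path, n_commits=10):
    """Return `git log` output: one short hash per commit followed by its changes."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "-n", str(n_commits),
         "--name-status", "--pretty=format:%h"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_name_status(log_text):
    """Turn the log text into the {"commits": [...]} structure shown earlier."""
    commits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # blank separator between commits
        if "\t" not in line:
            # A line without tabs is assumed to be a commit hash.
            commits.append({"id": line.strip(), "changes": []})
            continue
        status, *paths = line.split("\t")
        change_type = status[0]  # e.g. "R100" becomes "R"
        filename = paths[0] if len(paths) == 1 else paths
        commits[-1]["changes"].append({"type": change_type, "filename": filename})
    return {"commits": commits}
```

Calling `parse_name_status(git_log(repo_path, 10))` would produce a dict that can be written out with `json.dump`.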

Configuration Parameters

Both implementations include configuration parameters that allow for additional tuning. These, along with their default values, can be found in the config.py file of each subdirectory.

LIICD

CHUNK_SIZE: The number of consecutive lines in each hashed block. It therefore also defines the minimum clone length. (default: 6)
COMMITS: The number of subsequent commits to analyze. The repository under analysis must have at least COMMITS + 1 commits since the intermediate data are constructed from HEAD-COMMITS-1. (default: 2)
SKIP_DIRS: A list of directories that are excluded from the analysis.
SKIP_FILES: A list of file extensions that are excluded from the analysis.
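
The two exclusion lists could be applied along the following lines. This is only a sketch under assumptions: that SKIP_DIRS entries are plain directory names matched against any path component, and that SKIP_FILES entries are extensions written with a leading dot. Check config.py for the actual conventions; is_analyzed is a hypothetical helper.

```python
from pathlib import PurePosixPath


def is_analyzed(path, skip_dirs, skip_files):
    """Return True if the file at `path` should be included in the analysis."""
    parsed = PurePosixPath(path)
    # Exclude the file if any parent directory is in the skip list.
    if any(part in skip_dirs for part in parsed.parts[:-1]):
        return False
    # Exclude the file if its extension is in the skip list.
    return parsed.suffix not in skip_files
```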

LSH-based

CHUNK_SIZE: Identical to LIICD.
COMMITS: Identical to LIICD.
SKIP_DIRS: Identical to LIICD.
SKIP_FILES: Identical to LIICD.
THRESHOLD: The similarity threshold used when comparing files. (default: 0.2)
PERMUTATIONS: The number of hash functions used to generate the MinHash signature. Affects the error rate. (default: 68)
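
To show how PERMUTATIONS and THRESHOLD relate, here is a small pure-Python MinHash sketch. It is illustrative only and does not reflect the library or code the LSH-based detector uses. It assumes each file is represented as a non-empty set of items (for example its lines), simulates each hash function by salting SHA-1 with a seed, and leaves out the LSH banding step that groups candidate files.

```python
import hashlib

PERMUTATIONS = 68  # mirrors the documented default
THRESHOLD = 0.2    # mirrors the documented default


def minhash_signature(items, num_perm=PERMUTATIONS):
    """Return the MinHash signature of a non-empty collection of strings."""
    signature = []
    for seed in range(num_perm):
        # Each seed acts as a separate hash function; keep the smallest value.
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def are_similar(items_a, items_b, threshold=THRESHOLD, num_perm=PERMUTATIONS):
    """Compare two item sets against the similarity threshold."""
    sig_a = minhash_signature(items_a, num_perm)
    sig_b = minhash_signature(items_b, num_perm)
    return estimated_similarity(sig_a, sig_b) >= threshold
```

The estimate's error shrinks roughly with 1/sqrt(k) for k hash functions, which is about 0.12 for k = 68, so more permutations trade running time for accuracy.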
