🧼🔎 SelfClean

A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates, and label errors.

Publications: SelfClean Paper (NeurIPS24) | Data Cleaning Protocol Paper (ML4H23@NeurIPS)

NOTE: Make sure to have git-lfs installed before pulling the repository to ensure the pre-trained models are pulled correctly (git-lfs install instructions).

This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.

Installation

Install SelfClean via PyPI:

# upgrade pip to its latest version
pip install -U pip

# install selfclean
pip install selfclean

# alternatively, use an explicit Python version (replace XX with the minor version)
python3.XX -m pip install selfclean
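
To confirm that the install landed in the interpreter you intend to use, a small check along these lines can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration and is not part of SelfClean.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Reports the version seen by the interpreter running this script.
version = installed_version("selfclean")
print(f"selfclean: {version or 'not installed'}")
```

Running it with the same `python3.XX` binary used for the install shows whether that interpreter can see the package.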

Getting Started

You can run SelfClean in a few lines of code:

import copy

from selfclean import SelfClean

selfclean = SelfClean(
    # display the top-7 images for each issue type
    # (plotting is disabled by default)
    plot_top_N=7,
)

# Option 1: run on a PyTorch dataset
# (`dataset` is a dataset object you have already created)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# Option 2: run on an image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)

# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
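
Once a ranking is available, a common next step is to pick the highest-priority candidates for manual review. The sketch below works on plain records (for example, obtained with pandas' `DataFrame.to_dict("records")`). The column names `indices` and `scores`, the assumption that a lower score means a more likely issue, and the helper `top_candidates` are all illustrative assumptions, so check them against the DataFrame you actually get back.

```python
from typing import Dict, List, Sequence


def top_candidates(
    records: Sequence[Dict[str, float]],
    k: int,
    score_key: str = "scores",  # assumed column name, verify on your DataFrame
    ascending: bool = True,  # assumption: lower score = more likely an issue
) -> List[Dict[str, float]]:
    """Return the k highest-priority rows of an issue ranking."""
    ordered = sorted(records, key=lambda r: r[score_key], reverse=not ascending)
    return ordered[:k]


# Stand-in data with hypothetical columns; with SelfClean you might build it
# from a ranking via df_irrelevants.to_dict("records").
example = [
    {"indices": 12, "scores": 0.91},
    {"indices": 3, "scores": 0.05},
    {"indices": 7, "scores": 0.40},
]
review = top_candidates(example, k=2)
print([row["indices"] for row in review])
```

Keeping the sort direction as a parameter makes it easy to flip the helper if the scores turn out to be ordered the other way.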

Examples: The examples/ folder contains notebooks that show how to analyze and clean datasets using SelfClean. They cover benchmark datasets such as:

Imagenette 🖼️ (Open in NBViewer | GitHub | Colab)
Oxford-IIIT Pet 🐶 (Open in NBViewer | GitHub | Colab)

Also, check out our Kaggle notebook to see an illustration of how to get a gold medal for cleaning a competition dataset.

Development Environment

Run make for a list of possible targets.

Run these commands to install the requirements for the development environment:

make init
make install

To run linters on all files:

pre-commit run --all-files

We use the following packages for code and test conventions:

black for code style
isort for import sorting
pytest for running tests

