A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates, and label errors.
Publications: SelfClean Paper (NeurIPS24) | Data Cleaning Protocol Paper (ML4H23@NeurIPS)
NOTE: Make sure to have git-lfs installed before pulling the repository to ensure the pre-trained models are pulled correctly (git-lfs install instructions).
This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.
Install SelfClean via PyPI:
# upgrade pip to its latest version
pip install -U pip
# install selfclean
pip install selfclean
# Alternatively, use explicit python version (XX)
python3.XX -m pip install selfclean
You can run SelfClean in a few lines of code:
import copy

from selfclean import SelfClean

selfclean = SelfClean(
    # displays the top-7 images from each error type
    # per default this option is disabled
    plot_top_N=7,
)

# run on pytorch dataset
# (`dataset` is a PyTorch dataset you have already created)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)

# run on image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)
# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
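
Once you have a ranking, a common next step is to pick the most suspicious samples for manual review. The following is a minimal sketch of that idea in plain Python, not part of the SelfClean API. It assumes a ranking can be represented as (sample index, score) pairs in which a lower score means a more likely issue; the helper `select_candidates`, the `max_score` threshold, and the example values are all hypothetical, so check the actual columns and score orientation of the returned data before adapting it.

```python
from typing import List, Tuple


def select_candidates(
    ranking: List[Tuple[int, float]],
    max_score: float,
) -> List[int]:
    """Return sample indices with score <= max_score, most suspicious first.

    Assumption (hypothetical): a lower score means a more likely issue.
    """
    # sort ascending so the most suspicious samples come first
    ordered = sorted(ranking, key=lambda item: item[1])
    return [idx for idx, score in ordered if score <= max_score]


# hypothetical ranking of (sample index, score) pairs, for illustration only
example_ranking = [(12, 0.91), (3, 0.02), (47, 0.10), (8, 0.55)]

# keep only the samples that fall at or below the chosen threshold
to_review = select_candidates(example_ranking, max_score=0.2)
print(to_review)
```

A fixed threshold is only one option; reviewing the top-ranked samples until issues become rare works just as well and avoids choosing a cut-off in advance.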
Examples:
In examples/, we've provided some example notebooks in which you will learn how to analyze and clean datasets using SelfClean.
These examples analyze different benchmark datasets such as:
- Imagenette 🖼️ (Open in NBViewer | GitHub | Colab)
- Oxford-IIIT Pet 🐶 (Open in NBViewer | GitHub | Colab)
Also, check out our Kaggle notebook to see an illustration of how to get a gold medal for cleaning a competition dataset.
Run make for a list of possible targets.
Run these commands to install the requirements for the development environment:
make init
make install
To run linters on all files:
pre-commit run --all-files
We use the following packages for code and test conventions:
- black for code style
- isort for import sorting
- pytest for running tests