We propose a simulation framework for generating instance-dependent noisy labels via a pseudo-labeling paradigm. We show that this framework generates synthetic noisy labels whose distribution is closer to that of human labels than the distributions produced by independent and class-conditional random flipping. Equipped with controllable label noise, we study the negative impact of noisy labels across a few practical settings to understand when label noise is more problematic. Additionally, using the annotator information available from our simulation framework, we propose a new technique, Label Quality Model (LQM), that leverages annotator features to predict and correct noisy labels. We show that adding LQM as a label correction step before applying existing noisy label techniques further improves the models' performance.
An Instance-Dependent Simulation Framework for Learning with Label Noise.
In this repository, we provide a link to the datasets used in Sections 4 and 5 of the above paper, along with a Colab that demonstrates how to load the data and rater features. We consider four tasks: CIFAR10, CIFAR100, Patch Camelyon, and Cats vs Dogs. For each task, we generate three synthetic noisy label datasets, named "low", "medium", and "high" according to the amount of label noise. The data are stored as TFRecords and the rater features are stored as JSON files.
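To make the "low", "medium", and "high" naming concrete, the amount of label noise in a dataset can be measured as the fraction of examples whose noisy label differs from the clean label. The following is a minimal sketch rather than code from the paper or the Colab: the function names `noise_rate` and `flip_counts` are hypothetical, and the toy labels are made up for illustration.

```python
from collections import Counter


def noise_rate(clean_labels, noisy_labels):
    """Return the fraction of examples whose noisy label differs from the clean one."""
    if len(clean_labels) != len(noisy_labels):
        raise ValueError("label sequences must have the same length")
    if not clean_labels:
        raise ValueError("label sequences must be non-empty")
    flipped = sum(c != n for c, n in zip(clean_labels, noisy_labels))
    return flipped / len(clean_labels)


def flip_counts(clean_labels, noisy_labels):
    """Count (clean, noisy) label pairs, restricted to examples that were flipped."""
    return Counter(
        (c, n) for c, n in zip(clean_labels, noisy_labels) if c != n
    )


# Toy labels, made up for illustration (not taken from the released datasets).
clean = [0, 0, 1, 1, 2, 2, 2, 0]
noisy = [0, 1, 1, 1, 2, 0, 2, 0]

rate = noise_rate(clean, noisy)
counts = flip_counts(clean, noisy)
print(f"noise rate: {rate:.3f}")
print(dict(counts))
```

Running the same two functions on the clean and noisy labels of each released variant is one way to check how the three noise levels compare for a given task.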
The data is available in the noisy label synthetic dataset GCP bucket.
Details of the datasets and examples of data loading are in the Colab example.
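The Colab remains the authoritative reference for loading the data. As a complement, the sketch below shows one way to inspect the top-level structure of a downloaded rater-feature JSON file before writing any parsing code. It deliberately assumes nothing about the schema, which is not described here; the helper name `summarize_json`, the file name, and the stand-in file contents are all hypothetical.

```python
import json
import os
import tempfile


def summarize_json(path):
    """Load a JSON file and describe its top-level structure."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        # JSON object keys are always strings, so sorting them is safe.
        return {"type": "dict", "size": len(data), "sample_keys": sorted(data)[:5]}
    if isinstance(data, list):
        return {"type": "list", "size": len(data), "sample_keys": []}
    return {"type": type(data).__name__, "size": 1, "sample_keys": []}


# Stand-in file with made-up content, so that the sketch runs on its own.
# Replace `path` with the location of a downloaded rater-feature file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rater_features_example.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"rater_0": [0.1, 0.9], "rater_1": [0.4, 0.6]}, f)
    summary = summarize_json(path)

print(summary)
```

Checking the structure first avoids hard-coding key names that may not match the released files.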
The noisy labels and rater features in our datasets are under the CC0 License. Other parts of the datasets are under the original license of the datasets.
When using the datasets based on CIFAR10/CIFAR100, users are required to attribute the following paper:
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
When using the datasets based on Patch Camelyon, users are required to attribute the following paper:
Rotation Equivariant CNNs for Digital Pathology, Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling, arXiv:1806.03962.
When using the datasets based on Cats vs Dogs, users are required to attribute the following paper:
Asirra: a CAPTCHA that exploits interest-aligned manual image categorization, Jeremy Elson, John R. Douceur, Jon Howell, and Jared Saul, ACM Conference on Computer and Communications Security, 2007.
The colab example is provided under the Apache License, Version 2.0.
Please use the following bibtex for citations to our paper:
@article{gu2021instance,
  title={An Instance-Dependent Simulation Framework for Learning with Label Noise},
  author={Gu, Keren and Masotto, Xander and Bachani, Vandana and Lakshminarayanan, Balaji and Nikodem, Jack and Yin, Dong},
  year={2021}
}
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
property | value
---|---
name | Noisy Label Synthetic Datasets
url | https://github.com/deepmind/deepmind-research/tree/master/noisy_label
sameAs | https://github.com/deepmind/deepmind-research/tree/master/noisy_label
description | Data accompanying [An Instance-Dependent Simulation Framework for Learning with Label Noise]().
provider |
citation | https://identifiers.org/arxiv:2107.11413