diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 9f2795430..8c2f91f70 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -347,10 +347,29 @@ place. The class can be used as::
 Our implementation offer to set the number of seeds to put in the set :math:`C`
 originally by setting the parameter ``n_seeds_S``.
 
-:class:`NeighbourhoodCleaningRule` will focus on cleaning the data than
-condensing them :cite:`laurikkala2001improving`. Therefore, it will used the
-union of samples to be rejected between the :class:`EditedNearestNeighbours`
-and the output a 3 nearest neighbors classifier. The class can be used as::
+Neighbourhood Cleaning Rule
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The :class:`NeighbourhoodCleaningRule` is another "cleaning" algorithm. It removes
+samples from the majority class that are the closest to the boundary they form with
+the samples of the minority class :cite:`laurikkala2001improving`.
+
+The :class:`NeighbourhoodCleaningRule` expands on the cleaning performed by
+:class:`EditedNearestNeighbours` by eliminating additional majority class samples.
+
+The procedure for the :class:`NeighbourhoodCleaningRule` is as follows:
+
+1. Remove observations from the majority class with edited nearest neighbors (ENN).
+2. Remove additional samples from the majority class if they are one of the k closest
+   neighbors of a minority sample, where all or most of those neighbors are not minority.
+
+To carry out step 2 there is one condition: a sample will only be removed if its class
+has a minimum number of observations. The minimum number of observations is regulated
+by the ``threshold_cleaning`` parameter. Samples will only be removed from a target
+class if that class has at least as many observations as the threshold times the number
+of samples in the minority class.
+
+The class can be used as::
 
   >>> from imblearn.under_sampling import NeighbourhoodCleaningRule
   >>> ncr = NeighbourhoodCleaningRule(n_neighbors=11)
diff --git a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
index 188ba32f3..ee13b4c17 100644
--- a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
+++ b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
@@ -29,7 +29,7 @@
 class NeighbourhoodCleaningRule(BaseCleaningSampler):
     """Undersample based on the neighbourhood cleaning rule.
 
-    This class uses ENN and a k-NN to remove noisy samples from the datasets.
+    This class uses ENN and a k-NN to remove noisy samples from the majority class(es).
 
     Read more in the :ref:`User Guide `.
 
@@ -46,7 +46,8 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
         If ``int``, size of the neighbourhood to consider to compute the
         K-nearest neighbors. If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the nearest-neighbors. By default, it will be a 3-NN.
+        find the nearest-neighbors. By default, it explores the 3 closest
+        neighbors.
 
     kind_sel : {{"all", "mode"}}, default='all'
         Strategy to use in order to exclude samples in the ENN sampling.
@@ -65,13 +66,14 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
            `"all"` strategy.
 
     threshold_cleaning : float, default=0.5
-        Threshold used to whether consider a class or not during the cleaning
-        after applying ENN. A class will be considered during cleaning when:
+        Threshold used to determine if further samples will be removed from a certain
+        majority class during the cleaning step that follows the ENN. Additional
+        samples will be removed during the second step when:
 
             Ci > C x T ,
 
-        where Ci and C is the number of samples in the class and the data set,
-        respectively and theta is the threshold.
+        where Ci is the number of samples in the class to be under-sampled, C
+        is the number of samples in the data set, and T is the threshold.
 
     {n_jobs}
 
@@ -79,18 +81,18 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
     ----------
     sampling_strategy_ : dict
         Dictionary containing the information to sample the dataset. The keys
-        corresponds to the class labels from which to sample and the values
+        correspond to the class labels from which to sample and the values
         are the number of samples to sample.
 
     edited_nearest_neighbours_ : estimator object
         The edited nearest neighbour object used to make the first resampling.
 
     nn_ : estimator object
-        Validated K-nearest Neighbours object created from `n_neighbors` parameter.
+        Validated K-nearest Neighbours object created from the `n_neighbors` parameter.
 
     classes_to_clean_ : list
-        The classes considered with under-sampling by `nn_` in the second cleaning
-        phase.
+        The classes that satisfy the condition for further under-sampling during the
+        second cleaning phase.
 
     sample_indices_ : ndarray of shape (n_new_samples,)
         Indices of the samples selected.