diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst index 9f2795430..38b87540d 100644 --- a/doc/under_sampling.rst +++ b/doc/under_sampling.rst @@ -125,9 +125,20 @@ It would also work with pandas dataframe:: >>> df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult) >>> df_resampled.head() # doctest: +SKIP -:class:`NearMiss` adds some heuristic rules to select samples -:cite:`mani2003knn`. :class:`NearMiss` implements 3 different types of -heuristic which can be selected with the parameter ``version``:: +NearMiss +^^^^^^^^ + +:class:`NearMiss` is another controlled under-sampling technique. It aims to balance +the class distribution by eliminating samples from the targeted classes. But these +samples are not removed at random. Instead, :class:`NearMiss` removes the instances of +the target class(es) that are farthest from the minority class, thereby reducing the +"space", or separation, between the target class and the minority class. In other +words, :class:`NearMiss` retains the observations from the target class that are +closest to the boundary they form with the minority class samples. + +To find out which samples are closest to the boundary with the minority class, +:class:`NearMiss` uses the K-Nearest Neighbours algorithm. :class:`NearMiss` implements +3 different heuristics, which can be selected with the parameter ``version`` and which +we explain in the coming paragraphs. We can perform this under-sampling as follows:: >>> from imblearn.under_sampling import NearMiss >>> nm1 = NearMiss(version=1) @@ -135,65 +146,75 @@ heuristic which can be selected with the parameter ``version``:: >>> print(sorted(Counter(y_resampled).items())) [(0, 64), (1, 64), (2, 64)] -As later stated in the next section, :class:`NearMiss` heuristic rules are -based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors`` -and ``n_neighbors_ver3`` accept classifier derived from ``KNeighborsMixin`` -from scikit-learn. 
The former parameter is used to compute the average distance -to the neighbors while the latter is used for the pre-selection of the samples -of interest. Mathematical formulation -^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~ + +:class:`NearMiss` uses the K-Nearest Neighbours algorithm to identify the samples of the +target class(es) that are closest to the minority class, as well as the distance that +separates them. -Let *positive samples* be the samples belonging to the targeted class to be -under-sampled. *Negative sample* refers to the samples from the minority class -(i.e., the most under-represented class). +Let *positive samples* be the samples from the class to be under-sampled, and +*negative samples* the samples from the minority class (i.e., the most +under-represented class). -NearMiss-1 selects the positive samples for which the average distance -to the :math:`N` closest samples of the negative class is the smallest. +**NearMiss-1** selects the positive samples whose average distance to the :math:`K` +closest samples of the negative class is the smallest (:math:`K` is the number of +neighbours in the K-Nearest Neighbours algorithm). The following image illustrates the +logic: .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_nearmiss_001.png :target: ./auto_examples/under-sampling/plot_illustration_nearmiss.html :scale: 60 :align: center -NearMiss-2 selects the positive samples for which the average distance to the -:math:`N` farthest samples of the negative class is the smallest. +**NearMiss-2** selects the positive samples whose average distance to the +:math:`K` farthest samples of the negative class is the smallest. The following image +illustrates the logic: .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_nearmiss_002.png :target: ./auto_examples/under-sampling/plot_illustration_nearmiss.html :scale: 60 :align: center -NearMiss-3 is a 2-steps algorithm. 
First, for each negative sample, their -:math:`M` nearest-neighbors will be kept. Then, the positive samples selected -are the one for which the average distance to the :math:`N` nearest-neighbors -is the largest. +**NearMiss-3** is a 2-step algorithm: + +First, for each negative sample, that is, for each observation of the minority class, +it selects the :math:`M` nearest-neighbors from the positive class (the target class). +This ensures that all observations from the minority class have at least some +neighbours from the target class. + +Next, it selects the positive samples whose average distance to the :math:`K` +nearest-neighbors of the minority class is the largest. + +The following image illustrates the logic: .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_nearmiss_003.png :target: ./auto_examples/under-sampling/plot_illustration_nearmiss.html :scale: 60 :align: center -In the next example, the different :class:`NearMiss` variant are applied on the -previous toy example. It can be seen that the decision functions obtained in -each case are different. - -When under-sampling a specific class, NearMiss-1 can be altered by the presence -of noise. In fact, it will implied that samples of the targeted class will be -selected around these samples as it is the case in the illustration below for -the yellow class. However, in the normal case, samples next to the boundaries -will be selected. NearMiss-2 will not have this effect since it does not focus -on the nearest samples but rather on the farthest samples. We can imagine that -the presence of noise can also altered the sampling mainly in the presence of -marginal outliers. NearMiss-3 is probably the version which will be less -affected by noise due to the first step sample selection. +In the following example, we apply the different :class:`NearMiss` variants to a toy +dataset. Note how the decision functions obtained in each case are different (left +plots): .. 
image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html :scale: 60 :align: center +NearMiss-1 is sensitive to noise. In fact, the observations from the target class that +are closest to the minority class samples could well be noise. NearMiss-1 will +nevertheless select those observations, as shown in the first row of the previous +illustration (check the yellow class). + +NearMiss-2 will be less sensitive to noise, since it does not consider the nearest +samples of the target class, but rather the farthest ones. + +NearMiss-3 is probably the version least sensitive to noise, due to the first sample +selection step. + + Cleaning under-sampling techniques ---------------------------------- diff --git a/imblearn/under_sampling/_prototype_selection/_nearmiss.py b/imblearn/under_sampling/_prototype_selection/_nearmiss.py index 70f647fa5..7073a8cf2 100644 --- a/imblearn/under_sampling/_prototype_selection/_nearmiss.py +++ b/imblearn/under_sampling/_prototype_selection/_nearmiss.py @@ -35,20 +35,17 @@ class NearMiss(BaseUnderSampler): n_neighbors : int or estimator object, default=3 If ``int``, size of the neighbourhood to consider to compute the - average distance to the minority point samples. If object, an + average distance to the minority samples. If object, an estimator that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to find the k_neighbors. - By default, it will be a 3-NN. n_neighbors_ver3 : int or estimator object, default=3 - If ``int``, NearMiss-3 algorithm start by a phase of re-sampling. This - parameter correspond to the number of neighbours selected create the - subset in which the selection will be performed. If object, an - estimator that inherits from + Only used if `version=3`. 
If ``int``, the number of target class samples + closest to each minority sample that will be retained in the first subsampling + step. If object, an estimator that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to find the k_neighbors. - By default, it will be a 3-NN. {n_jobs} @@ -56,7 +53,7 @@ class NearMiss(BaseUnderSampler): ---------- sampling_strategy_ : dict Dictionary containing the information to sample the dataset. The keys - corresponds to the class labels from which to sample and the values + correspond to the class labels from which to sample and the values are the number of samples to sample. nn_ : estimator object @@ -144,7 +141,7 @@ def __init__( def _selection_dist_based( self, X, y, dist_vec, num_samples, key, sel_strategy="nearest" ): - """Select the appropriate samples depending on the selected strategy. + Parameters ----------
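The selection this helper performs can be sketched in plain NumPy. The following is a simplified illustration, not the library's actual implementation; the function name and array shapes are hypothetical:

```python
import numpy as np

def select_by_avg_distance(dist_vec, num_samples, strategy="nearest"):
    # Simplified sketch of NearMiss' distance-based selection.
    # dist_vec: (n_positive, k) distances from each positive (target class)
    # sample to the k negative (minority class) neighbours considered.
    # Returns the indices of the num_samples positive samples to keep.
    avg_dist = dist_vec.mean(axis=1)
    order = np.argsort(avg_dist)      # ascending average distance
    if strategy == "nearest":
        return order[:num_samples]    # keep the smallest average distances
    if strategy == "farthest":
        return order[-num_samples:]   # keep the largest average distances
    raise ValueError(f"unknown strategy: {strategy}")
```

NearMiss-1 and NearMiss-2 keep the positives with the smallest average distance (``"nearest"``), while the second step of NearMiss-3 keeps those with the largest (``"farthest"``).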