DOC improve documentation of NCR #1017

Open
wants to merge 8 commits into
base: master
Changes from 4 commits
28 changes: 24 additions & 4 deletions doc/under_sampling.rst
@@ -347,10 +347,30 @@ place. The class can be used as::
Our implementation offer to set the number of seeds to put in the set :math:`C`
originally by setting the parameter ``n_seeds_S``.

:class:`NeighbourhoodCleaningRule` will focus on cleaning the data than
condensing them :cite:`laurikkala2001improving`. Therefore, it will used the
union of samples to be rejected between the :class:`EditedNearestNeighbours`
and the output a 3 nearest neighbors classifier. The class can be used as::
Neighbourhood Cleaning Rule
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`NeighbourhoodCleaningRule` is another "cleaning" algorithm. It removes
samples from the majority class that are closest to the boundary with the minority
Member

Suggested change
samples from the majority class that are closest to the boundary with the minority
samples from the majority class that are the closest to the boundary formed by the samples of the minority class

Contributor Author

I don't totally understand this sentence. Let me try a modification in a new commit.

:cite:`laurikkala2001improving`.

The :class:`NeighbourhoodCleaningRule` expands on the cleaning performed by
:class:`EditedNearestNeighbours` by eliminating additional majority class samples if
they are among the 3 closest neighbours of a sample from the minority class.
Member

We have a parameter controlling the 3-NN.

Suggested change
they are among the 3 closest neighbours of a sample from the minority class.
they are among the :math:`N` closest neighbours (i.e. using the parameter `n_neighbours`) of a sample from the minority class.

Contributor Author

Throughout the docs we are using K as the number of neighbours, not N. I guess the n in n_neighbours comes from n=number. I'd rather stick to K if that's alright with you, for consistency. I'll fix this in a separate commit.

Contributor Author

Actually, I removed this sentence altogether as per below suggestion.


The procedure for the :class:`NeighbourhoodCleaningRule` is as follows:

1. Remove observations from the majority class with edited nearest neighbors (ENN).
2. Remove additional samples from the majority class if they are one of the k closest
Member

Since we are repeating the same sentence as above, I would remove the paragraph above and keep only the bullet point sequence.

neighbors of a minority sample, where all or most of those neighbors are not minority.
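
The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not imblearn's implementation: the helper names `k_nearest` and `ncr_sketch` are hypothetical, both steps use a simple majority vote among the k neighbours, neighbours are always searched in the original data, and the `threshold_cleaning` check is left out.

```python
from collections import Counter
from math import dist


def k_nearest(i, X, k):
    """Indices of the k samples closest to X[i], excluding X[i] itself."""
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: dist(X[i], X[j]))[:k]


def ncr_sketch(X, y, k=3):
    """Return the indices kept after the two cleaning steps (hypothetical helper)."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    to_remove = set()
    for i, label in enumerate(y):
        neighbours = k_nearest(i, X, k)
        # Majority vote among the k neighbours (an assumption of this sketch).
        predicted = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if predicted == label:
            continue
        if label != minority:
            # Step 1 (ENN-like): a majority sample that disagrees with its
            # neighbourhood is removed.
            to_remove.add(i)
        else:
            # Step 2: a minority sample is misclassified by its neighbours,
            # so the non-minority samples among those neighbours are removed.
            to_remove.update(j for j in neighbours if y[j] != minority)
    return [i for i in range(len(y)) if i not in to_remove]


# Toy 1-D data: class 0 is the majority class, class 1 the minority class.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,),
     (10.0,), (11.0,), (12.0,), (10.5,), (2.4,)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

kept = ncr_sketch(X, y, k=3)
print(kept)
```

In this toy data, the majority sample at 10.5 sits among minority samples and is dropped by step 1, while the minority sample at 2.4 is surrounded by majority samples, so its three majority neighbours are dropped by step 2. Minority samples are never removed.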

To carry out step 2 there is one condition: a sample will only be removed if its class
has a minimum number of observations. The minimum number of observations is regulated
by the `threshold_cleaning` parameter. In the original article
:cite:`laurikkala2001improving`, samples would be removed if the class had at
Member

I would not go into detail about the original paper. Instead, just state that we check that the number of samples in the class to under-sample is above the threshold times the number of samples in the minority class.

least half as many observations as those in the minority class.
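
The threshold check can be illustrated with a short sketch. It assumes the reference count is the size of the minority class, as the review comment above suggests, and a strict inequality as in the docstring formula. The helper name `classes_to_clean` is hypothetical and is not part of imblearn.

```python
from collections import Counter


def classes_to_clean(y, threshold_cleaning=0.5):
    """Return the non-minority classes eligible for the second cleaning step.

    Hypothetical helper. Assumption: a class qualifies when its sample count
    is strictly greater than threshold_cleaning times the minority class count.
    """
    counts = Counter(y)
    minority_class = min(counts, key=counts.get)
    reference = counts[minority_class] * threshold_cleaning
    return sorted(
        label
        for label, n_samples in counts.items()
        if label != minority_class and n_samples > reference
    )


# Three classes with 100, 30 and 10 samples; class 2 is the minority class.
y = [0] * 100 + [1] * 30 + [2] * 10
print(classes_to_clean(y, threshold_cleaning=0.5))
```

With a reference of `10 * 0.5 = 5` samples, both non-minority classes qualify in this example. If the reference were the size of the whole data set instead, as the docstring wording in this diff states, the same threshold would select only class 0.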

The class can be used as::

>>> from imblearn.under_sampling import NeighbourhoodCleaningRule
>>> ncr = NeighbourhoodCleaningRule(n_neighbors=11)
@@ -29,7 +29,8 @@
class NeighbourhoodCleaningRule(BaseCleaningSampler):
"""Undersample based on the neighbourhood cleaning rule.

This class uses ENN and a k-NN to remove noisy samples from the datasets.
This class uses ENN and a k-NN to remove noisy samples from the majority class or
classes.

Read more in the :ref:`User Guide <condensed_nearest_neighbors>`.

@@ -46,7 +47,8 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
If ``int``, size of the neighbourhood to consider to compute the
K-nearest neighbors. If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors. By default, it will be a 3-NN.
find the nearest-neighbors. By default, it explores the 3 closest
neighbors.

kind_sel : {{"all", "mode"}}, default='all'
Strategy to use in order to exclude samples in the ENN sampling.
@@ -65,32 +67,33 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
`"all"` strategy.

threshold_cleaning : float, default=0.5
Threshold used to whether consider a class or not during the cleaning
after applying ENN. A class will be considered during cleaning when:
Threshold used to determine if further samples will be removed from a certain
majority class during the cleaning step that follows the ENN. Additional
samples will be removed during the second step when:

Ci > C x T ,

where Ci and C is the number of samples in the class and the data set,
respectively and theta is the threshold.
where Ci is the number of samples in the class, C is the number of samples in
the data set, and T is the threshold.

{n_jobs}

Attributes
----------
sampling_strategy_ : dict
Dictionary containing the information to sample the dataset. The keys
corresponds to the class labels from which to sample and the values
correspond to the class labels from which to sample and the values
are the number of samples to sample.

edited_nearest_neighbours_ : estimator object
The edited nearest neighbour object used to make the first resampling.

nn_ : estimator object
Validated K-nearest Neighbours object created from `n_neighbors` parameter.
Validated K-nearest Neighbours object created from the `n_neighbors` parameter.

classes_to_clean_ : list
The classes considered with under-sampling by `nn_` in the second cleaning
phase.
The classes that satisfy the condition for further under-sampling during the
second cleaning phase.

sample_indices_ : ndarray of shape (n_new_samples,)
Indices of the samples selected.