Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] ENH: SPIDER Sampling Algorithm #603

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

MattEding
Copy link
Contributor

@MattEding MattEding commented Sep 19, 2019

As per requested by the pinned New Methods, I have implemented the Selective Pre-processing of Imbalanced Data (SPIDER) sampling algorithm.

I have developed unit tests based on drawing out a sample dataset and working out by hand what the expected results would be as it is deterministic. See the following notebook for diagrams here.

Currently, the implementation for dense and sparse return the same data points but in different orders (np.lexsort will show they are indeed the same). Consequently this fails the existing unit test that compares dense vs sparse outputs. I would rather not have to require sorting results to ensure that test passes due to the overhead for sorting large datasets. Maybe I can just have a parameter that defaults sort=True to give the option to bypass this issue.

The only other unit test I saw fail with PyTest is a with raises(MESSAGE), but I will fix that later since I am not sure if other developers will want to keep my new sample_type = 'preprocess-strategy'. I had chosen do this because SPIDER does both cleaning and oversampling and I did not feel that inheriting from either was appropriate--oversamplers allow sampling_strategy as a float which does not make sense for this algorithm, and cleaners only really allow for under-sampling sampling strategies which is also inappropriate.

Benchmark results using cross-validation with the mean of 5 folds comparing None, NCR, SMOTE, and the 3 SPIDER variants are here in the PKL, CSV, and PNG folders. I used the Zenodo datasets (excluding set 26 due to it being a large dataset for my local computer to work with in a time effective manner). I set all n_jobs=-1 on my 4 core macbook if you want to infer the time values. Additionally when a scorer was undefined, I left the value as 0 rather than changing it to NaN.

TODO:

  • Resolve the two failing tests I addressed above
  • Online reference explaining the algorithm
  • Possibly re-categorize algorithm
  • Maybe change self._{xyz} to be variables to be explicitly passed into methods to avoid possible multiprocessing mutation complications

@pep8speaks
Copy link

pep8speaks commented Sep 19, 2019

Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-22 00:59:05 UTC

@MattEding MattEding changed the title [MRG] ENH: SPIDER Sampling Algorithm [WIP] ENH: SPIDER Sampling Algorithm Sep 19, 2019
@chkoar
Copy link
Member

chkoar commented Sep 19, 2019

Thanks Matt! That would be a nice addition!

@MattEding
Copy link
Contributor Author

Rough draft for online documentation illustrations of how the SPIDER algorithm works for arbitrary X and y values. The generated plots work as planned; just need to add written text discussing how to interpret graphs. Also needs to be moved from .ipynb to .py that can be understood by sphinx.
https://github.com/MattEding/SPIDER-Algorithm/blob/master/SPIDER-Formulation.ipynb

@codecov
Copy link

codecov bot commented Sep 30, 2019

Codecov Report

Merging #603 into master will decrease coverage by 0.02%.
The diff coverage is 97.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #603      +/-   ##
==========================================
- Coverage   98.46%   98.43%   -0.03%     
==========================================
  Files          82       86       +4     
  Lines        4886     5062     +176     
==========================================
+ Hits         4811     4983     +172     
- Misses         75       79       +4
Impacted Files Coverage Δ
imblearn/utils/_validation.py 100% <100%> (ø) ⬆️
imblearn/combine/_preprocess/base.py 100% <100%> (ø)
imblearn/combine/_preprocess/__init__.py 100% <100%> (ø)
imblearn/utils/tests/test_validation.py 100% <100%> (ø) ⬆️
imblearn/combine/tests/test_spider.py 100% <100%> (ø)
imblearn/combine/__init__.py 100% <100%> (ø) ⬆️
imblearn/combine/_preprocess/_spider.py 96.39% <96.39%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3839df1...53dfc4e. Read the comment docs.

@MattEding
Copy link
Contributor Author

With Travis CI DISTRIB="ubuntu", it appears that this build is using Tensorflow 2.0 as the unit tests are failing with AttributeError: module 'tensorflow' has no attribute 'placeholder'.

@glemaitre
Copy link
Member

I would suggest moving this implementation into smote_variants. The idea behind this move is to benchmark the smote variants on a common benchmark on a large number of datasets and include in imbalanced-learn only the versions that show an advantage. You can see the discussion and contribute to it: https://github.com/gykovacs/smote_variants/issues/14

@MattEding would this strategy would be fine with you?

@chkoar
Copy link
Member

chkoar commented Nov 17, 2019

I believe that SPIDER would be great addition to imblearn package. I need to find some time to thoroughly review it. @MattEding thanks for the patience.

@MattEding
Copy link
Contributor Author

@glemaitre I understand the motivation for not wanting to include it until it has a more proven track record. My only reservation about trying to include it in smote_variants is that SPIDER is unrelated to SMOTE. One of its motivations was to avoid creating synthetic samples:

"Random introduction of synthetic examples by SMOTE may be questionable or difficult to justify in some domains, where it is important to preserve a link between original data and a constructed classifier (e.g., to justify suggested decisions)."
Stefanowski and Wilk, 2008. Selective Pre-processing of Imbalanced Data for Improving Classification Performance.

As a side note, you may want to revisit New Methods inviting people to contribute new sampling techniques by specifying that contributions may need to go through other repositories first.

@glemaitre
Copy link
Member

My only reservation about trying to include it in smote_variants is that SPIDER is unrelated to SMOTE.

This is a really could point, I misread when reviewing the different PR :).

My proposal was a bit too quick. @chkoar has some concerns also regarding my proposal as well. We probably need to open an issue to have a public discussion about it.

So as a summary, let's go-ahead for the inclusion of this method. I should have a bit of time to review it in the coming days. @MattEding could you solve the conflict?

@lgtm-com
Copy link

lgtm-com bot commented Nov 17, 2019

This pull request introduces 4 alerts when merging 504d601 into 9b31677 - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unused import
  • 1 for Syntax error

…s docstring substitution; formatted code syntax to be more congruent with rest of codebase
@lgtm-com
Copy link

lgtm-com bot commented Nov 19, 2019

This pull request introduces 3 alerts when merging 2517bd9 into f356284 - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unused import

@lgtm-com
Copy link

lgtm-com bot commented Nov 19, 2019

This pull request fixes 3 alerts when merging a9cd91c into f356284 - view on LGTM.com

fixed alerts:

  • 2 for Mismatch in multiple assignment
  • 1 for Wrong name for an argument in a call

@lgtm-com
Copy link

lgtm-com bot commented Nov 20, 2019

This pull request introduces 1 alert when merging a8d21e0 into 9a1191e - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

@MattEding
Copy link
Contributor Author

I am unfamiliar with how codecov determines bad lines of code coverage. With try-except blocks being normal control flow in Python--in fact it's considered good practice I am curious as to why these are being flagged as poor coverage. For example, the catching of an UnboundLocalError is a lot cleaner than say locals().get('discard_indices', []). Also I am presuming that the red flags in the helper functions are due to multiple return statements. Is this the case?

@nschlaffke
Copy link

How do the things look like? It would be nice to have this algorithm

@MattEding
Copy link
Contributor Author

@nschlaffke
It's been quite some time since I've worked on SPIDER, but I believe that I have left it as a fully working algorithm--it passes the (very small) toy test data set that I worked out manually. See examples/combine/plot_illustration_spider.py and the unit tests for details.

The things that needed to be fleshed out would be how it integrates with the rest of imblearn, and whether to lower the Code Coverage threshold rejecting try-except clauses or rewrite in different style. For example, I had created a new base class for this since it didn't feel like it would fit the role of over/under sampling. This may not be what the maintainers would desire or may want it to be refactored.

@glemaitre
Copy link
Member

I will try to have a look soon. I had to make some underground work to ensure the compatibility with scikit-learn 0.23 and I should be done with it.

@glemaitre
Copy link
Member

I will check when I get time. We can put a milestone for 0.9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants