[WIP] ENH: SPIDER Sampling Algorithm #603
…s intended behind scenes; benchmark code
Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-11-22 00:59:05 UTC |
Thanks Matt! That would be a nice addition! |
Rough draft for online documentation illustrations of how the SPIDER algorithm works for arbitrary X and y values. The generated plots work as planned; just need to add written text discussing how to interpret graphs. Also needs to be moved from .ipynb to .py that can be understood by sphinx. |
Codecov Report
@@            Coverage Diff             @@
##           master     #603      +/-   ##
==========================================
- Coverage   98.46%   98.43%   -0.03%
==========================================
  Files          82       86       +4
  Lines        4886     5062     +176
==========================================
+ Hits         4811     4983     +172
- Misses         75       79       +4
With Travis CI |
I would suggest moving this implementation into smote_variants. The idea behind this move is to evaluate the SMOTE variants on a common benchmark over a large number of datasets and to include in imbalanced-learn only the versions that show an advantage. You can see the discussion and contribute to it: https://github.com/gykovacs/smote_variants/issues/14 @MattEding would this strategy be fine with you?
I believe that |
@glemaitre I understand the motivation for not wanting to include it until it has a more proven track record. My only reservation about trying to include it in smote_variants is that SPIDER is unrelated to SMOTE. One of its motivations was to avoid creating synthetic samples:
As a side note, you may want to revisit New Methods inviting people to contribute new sampling techniques by specifying that contributions may need to go through other repositories first. |
This is a really good point; I misread when reviewing the different PRs :). My proposal was a bit too quick. @chkoar also has some concerns regarding my proposal. We probably need to open an issue to have a public discussion about it. So, in summary, let's go ahead with the inclusion of this method. I should have a bit of time to review it in the coming days. @MattEding could you resolve the conflict?
This pull request introduces 4 alerts when merging 504d601 into 9b31677 - view on LGTM.com new alerts:
…s docstring substitution; formatted code syntax to be more congruent with rest of codebase
This pull request introduces 3 alerts when merging 2517bd9 into f356284 - view on LGTM.com new alerts:
This pull request fixes 3 alerts when merging a9cd91c into f356284 - view on LGTM.com fixed alerts:
…e "do not"; add cleaning test for minority error
This pull request introduces 1 alert when merging a8d21e0 into 9a1191e - view on LGTM.com new alerts:
I am unfamiliar with how codecov decides which lines count as poorly covered. try-except blocks are normal control flow in Python (in fact, they are considered good practice), so I am curious as to why these are being flagged as poor coverage. For example, the catching of an
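One possible explanation, assuming Codecov is reporting ordinary line coverage: the body of an `except` clause only counts as executed when some test actually raises the exception. The sketch below uses a hypothetical helper (`find_position` is not part of imbalanced-learn) to illustrate the idea.

```python
def find_position(values, target):
    """Hypothetical helper: return the index of target, or -1 if absent."""
    try:
        return values.index(target)
    except ValueError:
        # This line only runs when the lookup fails. If no test passes a
        # missing target, a line-coverage tool reports it as uncovered.
        return -1


# A test that only exercises the happy path never executes the except body.
found = find_position([10, 20, 30], 20)

# A second call that triggers the exception executes the except body as well.
missing = find_position([10, 20, 30], 99)
```

Under that assumption, adding a test for the failing case would raise the reported coverage without changing the try-except style.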
How do things look? It would be nice to have this algorithm
@nschlaffke The things that still need to be fleshed out are how it integrates with the rest of imblearn, and whether to lower the code coverage threshold that is rejecting the try-except clauses or to rewrite them in a different style. For example, I created a new base class for this since it didn't feel like it would fit the role of over- or under-sampling. This may not be what the maintainers want, or they may want it to be refactored.
I will try to have a look soon. I had to do some groundwork to ensure compatibility with scikit-learn 0.23, and I should be done with it.
I will check when I get time. We can put a milestone for 0.9. |
As requested by the pinned New Methods, I have implemented the Selective Pre-processing of Imbalanced Data (SPIDER) sampling algorithm.
I have developed unit tests by drawing out a sample dataset and working out by hand what the expected results would be, since the algorithm is deterministic. See the notebook here for diagrams.
Currently, the dense and sparse implementations return the same data points but in different orders (`np.lexsort` will show they are indeed the same). Consequently this fails the existing unit test that compares dense vs sparse outputs. I would rather not have to require sorting results to ensure that test passes, due to the overhead of sorting large datasets. Maybe I can just have a parameter that defaults to `sort=True` to give the option to bypass this issue.
The only other unit test I saw fail with PyTest is a `with raises(MESSAGE)`, but I will fix that later since I am not sure if other developers will want to keep my new `sample_type = 'preprocess-strategy'`. I had chosen to do this because SPIDER does both cleaning and oversampling and I did not feel that inheriting from either was appropriate: oversamplers allow `sampling_strategy` as a `float`, which does not make sense for this algorithm, and cleaners only really allow for under-sampling sampling strategies, which is also inappropriate.
Benchmark results using cross-validation with the mean of 5 folds comparing None, NCR, SMOTE, and the 3 SPIDER variants are here in the PKL, CSV, and PNG folders. I used the Zenodo datasets (excluding set 26 because it is too large for my local computer to work with in a time-effective manner). I set all `n_jobs=-1` on my 4 core macbook if you want to infer the `time` values. Additionally, when a scorer was undefined, I left the value as 0 rather than changing it to NaN.
TODO:
- Resolve the two failing tests I addressed above
- Online reference explaining the algorithm
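As a rough sketch of the dense-versus-sparse check mentioned above, the snippet below puts two resampled outputs into a canonical row order with `np.lexsort` before comparing them. The helper name `sort_rows` and the toy arrays are illustrative assumptions, not code from this PR; a sparse result would first need converting (for example with `.toarray()`).

```python
import numpy as np


def sort_rows(X, y):
    """Hypothetical helper: reorder (X, y) into a canonical row order."""
    # Stack the labels next to the features so ties in X are broken by y.
    stacked = np.column_stack([X, y])
    # np.lexsort treats the LAST key as the primary one, so reverse the
    # columns to make column 0 of X the primary sort key.
    order = np.lexsort(stacked.T[::-1])
    return X[order], y[order]


# Toy stand-ins for the dense output and the (densified) sparse output:
# same samples, different row order.
X_dense = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.5]])
y_dense = np.array([0, 1, 1])
perm = [2, 0, 1]
X_other, y_other = X_dense[perm], y_dense[perm]

Xa, ya = sort_rows(X_dense, y_dense)
Xb, yb = sort_rows(X_other, y_other)
same_points = np.array_equal(Xa, Xb) and np.array_equal(ya, yb)
```

Sorting only inside the test keeps the overhead out of the sampler itself, which is one alternative to adding a `sort` parameter.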