[WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification #437
base: master
Conversation
Hello @EvgeniDubov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-12-23 09:29:09 UTC
Codecov Report
@@           Coverage Diff           @@
##           master     #437   +/-   ##
=======================================
  Coverage   98.83%   98.83%
=======================================
  Files          86       86
  Lines        5317     5317
=======================================
  Hits         5255     5255
  Misses         62       62
Continue to review the full report at Codecov.
Cool, I am looking forward to this contribution. Could you make a quick todo list in the first summary?
From what I see:
…eniDubov/imbalanced-learn into hellinger_distance_criterion # Conflicts: # imblearn/tree_split/setup.py
I've pushed the code for Cython build support, and all the automatic checks failed. Please let me know whether there is a way for me to configure these tools or whether they are administered by the maintainers only.
@EvgeniDubov I am getting some time to look at this PR. I was looking at the literature and the original paper. I did not find a clear statement on how to compute the distance in a multiclass problem, which is actually supported by the trees in scikit-learn. @EvgeniDubov @DrEhrfurchtgebietend do you have a reference for multiclass?
I have not done much multi-class classification. I do not even know how it is implemented with the traditional split criterion. Is it possible to set this up to only work for binary classification? Can we release without solving this?
@glemaitre indeed, sklearn's 'gini' and 'entropy' support multiclass, but Hellinger requires some modification to support it. I can contribute that as a separate Cython implementation, preferably in a separate PR.
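For reference, here is a minimal NumPy sketch of the binary-case criterion under discussion, following the Hellinger-distance decision tree (HDDT) formulation: a candidate split is scored by the Hellinger distance between the distributions of the positive and negative classes over the two children. The function name and signature below are illustrative only; the actual PR implements this inside scikit-learn's Cython Criterion interface.

```python
import numpy as np

def hellinger_split_value(n_pos_left, n_neg_left, n_pos_right, n_neg_right):
    """Hellinger distance between the per-class distributions over the two
    children of a candidate binary split (larger means a better split)."""
    n_pos = n_pos_left + n_pos_right   # positives at the parent node
    n_neg = n_neg_left + n_neg_right   # negatives at the parent node
    if n_pos == 0 or n_neg == 0:
        return 0.0                     # a pure parent node cannot be split usefully
    dist = 0.0
    for pos_child, neg_child in ((n_pos_left, n_neg_left),
                                 (n_pos_right, n_neg_right)):
        # Compare P(child | positive) with P(child | negative); the class
        # priors cancel out, which is what makes the criterion skew-insensitive.
        dist += (np.sqrt(pos_child / n_pos) - np.sqrt(neg_child / n_neg)) ** 2
    return np.sqrt(dist)
```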
Oh nice, I did not look at this paper, only the 2009 and 2011 ones. It seems that I missed it.
I think that we should have the multi-class criterion directly. The reason is that we don't have a mechanism for raising an error if the criterion is used on a multiclass problem. However, it seems quite feasible to implement Algorithm 1 from the paper that you attached. Regarding the Cython file, could you take all the Cython setup from glemaitre@27fffea and just paste your criterion file and tests at the right location (plus documentation)? I would prefer to stay close to that Cython setup (basically, the major change is about the package naming).
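For anyone following along, the Cython setup referred to above roughly follows the numpy.distutils pattern used throughout the project at the time: each sub-package has its own setup.py that registers its extension. A hedged sketch, where the package and module names are placeholders rather than the layout decided in this PR:

```python
# setup.py of the sub-package holding the criterion (illustrative layout only)
import numpy
from numpy.distutils.misc_util import Configuration


def configuration(parent_package="", top_path=None):
    config = Configuration("tree", parent_package, top_path)

    # The .pyx source is cythonized by the top-level build, as for the other
    # extensions in the project.
    config.add_extension(
        "_hellinger_criterion",
        sources=["_hellinger_criterion.pyx"],
        include_dirs=[numpy.get_include()],
    )
    config.add_subpackage("tests")
    return config


if __name__ == "__main__":
    from numpy.distutils.core import setup
    setup(**configuration(top_path="").todict())
```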
Force-pushed from bbf2b12 to 513203c
I am curious whether we need to do something specific about how feature importance is calculated once this change is in. There are two questions here. First, does the standard method, summing the improvement in the criterion over all splits, really generalize to every criterion? I think the answer is yes. Second, even if it does, it may not be the definition we actually want. In an imbalanced setting we would in theory also have imbalanced features (i.e. nearly all the same value) which, if important, would be used high in the tree but not frequently. That would give them a low weight under the current definition. Would defining importance as the average gain when used, instead of the total gain across all uses, be better? To limit the discussion here I put this into a Stack Overflow post.
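To make the "average gain when used" alternative concrete, here is a rough sketch computed from a fitted scikit-learn tree's public tree_ attributes. It is only an illustration of the question above, not something implemented in this PR, and unlike sklearn's feature_importances_ it is not normalized.

```python
import numpy as np

def average_gain_importance(estimator):
    """Average (rather than total) weighted impurity decrease per feature,
    computed from a fitted sklearn decision tree."""
    t = estimator.tree_
    total_gain = np.zeros(t.n_features)
    n_uses = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                       # leaf node: no split here
            continue
        f = t.feature[node]
        # Weighted impurity decrease of this split, in the same spirit as
        # sklearn's feature_importances_ (which sums these per feature).
        gain = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
        total_gain[f] += gain
        n_uses[f] += 1
    return np.divide(total_gain, n_uses,
                     out=np.zeros_like(total_gain), where=n_uses > 0)
```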
There are three points to consider:
Thanks for the feedback @glemaitre. So you agree that, in general, different split criteria could be used to calculate the feature importance? Intuitively, if a criterion is used to build the tree, it makes sense to also use it to define importance. The weighting of the importance by the number of samples at the node is what got me thinking down this path: Hellinger distance is designed to be less sensitive to the number of samples, but I think that only matters when finding the split.
Permutation feature importance is a great method, and I see there are discussions about moving it into sklearn. The purpose of thinking about feature importance this way is to make sure one does not eliminate features that are unimportant in general but crucial in a few rare outlier cases. When doing feature selection it is easy to justify dropping such features when looking at aggregate metrics like RMSE, since changing only a few predictions alters them by a tiny amount. Permutation feature importance would not be sensitive to this either, or at least only as sensitive as the evaluation metric is to such outliers. Do you know of any standard metric for identifying features of this type? Sorry, this has gotten a little off topic.
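As a side note for readers, permutation importance has since landed in scikit-learn as sklearn.inspection.permutation_importance. A small usage sketch on an imbalanced toy problem (the dataset parameters here are arbitrary), scoring with average precision so that the minority class actually drives the importances:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: roughly 5% positives.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Average precision is sensitive to the minority class, unlike plain accuracy,
# so rare-but-crucial features are less likely to be washed out.
result = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                scoring="average_precision", random_state=0)
print(result.importances_mean)
```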
…eniDubov/imbalanced-learn into hellinger_distance_criterion # Conflicts: # doc/over_sampling.rst # doc/whats_new/v0.0.4.rst
- added Hellinger pyd file to MANIFEST
- update cython version requirements in hellinger cython code
…e_criterion # Conflicts: # .gitignore # .travis.yml # appveyor.yml # imblearn/tensorflow/_generator.py # imblearn/tensorflow/tests/test_generator.py # imblearn/utils/_validation.py
@glemaitre @chkoar I've synced with master and got lint, Travis, and AppVeyor issues, none of which were caused by my contribution.
@glemaitre @chkoar My DS team is using the Hellinger distance split criterion from @EvgeniDubov's private repo. We would appreciate it being part of scikit-learn-contrib. We're willing to help move this PR forward in any way possible.
@giladwa1 I am not familiar with the Hellinger distance yet, but if people are willing to help to get this merged I am OK even if it works only for the binary case.
Speaking a bit more with the devs of scikit-learn, I think that it could be integrated into scikit-learn directly. It would only be for the binary case, and we should have good tests and a nice example showing its benefit. The issue in imbalanced-learn is that we would be required to code in Cython, which then adds a lot of burden to the wheel generation; personally I would like to avoid that if possible. This is a somewhat hidden cost.
That is very true.
@glemaitre @chkoar Thanks for the quick reply, I will continue the discussion in the scikit-learn PR scikit-learn/scikit-learn#16478
@glemaitre since this PR has been transferred, could we close it?
Please clarify: has this been added to the main code or not?
Reference Issue
[sklearn] Feature Request: Hellinger split criterion for classification trees #9947
What does this implement/fix? Explain your changes.
Hellinger distance as a tree split criterion: a Cython implementation compatible with sklearn's tree-based classification models.
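A hypothetical usage sketch of what this enables. The import path is a placeholder, and the constructor arguments simply mirror scikit-learn's ClassificationCriterion (number of outputs plus per-output class counts), so the final public API of this PR may well differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder import path; the final package layout is still under discussion.
from imblearn.tree_split import HellingerDistanceCriterion

# One output, two classes -- mirrors the ClassificationCriterion constructor.
# This is an assumption about this PR's API, not a confirmed interface.
hdc = HellingerDistanceCriterion(1, np.array([2], dtype=np.intp))

# scikit-learn's tree-based models accept a Criterion instance in place of
# the usual criterion string, which is how this is expected to plug in.
clf = RandomForestClassifier(criterion=hdc, random_state=0)
```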
Any other comments?
This is my first submission; apologies in advance for the many things I may have missed.
Looking forward to your feedback.