Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification #437

Open
wants to merge 61 commits into
base: master
Choose a base branch
from

Conversation

EvgeniDubov
Copy link

@EvgeniDubov EvgeniDubov commented Jul 11, 2018

Reference Issue

[sklearn] Feature Request: Hellinger split criterion for classificaiton trees #9947

What does this implement/fix? Explain your changes.

Hellinger Distance as tree split criterion, cython implementation compatible with sklean tree based classification models

Any other comments?

This is my first submission, sorry in advance for the many possible things I've missed.
Looking forward for your feedback.

@pep8speaks
Copy link

pep8speaks commented Jul 11, 2018

Hello @EvgeniDubov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-23 09:29:09 UTC

@codecov
Copy link

codecov bot commented Jul 11, 2018

Codecov Report

Merging #437 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #437   +/-   ##
=======================================
  Coverage   98.83%   98.83%           
=======================================
  Files          86       86           
  Lines        5317     5317           
=======================================
  Hits         5255     5255           
  Misses         62       62

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc9e483...7acac1f. Read the comment docs.

@glemaitre
Copy link
Member

Cool. I am looking forward for this contribution. Could you make a quick todo list in the first summary.

@glemaitre
Copy link
Member

From what I see:

  • We did not have any Cython up to now. So you will need to setup the project for it. You can refer to https://github.com/jakevdp/cython_template
  • You have to solve the issues raised by PEP8.
  • I think that we need to improve the example with some narrative documentation.
  • We need to have a section in the User Guide such that people can find out about this feature and in which case to use it. Probably it could be embedded in the same section than the BalancedBaggingClassifier.
  • You can add a what's new entry as well.

@EvgeniDubov
Copy link
Author

I've pushed the code for cython build support, and all the automatic checks failed.
LGTM failed because my implementation assumes the existence of sklearn/tree/_criterion.pxd and it is missing.
AppVeyor and Travis failed because Cython package is not installed.
I assume that the last will fail on the missing .pxd file too after Cython dependency will be resolved.

Please let me know if there is a way for me to configure these tools or are they administrated by the maintainers only.

@glemaitre
Copy link
Member

@EvgeniDubov I am getting some time to look at this PR.
I will probably make a force push to have a different building system for the Cython but this is not the most important thing.

I was looking at the literature and the original paper. I did not find a clear statement on how to compute the distance in a multiclass problem which is actually supported by the trees in scikit-learn.

@EvgeniDubov @DrEhrfurchtgebietend do you have a reference for multiclass?

@Gitman-code
Copy link

I have not done much multi-class classification. I do not even know how it is implemented with the traditional split criterion. Is it possible to set this up to only work for binary classification? Can we release without solving this?

@EvgeniDubov
Copy link
Author

@glemaitre ineed sklearn's 'gini' and 'entropy' support multiclass but hellinger requires some modification to support it.
Here is a quote from an abstract of a paper on this subject

In this paper we study the multi-class imbalance problem as it relates to decision trees (specifically C4.4 and HDDT), and develop a new multi-class splitting criterion. From our experiments we show that multi-class Hellinger distance decision trees, when combined with decomposition techniques, outperform C4.4.

I can contribute it as a separate Cython implementation, preferably in a separate PR.

@glemaitre
Copy link
Member

Oh nice, I did not look at this paper but the 2009 and 2011. It seems that I missed it.

I can contribute it as a separate Cython implementation, preferably in a separate PR.

I think that we should have the multi-class criterion directly. The reason is that we don't have a mechanism for raising an error if the criterion is used for a multiclass problem. However, it seems that it is quite feasible to implement the algorithm 1 in the paper that you attached.

Regarding the cython file, could you take all the cython setup from glemaitre@27fffea and just paste your criterion file and tests at the right location (+documentation). I prefer to be closer to those Cython setup (basically the major change is about the package naming).

@glemaitre glemaitre force-pushed the master branch 2 times, most recently from bbf2b12 to 513203c Compare September 7, 2018 13:26
@Gitman-code
Copy link

I am curious if we need to do something specific for how feature importance will be calculated after this change is done. There are two questions here. First, does the standard method of the sum of improvement in the criterion really generalize to all criterion. I think the answer is yes but if so then it might not be the case that this is really the definition we want. In an imbalanced case we would in theory have imbalanced features (ie nearly all the same value) which if important would be used high in the tree but not frequently. This would result in a low weight under the current definition. Would a definition of the average gain when used instead of the total gain across all uses be better? To limit discussion here I put this into a SO post.

@glemaitre
Copy link
Member

glemaitre commented Sep 11, 2018

There is 3 points to consider:

  • The feature importance at a node is normalized by the weighted number of samples at that node.
  • The information gain on the top of the tree will be more important than on the bottom of the tree. So a feature used on the top of the tree might be more important than a feature used several time at the bottom of the tree, if the information gain is actually lower.
  • The feature importance computed in the tree is a bias estimator: http://explained.ai/rf-importance/index.html and there is probably nothing to do about that apart of doing a permutation test.

@Gitman-code
Copy link

Thanks for the feedback @glemaitre .

So you agree that different split criteria could be used to calculate the feature importance in general? It seems to intuitively make sense that if it is used to build the tree it makes sense to use it to define importance.

The weighting of the importance by the number of samples at the node was sort of what got me thinking down this path. Hellinger distance is designed to be less sensitive to number of samples but I think that is only a factor in finding the split.

The permutation feature importance is a great method. I see that there are discussions to move it to sklearn.

The purpose of thinking of feature importance in this way is to make sure one does not eliminate features which are unimportant in general but crucial in a few rare outlier cases. When doing feature selection is it easy to justify dropping such features when looking at aggregate metrics like RMSE since changes to only a few predictions will only alter it by a tiny amount. Permutation feature importance would not be sensitive to this either. Or at least it will only be as sensitive as metric you use for evaluation is to such outliers. Do you know of any standard metric for identifying features of this type? Sorry this has gotten a little off topic.

…eniDubov/imbalanced-learn into hellinger_distance_criterion

# Conflicts:
#	doc/over_sampling.rst
#	doc/whats_new/v0.0.4.rst
…e_criterion

# Conflicts:
#	.gitignore
#	.travis.yml
#	appveyor.yml
#	imblearn/tensorflow/_generator.py
#	imblearn/tensorflow/tests/test_generator.py
#	imblearn/utils/_validation.py
@EvgeniDubov
Copy link
Author

@glemaitre @chkoar I've synced with master and got lint, travis and appveyor issues, none of which caused by my contribution
can you please take a look

@giladwa1
Copy link

@glemaitre @chkoar My DS team is using Hellinger distance split criterion from @EvgeniDubov private repo. We would appreciate it being part of scikit-learn-contrib. We're willing to help move this PR forward in any way possible.

@chkoar
Copy link
Member

chkoar commented Feb 17, 2020

@giladwa1 I am not familiar with the Hellinger distance yet but if people are willing to help to get this merged I am ok even if it works only for the binary case.

@glemaitre
Copy link
Member

glemaitre commented Feb 17, 2020 via email

@chkoar
Copy link
Member

chkoar commented Feb 17, 2020

The issue in imbalanced-learn is that we will be required to code in cython and then it had a lot of burden on the wheel generation which personally I would like to avoid if possible.

That is very true.

@giladwa1
Copy link

@glemaitre @chkoar Thanks for the quick reply, I will continue the discussion in the scikit-learn PR scikit-learn/scikit-learn#16478

@chkoar
Copy link
Member

chkoar commented Jul 29, 2020

@glemaitre since this PR transferred could we close this PR?

@Sandy4321
Copy link

pls clarify is it added to main code or not?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants