Fix MRMRFeatureSelectionTransform: change less_is_better logic, add drop_zero mode #314
Conversation
@@ -59,6 +71,10 @@ def mrmr(
    fast_redundancy:
        * True: compute redundancy only inside the segments, time complexity :math:`O(top\_k * n\_segments * n\_features * history\_len)`
        * False: compute redundancy for all the pairs of segments, time complexity :math:`O(top\_k * n\_segments^2 * n\_features * history\_len)`
    drop_zero:
        If True, drop features with zero relevance before MRMR. If top_k is greater number of features
It is not clear to me. "If top_k is greater than the number of features with relevance > 0, select all non-zero relevance features plus randomly selected zero relevance features"?
I select all non-zero relevance features and add some features with zero relevance in order to select exactly top_k features.
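To make the behavior being discussed concrete, here is a minimal sketch of the padding logic the author describes: take every feature with non-zero relevance first, then fill up to ``top_k`` with randomly chosen zero-relevance features. The function name, the dict-based relevance, and the seeding are illustrative stand-ins, not the actual etna API.

```python
import numpy as np


def select_with_drop_zero(relevance, top_k, seed=0):
    """Sketch of drop_zero padding: non-zero relevance features first,
    then random zero-relevance features until exactly top_k are chosen."""
    rng = np.random.default_rng(seed)
    nonzero = [f for f, r in relevance.items() if r != 0]
    zero = [f for f, r in relevance.items() if r == 0]
    if len(nonzero) >= top_k:
        return nonzero[:top_k]
    # pad with randomly selected zero-relevance features
    pad = list(rng.choice(zero, size=top_k - len(nonzero), replace=False))
    return nonzero + pad


selected = select_with_drop_zero({"a": 0.9, "b": 0.0, "c": 0.5, "d": 0.0}, top_k=3)
print(len(selected))  # 3
```

With two non-zero features (`a`, `c`) and ``top_k=3``, one of the zero-relevance features is drawn at random to complete the selection.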
I mean that what is written is very unclear and complicated. Could you rewrite it more clearly?
    filter(lambda feature: is_not_relevant(relevance, feature, atol), not_selected_features)
)
not_selected_features = list(
    filter(lambda feature: is_relevant(relevance, feature, atol), not_selected_features)
You compute the same logic twice here. Could you do it in one pass?
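The reviewer's suggestion can be sketched as a single-pass partition: iterate over the features once and route each into one of two lists, instead of running two `filter` calls with negated predicates. The helper name and the dict-based relevance are hypothetical, for illustration only.

```python
def partition_by_relevance(features, relevance, atol=1e-10):
    """Split features into (relevant, not_relevant) in one iteration."""
    relevant, not_relevant = [], []
    for feature in features:
        if abs(relevance[feature]) > atol:
            relevant.append(feature)
        else:
            not_relevant.append(feature)
    return relevant, not_relevant


rel, not_rel = partition_by_relevance(["a", "b"], {"a": 0.7, "b": 0.0})
print(rel, not_rel)  # ['a'] ['b']
```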
top_k = min(top_k, len(all_features))
if drop_zero is True:
    not_relevant_features = list(
        filter(lambda feature: is_not_relevant(relevance, feature, atol), not_selected_features)
Why do we compare with atol? Why not strict equality to 0?
@@ -235,8 +235,14 @@ def _fit(self, df: pd.DataFrame) -> "MRMRFeatureSelectionTransform":
    df_features = df_features.loc[df_features.first_valid_index() :]
    relevance_table = self.relevance_table(df_target, df_features, **self.relevance_params)

    if relevance_table.values.min() < 0:
You should add this into relevance_table, not here. Here you should have guarantees that all values are >= 0.
Why? We can use the relevance table for other things, and only in MRMR do we need non-negative relevance.
For any relevance table, the importance must be non-negative simply because it is a relevance.
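As a concrete illustration of the guarantee the reviewer asks for, here is a minimal sketch that makes a relevance table non-negative by shifting it, as one possible normalization inside the relevance-table code rather than at the call site. The function name and the shift-by-minimum strategy are assumptions for illustration, not the actual etna implementation.

```python
import pandas as pd


def normalize_relevance(relevance_table: pd.DataFrame) -> pd.DataFrame:
    """Return a relevance table with all values guaranteed >= 0.

    If any value is negative, shift the whole table by the minimum;
    otherwise return it unchanged.
    """
    min_value = relevance_table.values.min()
    if min_value < 0:
        return relevance_table - min_value
    return relevance_table


table = pd.DataFrame({"f1": [0.2, -0.1], "f2": [0.5, 0.3]})
normalized = normalize_relevance(table)
print(normalized.values.min() >= 0)  # True
```

Placing this inside the relevance-table computation, as suggested, would let `_fit` assume non-negative values without a special case.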
@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_drop_zero_mode_sanity_check(relevance_table, ts_with_regressors, fast_redundancy):
Where are the drop_zero mode tests here?
I added a test where top_k > the number of regressors, a test where the number of regressors with non-zero relevance < top_k < the number of regressors, and a sanity check. I think this covers all the main cases. Maybe I should rename some tests.
@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_select_top_k_regressors_in_drop_zero_mode(relevance_table, ts_with_regressors, fast_redundancy):
Where are the drop_zero mode tests here? (x2)
Answer in the next comment
@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_drop_zero_mode_sanity_check(relevance_table, ts_with_regressors, fast_redundancy):
You already have a test with this name, and with this meaning too.
In this test I check that the correct regressors are selected; in the previous test I check that if top_k >= the number of features with non-zero relevance, it still selects top_k features.
tests/test_transforms/test_feature_selection/test_feature_importance_transform.py
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #314      +/-   ##
==========================================
- Coverage   88.84%   86.38%   -2.47%
==========================================
  Files         203      224      +21
  Lines       14328    15258     +930
==========================================
+ Hits        12730    13180     +450
- Misses       1598     2078     +480

☔ View full report in Codecov by Sentry.
@@ -30,11 +30,22 @@ class AggregationMode(str, Enum):
}


def is_relevant(relevance, feature, atol):
Atol?
    return not relevance.loc[feature] == 0


def is_not_relevant(relevance, feature, atol):
Do you really need these functions? You could just add this logic into the filter explicitly, without helper functions.
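An inline version of what the reviewer suggests might look like the following: comprehensions with the predicate written out directly, using `numpy.isclose` to express the atol comparison. The sample relevance values are invented for illustration.

```python
import numpy as np

# Illustrative inputs; in the real transform these come from the relevance table.
relevance = {"a": 0.0, "b": 0.42, "c": 1e-12}
atol = 1e-10
not_selected_features = ["a", "b", "c"]

# Inline predicates instead of separate is_relevant / is_not_relevant helpers.
not_relevant = [f for f in not_selected_features if np.isclose(relevance[f], 0, atol=atol)]
relevant = [f for f in not_selected_features if not np.isclose(relevance[f], 0, atol=atol)]
print(not_relevant, relevant)  # ['a', 'c'] ['b']
```

Note that `c` (relevance 1e-12) is classified as not relevant because it falls within `atol` of zero, which is the tolerance behavior being debated above.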
@@ -83,25 +83,6 @@ def test_mrmr_right_regressors(df_with_regressors, relevance_method, expected_re
    assert set(selected_regressors) == set(expected_regressors)


@pytest.mark.parametrize("fast_redundancy", [True, False])
why?
for column in df_selected.columns.get_level_values("feature"):
    if column.startswith("regressor"):
        selected_regressors.add(column)
assert len(selected_regressors) == 15
15 is a magic constant. Take it from ts_with_regressors
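The reviewer's point can be sketched as follows: derive the expected count from the fixture data instead of hard-coding the literal 15. The fixture shape here is a stand-in for illustration, not the real `ts_with_regressors` object.

```python
# Stubbed fixture data; the real ts_with_regressors is an etna TSDataset.
ts_with_regressors = {"regressors": [f"regressor_{i}" for i in range(15)]}

selected_regressors = set(ts_with_regressors["regressors"])

# Expected count comes from the fixture, so the test does not break
# silently if the fixture's regressor count ever changes.
expected = len(ts_with_regressors["regressors"])
assert len(selected_regressors) == expected
print(expected)
```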
@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_select_top_k_regressors_in_drop_zero_mode(relevance_table, ts_with_regressors, fast_redundancy):
    """Check that transform selects right top_k regressors."""
I think it's the wrong description.
-------
selected_features: List[str]
    list of ``top_k`` selected regressors, sorted by their importance
Maximum Relevance and Minimum Redundancy feature selection method.
Why did you add spaces here?
It is still relevant.
* True: compute redundancy only inside the segments, time complexity :math:`O(top\_k * n\_segments * n\_features * history\_len)`
* False: compute redundancy for all the pairs of segments, time complexity :math:`O(top\_k * n\_segments^2 * n\_features * history\_len)`
drop_zero:
    * True: use only features with relevance > 0 in calculations, if their number is less than zero
"less than top_k", maybe?
for column in df_selected.columns.get_level_values("feature"):
    if column.startswith("regressor"):
        selected_regressors.add(column)
assert len(selected_regressors) == 10
If you know all the relevant features, add a check for their presence.
The purpose of this test is to check that in drop_zero mode, top_k features are selected even when there are fewer features with relevance > 0 than top_k; a specific list of regressors is not needed here.
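As a sketch of the presence check the reviewer asks for: even when a test only asserts the count, it can also verify that every feature with positive relevance ended up in the selection. The feature names and relevance values below are invented for illustration.

```python
# Illustrative relevance scores and a drop_zero selection with top_k = 2:
# the one relevant feature plus one zero-relevance feature as padding.
relevance = {"regressor_useful": 0.8, "regressor_noise": 0.0}
selected = {"regressor_useful", "regressor_noise"}

# Every feature with relevance > 0 must appear in the selection.
relevant = {f for f, r in relevance.items() if r > 0}
assert relevant <= selected
print(sorted(relevant))  # ['regressor_useful']
```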
CHANGELOG.md
Outdated
@@ -873,4 +874,4 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Distribution plot
- Anomalies (Outliers) plot
- Backtest (CrossValidation) plot
- Forecast plot
- Forecast plot
You probably broke newline here, let's fix that.
Aren't you going to fix that?
Before submitting (must do checklist)
Proposed Changes
Fix MRMRFeatureSelectionTransform: change score formula and add drop_zero mode.
Closing issues
Closes #308.