-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MeanEncoderTransform
#413
Conversation
🚀 Deployed on https://deploy-preview-413--etna-docs.netlify.app |
global_means = dict(zip(segments, global_means)) | ||
|
||
global_means_category = {} | ||
for segment in segments: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we, in theory, groupby by both "segment" and in_column
to get rid of this cycle over segments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It remains valid.
tests/test_transforms/test_encoders/test_mean_encoder_transform.py
Outdated
Show resolved
Hide resolved
tests/test_transforms/test_encoders/test_mean_encoder_transform.py
Outdated
Show resolved
Hide resolved
tests/test_transforms/test_encoders/test_mean_encoder_transform.py
Outdated
Show resolved
Hide resolved
tests/test_transforms/test_encoders/test_mean_encoder_transform.py
Outdated
Show resolved
Hide resolved
intersected_df.loc[segment_df.index, self.out_column] = feature | ||
if self.handle_missing is MissingMode.global_mean: | ||
nan_index = segment_df[segment_df[self.in_column].isnull()].index | ||
expanding_mean = y.expanding().mean().shift().fillna(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't very clear that first values are filled with 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
global_means = dict(zip(segments, global_means)) | ||
|
||
global_means_category = {} | ||
for segment in segments: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It remains valid.
import numpy as np | ||
import pandas as pd | ||
from bottleneck import nanmean | ||
from pandas import Timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any good reason for this import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this import? Can't we just use pd.Timestamp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
||
self._global_means: Optional[Union[float, Dict[str, float]]] = None | ||
self._global_means_category: Optional[Union[Dict[str, float], Dict[str, Dict[str, float]]]] = None | ||
self._last_timestamp: Optional[Timestamp] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should have type: Union[Timestamp, int, None
].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timestamp can be None in TSDataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timestamp can be int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could write Optional[Union[Timestamp, int]]
, but it probably easier to write Union[Timestamp, int, None]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
categories = pd.unique(df.loc[:, self.idx[:, self.in_column]].values.ravel()) | ||
|
||
cumstats = pd.DataFrame(data={"sum": 0, "count": 0, self.in_column: categories}) | ||
start_index = np.arange(0, len(timestamps) * n_segments, len(timestamps)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is it for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indexes in flatten df for one timestamp
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #413 +/- ##
===========================================
+ Coverage 9.61% 86.72% +77.10%
===========================================
Files 226 227 +1
Lines 15594 15753 +159
===========================================
+ Hits 1500 13662 +12162
+ Misses 14094 2091 -12003 ☔ View full report in Codecov by Sentry. |
Before submitting (must do checklist)
Proposed Changes
Closing issues
closes #12