Add MeanEncoderTransform #413

Merged
merged 15 commits into from
Jul 12, 2024

Conversation

egoriyaa
Collaborator

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

Closing issues

closes #12

github-actions bot commented Jun 25, 2024

🚀 Deployed on https://deploy-preview-413--etna-docs.netlify.app

global_means = dict(zip(segments, global_means))

global_means_category = {}
for segment in segments:
Collaborator

Can't we, in theory, group by both "segment" and in_column to get rid of this loop over segments?

Collaborator

This comment is still valid.
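
A minimal sketch of the suggestion in this thread, assuming a long-format frame with hypothetical column names `segment`, `category`, and `target`. The transform itself works on ETNA's wide format and its internals are not shown here, so this only illustrates the idea of grouping by both keys at once:

```python
import pandas as pd

# Hypothetical long-format data; the column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "segment": ["a", "a", "a", "b", "b", "b"],
        "category": ["x", "y", "x", "x", "x", "y"],
        "target": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)

# A single groupby over both keys computes every per-segment, per-category mean.
means = df.groupby(["segment", "category"])["target"].mean()

# Reshape the aggregated result into {segment: {category: mean}}.
# This iterates over the small aggregated Series, not over the raw data.
global_means_category = {
    segment: group.droplevel("segment").to_dict()
    for segment, group in means.groupby(level="segment")
}

print(global_means_category)
```

The aggregation happens in one vectorized call; only the final conversion to nested dictionaries touches the segments one by one.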

intersected_df.loc[segment_df.index, self.out_column] = feature
if self.handle_missing is MissingMode.global_mean:
    nan_index = segment_df[segment_df[self.in_column].isnull()].index
    expanding_mean = y.expanding().mean().shift().fillna(0)
Collaborator

It isn't very clear that the first values are filled with 0.

Collaborator Author

Fixed
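
To make the concern in this thread concrete, here is a small sketch of what the chained call does step by step. The series `y` is made-up data, and the thread does not show how the code was eventually changed, so this only illustrates where the 0 comes from:

```python
import pandas as pd

# Made-up target values for one segment.
y = pd.Series([2.0, 4.0, 6.0, 8.0])

# Mean of all values up to and including each position.
running_mean = y.expanding().mean()

# Shift by one so each position only sees strictly earlier values.
past_mean = running_mean.shift()

# The first position has no history, so it is NaN after the shift;
# fillna(0) replaces that NaN with 0, which is easy to miss in a one-liner.
expanding_mean = past_mean.fillna(0)

print(expanding_mean.tolist())
```

Splitting the chain into named steps, or adding a comment next to `fillna(0)`, is one way to make the fill value visible to readers.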

import numpy as np
import pandas as pd
from bottleneck import nanmean
from pandas import Timestamp
Collaborator

I don't see any good reason for this import.

Collaborator Author

I don't understand.

Collaborator

Why do we need this import? Can't we just use pd.Timestamp?

Collaborator Author

Fixed


self._global_means: Optional[Union[float, Dict[str, float]]] = None
self._global_means_category: Optional[Union[Dict[str, float], Dict[str, Dict[str, float]]]] = None
self._last_timestamp: Optional[Timestamp] = None
Collaborator

It should have type: Union[Timestamp, int, None].

Collaborator Author

Can the timestamp be None in TSDataset?

Collaborator

The timestamp can be an int.

Collaborator

You could write Optional[Union[Timestamp, int]], but it is probably easier to write Union[Timestamp, int, None].

Collaborator Author

Fixed
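
The two spellings discussed in this thread describe the same type. A short sketch, using a hypothetical class name `ExampleTransform` and referring to `pd.Timestamp` through the `pd` namespace as suggested in the earlier import thread:

```python
from typing import Optional, Union

import pandas as pd

# Both annotations accept a pandas Timestamp, an int, or None.
FlatForm = Union[pd.Timestamp, int, None]
NestedForm = Optional[Union[pd.Timestamp, int]]

# typing flattens nested unions, so the two forms compare equal.
same_type = FlatForm == NestedForm


class ExampleTransform:
    """Hypothetical stand-in for a transform that remembers the last seen timestamp."""

    def __init__(self) -> None:
        # None until fit; afterwards either a datetime-like or an integer timestamp.
        self._last_timestamp: Union[pd.Timestamp, int, None] = None


print(same_type)
```

Which form to use is a readability choice; the flat form avoids one level of nesting.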

categories = pd.unique(df.loc[:, self.idx[:, self.in_column]].values.ravel())

cumstats = pd.DataFrame(data={"sum": 0, "count": 0, self.in_column: categories})
start_index = np.arange(0, len(timestamps) * n_segments, len(timestamps))
Collaborator

What is it for?

Collaborator Author

These are the indices in the flattened df that correspond to one timestamp.
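
A small sketch of what `start_index` contains, assuming the flattened frame stores each segment as one contiguous block of `len(timestamps)` rows. That layout is an assumption made for illustration; the thread does not show how the frame is flattened:

```python
import numpy as np

# Made-up sizes for illustration.
timestamps = ["t0", "t1", "t2"]
n_segments = 4

# Position of the first timestamp of every segment in the flattened data.
start_index = np.arange(0, len(timestamps) * n_segments, len(timestamps))

# Adding an offset k selects the rows of timestamp k across all segments.
rows_for_t1 = start_index + 1

# Flattened values laid out segment by segment (assumed layout).
flat_values = np.arange(len(timestamps) * n_segments)
values_at_t1 = flat_values[rows_for_t1]

print(start_index, rows_for_t1)
```

Under this layout, one array of offsets is enough to address the same timestamp in every segment without reshaping the data.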

@egoriyaa egoriyaa requested a review from d-a-bunin July 12, 2024 07:54
d-a-bunin previously approved these changes Jul 12, 2024
@egoriyaa egoriyaa self-assigned this Jul 12, 2024
codecov bot commented Jul 12, 2024

Codecov Report

Attention: Patch coverage is 97.88732% with 3 lines in your changes missing coverage. Please review.

Project coverage is 86.72%. Comparing base (4a8bbb5) to head (86d0cab).
Report is 2 commits behind head on master.

Files                                       Patch %   Lines
etna/transforms/encoders/mean_encoder.py    97.85%    3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #413       +/-   ##
===========================================
+ Coverage    9.61%   86.72%   +77.10%     
===========================================
  Files         226      227        +1     
  Lines       15594    15753      +159     
===========================================
+ Hits         1500    13662    +12162     
+ Misses      14094     2091    -12003     

@egoriyaa egoriyaa requested a review from d-a-bunin July 12, 2024 13:01
@egoriyaa egoriyaa merged commit 12f19fb into master Jul 12, 2024
16 checks passed
@egoriyaa egoriyaa deleted the issue-12 branch September 9, 2024 13:00