[ENH] Dispatch aggregate, refactor AggJoiner & AggTarget #1116

Open
wants to merge 72 commits into
base: main

Contributor

@TheooJ TheooJ commented Oct 17, 2024

The goal of this PR is to dispatch aggregate, currently written in two files, by directly implementing it in _agg_joiner.py.

Following discussions with @Vincent-Maladiere and @jeromedockes, AggJoiner and AggTarget now require the operations parameter and will try to apply all operations on all columns. This differs from the current behavior, where columns are split into categorical and numeric and only some operations are computed on each category.
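
To illustrate the "all operations on all columns" behavior, here is a minimal pure-Python sketch. It is not the code in this PR: `aggregate`, the `OPERATIONS` table, and the `<col>_<op>` output naming are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stand-in for the dispatched aggregate: plain Python, not skrub code.
OPERATIONS = {"mean": mean, "min": min, "max": max, "count": len}


def aggregate(rows, key, cols, operations):
    """Group `rows` (a list of dicts) by `key`, then apply every operation to every column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for group_key, group_rows in groups.items():
        out = {}
        for col in cols:
            values = [r[col] for r in group_rows]
            for op in operations:
                # The "<col>_<op>" output name is an assumption made for this sketch.
                out[f"{col}_{op}"] = OPERATIONS[op](values)
        result[group_key] = out
    return result


rows = [
    {"user": "a", "price": 10.0, "qty": 1},
    {"user": "a", "price": 30.0, "qty": 3},
    {"user": "b", "price": 5.0, "qty": 2},
]
agg = aggregate(rows, key="user", cols=["price", "qty"], operations=["mean", "max"])
```

No column is routed to a subset of operations based on its dtype here; every requested operation is attempted on every selected column.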

I’m planning follow-ups to completely remove the _pandas.py, _polars.py, and _namespace.py files, and to refactor AggTarget with cross-fitting.

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Hey @TheooJ, very nice effort! Here is a first pass of comments.

On a higher level:

  1. Since you use the CheckInputDataFrame class in AggJoiner.fit_transform, you should be able to remove the _check_dataframes method entirely. And since _check_inputs currently calls _check_dataframes, you have to place them in reverse order:
     self._main_check_input = CheckInputDataFrame()
     X = self._main_check_input.fit_transform(X)
     self._check_inputs(X)

Additionally, the check_inputs method of the AggTarget could be simplified because CheckInputDataFrame does most of the checks.

  2. You should add a get_feature_names_out method that returns self.all_outputs_ for both AggJoiner and AggTarget.

  3. I think we should allow key, aux_key and main_key to be selectors.
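
A minimal sketch of the call order and the `get_feature_names_out` accessor described above, using toy stand-ins rather than skrub's actual classes. `CheckInput` and `ToyAggJoiner` are hypothetical names, tables are modeled as plain dicts, and the `<col>_<op>` naming is an assumption.

```python
class CheckInput:
    """Hypothetical stand-in for CheckInputDataFrame; it only validates the container type."""

    def fit_transform(self, X):
        if not isinstance(X, dict):
            raise TypeError("X must be a dict mapping column names to lists")
        self.columns_ = list(X)
        return X


class ToyAggJoiner:
    """Toy transformer, not skrub's AggJoiner: shows the call order and the accessor."""

    def __init__(self, key, operations):
        self.key = key
        self.operations = operations

    def _check_inputs(self, X):
        # Runs after the dataframe check, so X is already known to be table-like.
        if self.key not in X:
            raise ValueError(f"key {self.key!r} not found in X")

    def fit_transform(self, X):
        self._main_check_input = CheckInput()
        X = self._main_check_input.fit_transform(X)
        self._check_inputs(X)
        # Aggregation itself is omitted; only the output names are computed.
        added = [f"{c}_{op}" for c in X if c != self.key for op in self.operations]
        self.all_outputs_ = list(X) + added
        return X

    def get_feature_names_out(self):
        return self.all_outputs_


joiner = ToyAggJoiner(key="user", operations=["mean"])
joiner.fit_transform({"user": ["a", "b"], "price": [1.0, 2.0]})
names = joiner.get_feature_names_out()
```

Running the dataframe check first means `_check_inputs` no longer needs its own dataframe validation.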

Contributor Author

TheooJ commented Nov 5, 2024

I think the PR is ready for a second pass. I probably need to rename it and add a few things to the changelog.

@Vincent-Maladiere I addressed points 1 & 2: simplifying AggTarget's _check_dataframes (it was more of a complete RFC of AggTarget) and adding a get_feature_names_out method. I didn't implement point 3 in this PR because I'm not sure it's a good idea to tell users "you should use selectors to choose keys". It shouldn't be a blocker for the review, though, and it's quite a minor change IMO.

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Nice effort! I think we are almost done here

skrub/_agg_joiner.py
@@ -85,19 +173,10 @@ class AggJoiner(TransformerMixin, BaseEstimator):
the join operation.
If `aux_key` is an iterable, we will perform a multi-column join.
- cols : str or iterable of str, default=None
+ cols : str or iterable of str or selector, default=s.all()
Member

Should we clarify what selector means?

Suggested change
- cols : str or iterable of str or selector, default=s.all()
+ cols : str or iterable of str or skrub selector, default=s.all()

Member

At this point the _selectors module is still private, right? Maybe it should be made public if we document this here?

Contributor Author

I wonder how @GaelVaroquaux feels about this.
For context, we allowed cols to be a selector in AggJoiner and AggTarget.
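
As a rough illustration of what accepting "str or iterable of str or selector" involves, here is a sketch that normalizes the three forms into a list of column names. It does not use skrub's private _selectors module: `resolve_cols` and `all_columns` are hypothetical, and a selector is modeled as a plain callable, which is an assumption.

```python
def all_columns():
    """Hypothetical selector: a callable that picks every column."""
    return lambda columns: list(columns)


def resolve_cols(cols, columns):
    """Turn `cols` (str, iterable of str, or selector) into a list of column names."""
    if isinstance(cols, str):
        selected = [cols]
    elif callable(cols):
        # In this sketch a selector is any callable taking the available column names.
        selected = cols(columns)
    else:
        selected = list(cols)
    missing = [c for c in selected if c not in columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    return selected


columns = ["user", "price", "qty"]
from_str = resolve_cols("price", columns)
from_list = resolve_cols(["price", "qty"], columns)
from_selector = resolve_cols(all_columns(), columns)
```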

Member

It might be a good time to start writing the dev doc in another PR and fixing #991

@Vincent-Maladiere
Member

Sorry, there's some duplication with Jerome's feedback; we reviewed it simultaneously.

@jeromedockes jeromedockes added this to the 0.3.2 milestone Nov 7, 2024
@TheooJ TheooJ changed the title from "Dispatch aggregate" to "[ENH] Dispatch aggregate, refactor AggJoiner & AggTarget" Nov 8, 2024
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Notwithstanding the discussion regarding the documentation of skrub selectors (which may be done in another PR), this LGTM! Congratulations @TheooJ :)

Let's aim to remove _namespace.py, _polars.py, and _pandas.py in a follow-up PR!

Contributor Author

TheooJ commented Nov 13, 2024

After discussion with @GaelVaroquaux, it was decided we won't make selectors public here. Let's open a meta issue on their support and documentation first

> Let's aim to remove _namespace.py, _polars.py, and _pandas in a follow-up PR!

I'm on it :)
