-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Dispatch aggregate, refactor AggJoiner & AggTarget #1116
base: main
Are you sure you want to change the base?
Conversation
…dle default values
…r in polars groupby
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @TheooJ, very nice effort! Here is a first pass of comments.
On a higher level:
- Since you use the
CheckInputDataFrame
class inAggJoiner.fit_transform
, you should be able to remove the_check_dataframes
method entirely. And since_check_inputs
currently calls_check_dataframes
, you have to place them in reverse order:
self._main_check_input = CheckInputDataFrame()
X = self._main_check_input.fit_transform(X)
self._check_inputs(X)
Additionally, the check_inputs
method of the AggTarget
could be simplified because CheckInputDataFrame
does most of the checks.
-
You should add a
get_feature_names_out
method that returnsself.all_outputs_
for bothAggJoiner
andAggTarget
. -
I think we should allow
key
,aux_key
andmain_key
to be selectors.
If think the PR is good for a second pass. I probably need to rename it and add some extra things in the changelog. @Vincent-Maladiere I addressed points 1 & 2: simplifying |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice effort! I think we are almost done here
skrub/_agg_joiner.py
Outdated
@@ -85,19 +173,10 @@ class AggJoiner(TransformerMixin, BaseEstimator): | |||
the join operation. | |||
If `aux_key` is an iterable, we will perform a multi-column join. | |||
cols : str or iterable of str, default=None | |||
cols : str or iterable of str or selector, default=s.all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we clarify what selector means?
cols : str or iterable of str or selector, default=s.all() | |
cols : str or iterable of str or skrub selector, default=s.all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at this point the _selectors
module is still private right? maybe it should be made public if we document this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder how @GaelVaroquaux feels about this
For context, we allowed cols
to be a selector in AggJoiner
and AggTarget
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a good time to start writing the dev doc in another PR and fixing #991
Sorry, there's some duplicate with Jerome's feedback, we reviewed it simultaneously |
Co-authored-by: Vincent M <[email protected]> Co-authored-by: Jérôme Dockès <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notwithstanding the discussion regarding the documentation of skrub selector (which may be done in another PR), this LGTM! Congratulations @TheooJ :)
Let's aim to remove _namespace.py
, _polars.py
, and _pandas
in a follow-up PR!
After discussion with @GaelVaroquaux, it was decided we won't make selectors public here. Let's open a meta issue on their support and documentation first
I'm on it :) |
The goal of this PR is to dispatch
aggregate
, currently written in two files, by directly implementing it in_agg_joiner.py
.Following discussions with @Vincent-Maladiere and @jeromedockes,
AggJoiner
andAggTarget
now require theoperations
parameter by default, and will try to apply all operations on all columns — as opposed to now, where columns are separated in categorical and numeric and only some operations are computed on each category.I’m planning on doing follow ups to completely remove the _pandas.py, _polars.py, _namespace.py files, and on refactoring
AggTarget
with cross-fitting.