Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

Closed
FBruzzesi opened this issue May 7, 2024 · 23 comments
Closed

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

FBruzzesi opened this issue May 7, 2024 · 23 comments
Labels
enhancement New feature or request

Comments

@FBruzzesi
Copy link
Collaborator

FBruzzesi commented May 7, 2024

Description

Creating this issue to keep track of which classes/function could benefit from adopting Narwhals.

Class/Function/Module Status Related Notes
preprocessing.ColumnDropper Solved in #655
preprocessing.ColumnSelector Solved in #659
preprocessing.PandasTypeSelector Solved in #670 Consider creating another class TypeSelector and deprecate this one
common.TrainOnlyTransformerMixin 🔲 Uses specific pandas hashing functionality. I wonder how crucial is to hash the index as well. If it's not we could just use .to_numpy() and hash the array data?
model_selection.TimeGapSplit Solved in #668
model_selection.GroupedTimeSeriesSplit 🔲 Related to #605
projections.InformationFilter Solved in commit
meta.RegressionOutlierDetector Solved in #665
meta.hierarchical_predictor.py Solved in #667
meta.grouped_transformer.py Solved in #667
meta.grouped_predictor.py Solved in #667
linear_models._FairClassifier Solved in #669
pandas_utils.py 🚧 Partially #661 (*)
datasets.py 🚫 It would require read_csv function

Personally I would wait to have at least preprocessing.pandastransformers.py entire migration before bumping to v0.9.0.

cc @MarcoGorelli @anopsy

(*) Regarding pandas_utils

Changing log_step to narwhals is fairly easy (~4 lines of code), however as this decorator is supposed to work for any function that operates on pandas, doing so would limit its functionality. It could be reasonable to add another one which although restricted to narwhals methods, it can interoperate with all its compatible dataframes.

Legend

✅ Done
🚧 WIP
🔲 Not Started
🚫 Won't do

@FBruzzesi FBruzzesi added the enhancement New feature or request label May 7, 2024
@anopsy
Copy link
Contributor

anopsy commented May 7, 2024

I'm on it

@MarcoGorelli
Copy link
Contributor

Thanks for making this tracker!

I think grouping shouldn't be an issue, there's an example of that here: https://narwhals-dev.github.io/narwhals/basics/dataframe/#example-2-group-by-and-mean

If I remember correctly, TimeGapSplit relies on index alignment, so that one may need a bit more discussion

@anopsy anopsy mentioned this issue May 8, 2024
9 tasks
@DeaMariaLeon
Copy link
Contributor

Before this (nice) tracker was opened, I was taking a look at using Narwhals on _FairClassifier per Marco's suggestion. So, I guess I should change that?

Also, I'm trying to run the tests but I'm having issues installing what's needed. Specifically, pystest is complaining becausecvxpyis not installed. That can't be installed because "Could not build wheels for osqp". I believe there is an issue with the macs with M1/M2. Where can I ask about it? (I don't find a better place than here, sorry)..
Friendly ping to @FBruzzesi, just in case. :-)

@FBruzzesi
Copy link
Collaborator Author

Hey @DeaMariaLeon, thanks for the ping. I forgot to add _FairClassifier to the list 🙈 Let me fix that in a moment.

Regarding cvxpy, I tried to do a quick search, but couldn't find related material. Maybe osqp build from source guide could help. Keep me posted on that 😇

@koaning
Copy link
Owner

koaning commented May 8, 2024

One thing about that FairClassifier ... there may be a solid reason to consider porting that over to fairlearn at some point. The thinking here is that while scikit-lego is a cool place for somewhat experimental and fun components ... fairness might be a more serious topic so it might be a better home for our fairness tools. No formal decision has ever been made regarding this but it was something that was always in the back of people's mind.

@DeaMariaLeon
Copy link
Contributor

I'll leave FairClassifier for later then. What I noticed is that check_X_y must be kept, as it checks that X is not a sparse matrix (IIUC). The implementation gets more complicated..

Regarding cvxpy, thanks for the help @FBruzzesi. Everything is installed now... But when running pytest, there are 14 errors. I wonder if I should ignore that (Running tests removing my changes):

ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes0-prior0-cfm0-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes1-prior1-cfm1-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes2-prior2-cfm2-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes3-prior3-cfm3-0.985]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes4-prior4-cfm4-1.0]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes5-prior5-cfm5-0.2]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes6-prior6-cfm6-0.1]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes7-prior7-cfm7-0.696]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[predict_proba-expected_probas0]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[confusion_matrix-expected_probas1]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[both-expected_probas2]
ERROR tests/test_preprocessing/test_outlier_remover.py::test_no_outliers
ERROR tests/test_preprocessing/test_outlier_remover.py::test_remove_outlier
ERROR tests/test_preprocessing/test_outlier_remover.py::test_do_not_refit

@DeaMariaLeon
Copy link
Contributor

Maybe we should write which function we are working on, so we don't overlap?

I would like to work on PandasTypeSelector if that's ok.

@DeaMariaLeon
Copy link
Contributor

Scikit-learn has a very similar function than PandasTypeSelector: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html

Is TypesSelector still needed then? (I imagine that it is, just making sure here).

@koaning
Copy link
Owner

koaning commented May 8, 2024

Some of those tools were added 5-6 years ago back when sklearn did not support those features. Scikit is a lot further now but on the scikit-lego side of things we've kept things around since we didn't really see a reason to remove them. Since we're interested in giving narwhals a spin here I'd argue it'd probably be good to keep them around if only because the implementation exercise will also demonstrate that it can be used in other projects.

@DeaMariaLeon
Copy link
Contributor

Narwhals is not ready to easily work for PandasTypeSelector yet.. per Marco's suggestion I went back to work with linear_models._FairClassifier.
@FBruzzesi

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented May 11, 2024

Been looking into TimeGapSplit:

  • cv = TimeGapSplit(date_serie) is instantiated using a pd.Series which contains dates as values, and some index
  • cv.split(X, y) then aligns X with the index of date_serie

However, other than the index, X has no knowledge of how it relates to date_serie, so... 🤔 not sure how we're going to get this one to work for index-less libraries. Maybe for those, require that X is the same length as date_serie, and then align positionally?

@FBruzzesi
Copy link
Collaborator Author

@DeaMariaLeon sure, no problem for me! Let me know if I can help on the Narwhals side to make the missing bits available.
@MarcoGorelli TimeGapSplit does a few things in a bit of unconventional way imho.

Maybe for those, require that X is the same length as date_serie, and then align positionally?

This is a good assumption in my opinion (but I am biased as I do the same in timebasedcv)

@FBruzzesi
Copy link
Collaborator Author

@MarcoGorelli I may have made a commit bypassing review by accident 🙈. Could you take a look at it when you have the time?

It should make InformationFilter dataframe agnostic

@MarcoGorelli
Copy link
Contributor

It should make InformationFilter dataframe agnostic

Nice, looks good! looks like it preserves the existing logic

This has been quite a group effort, and much progress has been made - at what point do you think a release might be warranted? I'm presenting Narwhals at PyCon Italia in the 23rd of May (in 12 days), would be pretty cool if I could give a live demo of scikit-lego 😎 I understand if you'd rather wait until there's a bit more though, no pressure!

@FBruzzesi
Copy link
Collaborator Author

@MarcoGorelli that would be grand! Let's try to make a release within this week (or next weekend at latest?!).

Personally I would love to be able to replace PandasTypeSelector as well and have a "Hey we replaced the entire dataframe module to be agnostic"-release 😂

@koaning
Copy link
Owner

koaning commented May 12, 2024

would be pretty cool if I could give a live demo of scikit-lego 😎 I understand if you'd rather wait until there's a bit more though, no pressure!

@MarcoGorelli the stuff that's already been done seems sufficient to me as a "demo". It's mainly to report that folks are indeed working on the integration and that sofar it seems to work.

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented May 12, 2024

Totally! I just meant, it would be nice to be able to demo this result:

In [15]: %timeit add_lags(df, ['a', 'b', 'c'], [1, 2, 3, 4, 5])
4.35 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [16]: %timeit pl.from_pandas(add_lags(df.to_pandas(), ['a', 'b', 'c'], [1, 2, 3, 4, 5]))
140 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: type(df)
Out[17]: polars.dataframe.frame.DataFrame

and say something like "This is why "just convert to pandas" isn't a satisfying solution. Native Polars support brings you library to another level - you can use this today, go out and pip install -U sklego and pass in Polars dataframes as you please!"

Let's try to make a release within this week (or next weekend at latest?!).

Sounds great! Given the pace at which things are moving, it may be realistic to have the whole thing done by then. If in a couple of places, non-pandas input is converted to pandas under the hood (like the grouped predictor in #667) then TBH I think that's fine as a temporary solution. And I don't think it would be temporary for too long 🚀

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented May 13, 2024

Consider creating another class TypeSelector and deprecate this one [PandasTypeSelector]

Been thinking about this one. I think what would be good to aim for would be:

  • backwards compatibility - pandas users currently using PandasTypeSelector shouldn't notice the change
  • simple to explain for non-pandas users
  • a single, consistent API

One suggestion I have is:

  • if the backing dataframe is pandas, then whatever the user passes goes straight to select_dtypes (no change from the status quo)
  • else, then the user can pass one of the following strings (and only these are supported)
    • 'number'
    • 'string'
    • 'date'
    • 'datetime'
    • 'datetimetz'
    • 'timedelta'
    • 'category'
    • 'int64', 'int32', 'int16', 'int8', 'float64', 'float32', 'uint64', 'uint32', 'uint16', 'uint8'
    • 'bool'

Like this, scikit-lego keeps the same API for pandas and non-pandas users. For pandas, it keeps using select_dtypes (so, perfect backwards compatibility), whereas for non-pandas it uses Narwhals' selectors.by_dtype (which I still need to implement, but it shouldn't be too bad) under the hood but keeps the existing user-facing API


An alternative would be to let users passing in Narwhals dtypes, but I'm not sure that end users should know about Narwhals 🤔

@FBruzzesi
Copy link
Collaborator Author

I just merged #669 and #670. I would argue we are in a very good position for a release, I will open a PR to merge the dev branch into main and bump version.

I am sorry that I am a bit stuck on the GroupedPredictor for now, but it will get there soon(ish) 😁

@MarcoGorelli
Copy link
Contributor

cool, thanks!

granted, I'm slightly biased as I'm quite keen to present this as "you can try this today!" at the conference, but to be honest I don't think GroupedPredictor should be a blocker - we can definitely solve it, I don't see any theoretical reasons we can't do it, it'll just take a little longer

@FBruzzesi
Copy link
Collaborator Author

Closed by #671
Thanks everyone! This was a huge effort, very appreciated and we are quite enthusiastic by the result!

@koaning
Copy link
Owner

koaning commented May 25, 2024

Just so folks know, it's all live now. Might make sense to see if we can collaborate on making a splash with an announcement?

@MarcoGorelli
Copy link
Contributor

Might make sense to see if we can collaborate on making a splash with an announcement?

Yup! The blog post is now live: https://labs.quansight.org/blog/scikit-lego-narwhals

Some posts I made:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants