[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

FBruzzesi · 2024-05-07T11:17:02Z

Description

Creating this issue to keep track of which classes/function could benefit from adopting Narwhals.

Class/Function/Module	Status	Related	Notes
`preprocessing.ColumnDropper`	✅	Solved in #655
`preprocessing.ColumnSelector`	✅	Solved in #659
`preprocessing.PandasTypeSelector`	✅	Solved in #670	Consider creating another class `TypeSelector` and deprecate this one
`common.TrainOnlyTransformerMixin`	🔲		Uses specific pandas hashing functionality. I wonder how crucial is to hash the index as well. If it's not we could just use `.to_numpy()` and hash the array data?
`model_selection.TimeGapSplit`	✅	Solved in #668
`model_selection.GroupedTimeSeriesSplit`	🔲		Related to #605
`projections.InformationFilter`	✅	Solved in commit
`meta.RegressionOutlierDetector`	✅	Solved in #665
`meta.hierarchical_predictor.py`	✅	Solved in #667
`meta.grouped_transformer.py`	✅	Solved in #667
`meta.grouped_predictor.py`	✅	Solved in #667
`linear_models._FairClassifier`	✅	Solved in #669
`pandas_utils.py`	🚧	Partially #661	(*)
`datasets.py`	🚫		It would require `read_csv` function

Personally I would wait to have at least preprocessing.pandastransformers.py entire migration before bumping to v0.9.0.

cc @MarcoGorelli @anopsy

(*) Regarding `pandas_utils`

Changing log_step to narwhals is fairly easy (~4 lines of code), however as this decorator is supposed to work for any function that operates on pandas, doing so would limit its functionality. It could be reasonable to add another one which although restricted to narwhals methods, it can interoperate with all its compatible dataframes.

Legend

✅ Done
🚧 WIP
🔲 Not Started
🚫 Won't do

The text was updated successfully, but these errors were encountered:

anopsy · 2024-05-07T11:33:53Z

I'm on it

MarcoGorelli · 2024-05-07T12:04:45Z

Thanks for making this tracker!

I think grouping shouldn't be an issue, there's an example of that here: https://narwhals-dev.github.io/narwhals/basics/dataframe/#example-2-group-by-and-mean

If I remember correctly, TimeGapSplit relies on index alignment, so that one may need a bit more discussion

DeaMariaLeon · 2024-05-08T11:13:50Z

Before this (nice) tracker was opened, I was taking a look at using Narwhals on _FairClassifier per Marco's suggestion. So, I guess I should change that?

Also, I'm trying to run the tests but I'm having issues installing what's needed. Specifically, pystest is complaining becausecvxpyis not installed. That can't be installed because "Could not build wheels for osqp". I believe there is an issue with the macs with M1/M2. Where can I ask about it? (I don't find a better place than here, sorry)..
Friendly ping to @FBruzzesi, just in case. :-)

FBruzzesi · 2024-05-08T11:27:23Z

Hey @DeaMariaLeon, thanks for the ping. I forgot to add _FairClassifier to the list 🙈 Let me fix that in a moment.

Regarding cvxpy, I tried to do a quick search, but couldn't find related material. Maybe osqp build from source guide could help. Keep me posted on that 😇

koaning · 2024-05-08T12:22:16Z

One thing about that FairClassifier ... there may be a solid reason to consider porting that over to fairlearn at some point. The thinking here is that while scikit-lego is a cool place for somewhat experimental and fun components ... fairness might be a more serious topic so it might be a better home for our fairness tools. No formal decision has ever been made regarding this but it was something that was always in the back of people's mind.

DeaMariaLeon · 2024-05-08T13:33:58Z

I'll leave FairClassifier for later then. What I noticed is that check_X_y must be kept, as it checks that X is not a sparse matrix (IIUC). The implementation gets more complicated..

Regarding cvxpy, thanks for the help @FBruzzesi. Everything is installed now... But when running pytest, there are 14 errors. I wonder if I should ignore that (Running tests removing my changes):

ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes0-prior0-cfm0-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes1-prior1-cfm1-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes2-prior2-cfm2-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes3-prior3-cfm3-0.985]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes4-prior4-cfm4-1.0]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes5-prior5-cfm5-0.2]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes6-prior6-cfm6-0.1]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes7-prior7-cfm7-0.696]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[predict_proba-expected_probas0]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[confusion_matrix-expected_probas1]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[both-expected_probas2]
ERROR tests/test_preprocessing/test_outlier_remover.py::test_no_outliers
ERROR tests/test_preprocessing/test_outlier_remover.py::test_remove_outlier
ERROR tests/test_preprocessing/test_outlier_remover.py::test_do_not_refit

DeaMariaLeon · 2024-05-08T14:05:46Z

Maybe we should write which function we are working on, so we don't overlap?

I would like to work on PandasTypeSelector if that's ok.

DeaMariaLeon · 2024-05-08T14:47:44Z

Scikit-learn has a very similar function than PandasTypeSelector: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html

Is TypesSelector still needed then? (I imagine that it is, just making sure here).

koaning · 2024-05-08T14:58:40Z

Some of those tools were added 5-6 years ago back when sklearn did not support those features. Scikit is a lot further now but on the scikit-lego side of things we've kept things around since we didn't really see a reason to remove them. Since we're interested in giving narwhals a spin here I'd argue it'd probably be good to keep them around if only because the implementation exercise will also demonstrate that it can be used in other projects.

DeaMariaLeon · 2024-05-10T12:28:33Z

Narwhals is not ready to easily work for PandasTypeSelector yet.. per Marco's suggestion I went back to work with linear_models._FairClassifier.
@FBruzzesi

MarcoGorelli · 2024-05-11T14:25:27Z

Been looking into TimeGapSplit:

cv = TimeGapSplit(date_serie) is instantiated using a pd.Series which contains dates as values, and some index
cv.split(X, y) then aligns X with the index of date_serie

However, other than the index, X has no knowledge of how it relates to date_serie, so... 🤔 not sure how we're going to get this one to work for index-less libraries. Maybe for those, require that X is the same length as date_serie, and then align positionally?

FBruzzesi · 2024-05-11T16:07:38Z

@DeaMariaLeon sure, no problem for me! Let me know if I can help on the Narwhals side to make the missing bits available.
@MarcoGorelli TimeGapSplit does a few things in a bit of unconventional way imho.

Maybe for those, require that X is the same length as date_serie, and then align positionally?

This is a good assumption in my opinion (but I am biased as I do the same in timebasedcv)

FBruzzesi · 2024-05-11T16:54:01Z

@MarcoGorelli I may have made a commit bypassing review by accident 🙈. Could you take a look at it when you have the time?

It should make InformationFilter dataframe agnostic

MarcoGorelli · 2024-05-11T20:00:36Z

It should make InformationFilter dataframe agnostic

Nice, looks good! looks like it preserves the existing logic

This has been quite a group effort, and much progress has been made - at what point do you think a release might be warranted? I'm presenting Narwhals at PyCon Italia in the 23rd of May (in 12 days), would be pretty cool if I could give a live demo of scikit-lego 😎 I understand if you'd rather wait until there's a bit more though, no pressure!

FBruzzesi · 2024-05-12T09:31:08Z

@MarcoGorelli that would be grand! Let's try to make a release within this week (or next weekend at latest?!).

Personally I would love to be able to replace PandasTypeSelector as well and have a "Hey we replaced the entire dataframe module to be agnostic"-release 😂

koaning · 2024-05-12T11:05:11Z

would be pretty cool if I could give a live demo of scikit-lego 😎 I understand if you'd rather wait until there's a bit more though, no pressure!

@MarcoGorelli the stuff that's already been done seems sufficient to me as a "demo". It's mainly to report that folks are indeed working on the integration and that sofar it seems to work.

MarcoGorelli · 2024-05-12T14:29:16Z

Totally! I just meant, it would be nice to be able to demo this result:

In [15]: %timeit add_lags(df, ['a', 'b', 'c'], [1, 2, 3, 4, 5])
4.35 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [16]: %timeit pl.from_pandas(add_lags(df.to_pandas(), ['a', 'b', 'c'], [1, 2, 3, 4, 5]))
140 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: type(df)
Out[17]: polars.dataframe.frame.DataFrame

and say something like "This is why "just convert to pandas" isn't a satisfying solution. Native Polars support brings you library to another level - you can use this today, go out and pip install -U sklego and pass in Polars dataframes as you please!"

Let's try to make a release within this week (or next weekend at latest?!).

Sounds great! Given the pace at which things are moving, it may be realistic to have the whole thing done by then. If in a couple of places, non-pandas input is converted to pandas under the hood (like the grouped predictor in #667) then TBH I think that's fine as a temporary solution. And I don't think it would be temporary for too long 🚀

MarcoGorelli · 2024-05-13T18:59:52Z

Consider creating another class TypeSelector and deprecate this one [PandasTypeSelector]

Been thinking about this one. I think what would be good to aim for would be:

backwards compatibility - pandas users currently using PandasTypeSelector shouldn't notice the change
simple to explain for non-pandas users
a single, consistent API

One suggestion I have is:

if the backing dataframe is pandas, then whatever the user passes goes straight to select_dtypes (no change from the status quo)
else, then the user can pass one of the following strings (and only these are supported)
- 'number'
- 'string'
- 'date'
- 'datetime'
- 'datetimetz'
- 'timedelta'
- 'category'
- 'int64', 'int32', 'int16', 'int8', 'float64', 'float32', 'uint64', 'uint32', 'uint16', 'uint8'
- 'bool'

Like this, scikit-lego keeps the same API for pandas and non-pandas users. For pandas, it keeps using select_dtypes (so, perfect backwards compatibility), whereas for non-pandas it uses Narwhals' selectors.by_dtype (which I still need to implement, but it shouldn't be too bad) under the hood but keeps the existing user-facing API

An alternative would be to let users passing in Narwhals dtypes, but I'm not sure that end users should know about Narwhals 🤔

FBruzzesi · 2024-05-18T08:15:23Z

I just merged #669 and #670. I would argue we are in a very good position for a release, I will open a PR to merge the dev branch into main and bump version.

I am sorry that I am a bit stuck on the GroupedPredictor for now, but it will get there soon(ish) 😁

MarcoGorelli · 2024-05-18T11:21:38Z

cool, thanks!

granted, I'm slightly biased as I'm quite keen to present this as "you can try this today!" at the conference, but to be honest I don't think GroupedPredictor should be a blocker - we can definitely solve it, I don't see any theoretical reasons we can't do it, it'll just take a little longer

FBruzzesi · 2024-05-25T08:03:46Z

Closed by #671
Thanks everyone! This was a huge effort, very appreciated and we are quite enthusiastic by the result!

koaning · 2024-05-25T11:10:50Z

Just so folks know, it's all live now. Might make sense to see if we can collaborate on making a splash with an announcement?

MarcoGorelli · 2024-06-18T12:50:18Z

Might make sense to see if we can collaborate on making a splash with an announcement?

Yup! The blog post is now live: https://labs.quansight.org/blog/scikit-lego-narwhals

Some posts I made:

FBruzzesi added the enhancement New feature or request label May 7, 2024

anopsy mentioned this issue May 8, 2024

Agnostic ColumnSelector #659

Merged

9 tasks

MarcoGorelli mentioned this issue May 12, 2024

Make Timegapsplit dataframe-agnostic #668

Merged

9 tasks

DeaMariaLeon mentioned this issue May 13, 2024

make FairClassifier data-agnostic #669

Merged

9 tasks

MarcoGorelli mentioned this issue May 14, 2024

Make PandasTypeSelector selector dataframe-agnostic #670

Merged

9 tasks

FBruzzesi mentioned this issue May 18, 2024

feat: Narwhals for dataframe-agnostic codebase #671

Merged

FBruzzesi mentioned this issue May 19, 2024

feat: Make grouped and hierarchical dataframe-agnostic #667

Merged

FBruzzesi closed this as completed May 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

FBruzzesi commented May 7, 2024 •

edited

Loading

anopsy commented May 7, 2024

MarcoGorelli commented May 7, 2024

DeaMariaLeon commented May 8, 2024

FBruzzesi commented May 8, 2024

koaning commented May 8, 2024

DeaMariaLeon commented May 8, 2024

DeaMariaLeon commented May 8, 2024

DeaMariaLeon commented May 8, 2024

koaning commented May 8, 2024

DeaMariaLeon commented May 10, 2024

MarcoGorelli commented May 11, 2024 •

edited

Loading

FBruzzesi commented May 11, 2024

FBruzzesi commented May 11, 2024

MarcoGorelli commented May 11, 2024

FBruzzesi commented May 12, 2024

koaning commented May 12, 2024

MarcoGorelli commented May 12, 2024 •

edited

Loading

MarcoGorelli commented May 13, 2024 •

edited

Loading

FBruzzesi commented May 18, 2024

MarcoGorelli commented May 18, 2024

FBruzzesi commented May 25, 2024

koaning commented May 25, 2024

MarcoGorelli commented Jun 18, 2024

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

Comments

FBruzzesi commented May 7, 2024 • edited Loading

Description

(*) Regarding pandas_utils

anopsy commented May 7, 2024

MarcoGorelli commented May 7, 2024

DeaMariaLeon commented May 8, 2024

FBruzzesi commented May 8, 2024

koaning commented May 8, 2024

DeaMariaLeon commented May 8, 2024

DeaMariaLeon commented May 8, 2024

DeaMariaLeon commented May 8, 2024

koaning commented May 8, 2024

DeaMariaLeon commented May 10, 2024

MarcoGorelli commented May 11, 2024 • edited Loading

FBruzzesi commented May 11, 2024

FBruzzesi commented May 11, 2024

MarcoGorelli commented May 11, 2024

FBruzzesi commented May 12, 2024

koaning commented May 12, 2024

MarcoGorelli commented May 12, 2024 • edited Loading

MarcoGorelli commented May 13, 2024 • edited Loading

FBruzzesi commented May 18, 2024

MarcoGorelli commented May 18, 2024

FBruzzesi commented May 25, 2024

koaning commented May 25, 2024

MarcoGorelli commented Jun 18, 2024

FBruzzesi commented May 7, 2024 •

edited

Loading

(*) Regarding `pandas_utils`

MarcoGorelli commented May 11, 2024 •

edited

Loading

MarcoGorelli commented May 12, 2024 •

edited

Loading

MarcoGorelli commented May 13, 2024 •

edited

Loading