MAINT use composition in TableVectorizer #675

glemaitre · 2023-07-21T09:47:33Z

closes #660

This is a PoC for #660 that internally uses a ColumnTransformer instead of inheriting from it.
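
For readers skimming the thread, here is a minimal sketch of what "composition instead of inheritance" could look like. `ComposedVectorizer` is a hypothetical, heavily simplified stand-in and not the code in this PR; the column split by dtype and the choice of `StandardScaler` and `OneHotEncoder` are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ComposedVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical, simplified stand-in for TableVectorizer (illustration only)."""

    def fit(self, X, y=None):
        self.fit_transform(X, y)
        return self

    def fit_transform(self, X, y=None):
        # Assumed column assignment: numeric dtypes vs. everything else.
        numeric = X.select_dtypes(include="number").columns.tolist()
        categorical = [col for col in X.columns if col not in numeric]
        # The ColumnTransformer is held as an attribute, not used as a base class.
        self._column_transformer = ColumnTransformer(
            [
                ("numeric", StandardScaler(), numeric),
                ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical),
            ],
            sparse_threshold=0,  # always return a dense array in this sketch
        )
        return self._column_transformer.fit_transform(X, y)

    def transform(self, X):
        # Delegate to the wrapped ColumnTransformer.
        return self._column_transformer.transform(X)

    @property
    def named_transformers_(self):
        # Re-expose a fitted attribute of the wrapped object.
        return self._column_transformer.named_transformers_


X = pd.DataFrame({"age": [20, 30, 40], "city": ["Paris", "Rome", "Paris"]})
vectorizer = ComposedVectorizer()
out = vectorizer.fit_transform(X)
```

With this layout the public class is no longer a `ColumnTransformer` subclass, so any attribute users relied on has to be re-exposed explicitly, which is what the properties discussed further down do.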

LilianBoulard

Thanks for the PoC, the implementation looks good!

skrub/_table_vectorizer.py

glemaitre · 2023-07-21T16:42:20Z

skrub/_table_vectorizer.py

+    @property
+    def named_transformers_(self):
+        return self._column_transformer.named_transformers_
+
+    @property
+    def sparse_output_(self):
+        return self._column_transformer.sparse_output_
+
+    @property
+    def output_indices_(self):
+        return self._column_transformer.output_indices_
+


We need tests to check those attributes.

LilianBoulard · 2023-07-28T10:32:08Z

I think you need to merge with main for the tests to run :)

Vincent-Maladiere · 2023-08-31T14:17:12Z

skrub/_table_vectorizer.py

+        # TODO: _check_feature_names raises a warning when fitting on dataframe
+        # but transforming on a numpy array.
+        # In practice, this looks error-prone and we need to discuss
+        # whether to raise an error instead.
+        #
+        # Note that when fitting on a dataframe and transforming on
+        # the same dataframe with different column names,
+        # _check_feature_names will raise an error.
+        self._check_feature_names(X, reset=reset)
+        feature_names = _get_feature_names(X)
+        feature_names_in = getattr(self, "feature_names_in_", None)
+        if feature_names is None and feature_names_in is not None:
+            X.columns = feature_names_in


Following our IRL discussion, @glemaitre

Vincent-Maladiere · 2023-09-01T16:28:41Z

skrub/_table_vectorizer.py

@LeoGrin and @glemaitre this should fix #709! This is very similar to what is done in ColumnTransformer

glemaitre · 2023-09-27T13:15:08Z

Merging #592 makes this design more complex. I will probably restart from scratch since we need to handle the split/merge of the transformers that are parallelized.

Vincent-Maladiere · 2023-10-16T12:19:08Z

Should we close this PR @glemaitre?

glemaitre · 2023-10-17T14:01:41Z

It will be closed automatically when #761 is merged.

glemaitre added 20 commits July 18, 2023 14:38

MAINT activate common test sklearn

40004e0

iter

2299f09

Merge remote-tracking branch 'origin/main' into common_test

b524edd

TST make GapEncoder compatible with scikit-learn

57904da

iter

4d6602c

SimilarityEncoder compat

2f0cb58

DatetimeEncoder support

2efa4ad

iter

7c379e5

iter

b45f48c

iter

eae158a

fix ci

0f778b0

iter

a3c2255

iter

37c75e8

iter

92087c9

Merge remote-tracking branch 'origin/main' into improve_table_vectorizer

837920f

Merge remote-tracking branch 'origin/main' into improve_table_vectorizer

ba9e28b

MAINT use composition in TableVectorizer

1ae54eb

iter

4cf8806

iter

69a8082

iter

e56f922

LilianBoulard reviewed Jul 21, 2023

skrub/_table_vectorizer.py (outdated)

LilianBoulard assigned glemaitre and LilianBoulard Jul 21, 2023

LilianBoulard added the enhancement New feature or request label Jul 21, 2023

pep8

39c1d23

LilianBoulard added the no changelog needed label Jul 21, 2023

glemaitre mentioned this pull request Jul 21, 2023

FIX make sure to clone remainder in TableVectorizer #678

Merged

glemaitre added 2 commits July 21, 2023 18:40

iter

2ec22f6

iter

3b43b2b

glemaitre commented Jul 21, 2023

LilianBoulard added 2 commits August 18, 2023 12:23

Merge branch 'main' of https://github.com/skrub-data/skrub into impro…

d79cace

…ve_table_vectorizer # Conflicts: # skrub/_table_vectorizer.py # skrub/tests/test_table_vectorizer.py

Clean error

cb8ad3b

LilianBoulard mentioned this pull request Aug 27, 2023

Fix TableVectorizer get_feature_names_out #722

Merged

Vincent-Maladiere mentioned this pull request Aug 30, 2023

MAINT Fix feature_name warning during transform for MinHashEncoder #725

Merged

Vincent-Maladiere added 4 commits August 30, 2023 17:45

remove ._columns from table_vectorizer

6b5e6d3

Merge branch 'main' into improve_table_vectorizer

bfa8699

fix tests because I removed 'self.columns_' earlier

eb793e8

Merge branch 'main' into improve_table_vectorizer

a57dfa0

Vincent-Maladiere reviewed Aug 31, 2023

Vincent-Maladiere added 2 commits August 31, 2023 16:53

add properties tests

5102457

add docstring to properties

41c5cc6

Vincent-Maladiere mentioned this pull request Sep 1, 2023

Initialize self.transformers in init to fix grid search for TableVectorizer #731

Closed

add get_params and set_params to enable grid_search

c0ef079

Vincent-Maladiere reviewed Sep 1, 2023

jovan-stojanovic approved these changes Sep 6, 2023

glemaitre mentioned this pull request Sep 27, 2023

MAINT use composition in TableVectorizer contn'd #761

Merged

jeromedockes closed this in #761 Oct 30, 2023
