
FEAT-#5836: Introduce 'partial' dtypes cache #6663

Merged
merged 10 commits into modin-project:master on Nov 17, 2023

Conversation

dchigarev
Collaborator

@dchigarev dchigarev commented Oct 20, 2023

What do these changes do?

This PR reduces the number of expensive ._compute_dtypes() calls by improving how we store, use, and update the dtypes cache.

How it was before

Previously, the dtypes cache could only store "complete" schemas: if we didn't know the dtype of even one column in the dataframe, we set the whole cache to None.

How it is now

Partially known dtypes

The PR introduces a new class called DtypesDescriptor that can store partially known schemas and provides an API to work with them. The idea is that when the complete schema really has to be materialized, the DtypesDescriptor only computes dtypes for the columns we have no information about (a toy sketch of this idea follows the list below).

There could be several types of partially known schemas:

  1. We know the schema for a subset of columns and don't know any other info about the rest of the columns:
    DtypesDescriptor(known_dtypes={"a": ..., "b": ...}, know_all_names=False)
  2. We know the schema for a subset of columns and also know the names of the columns with unknown dtypes:
    DtypesDescriptor(known_dtypes={"a": ..., "b": ...}, cols_with_unknown_dtypes=["c", "d"])
  3. We know the schema for a subset of columns and also know a dtype for the rest of the columns:
    DtypesDescriptor(known_dtypes={"a": ..., "b": ...}, remaining_dtype=np.dtype(bool))
  4. We only know a common dtype for the whole dataframe:
    DtypesDescriptor(remaining_dtype=np.dtype(bool))
  5. ...
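The following is a minimal, self-contained sketch of the idea (a toy stand-in, not the actual Modin class; the class name and the materialize() helper are illustrative): a partial schema keeps what is already known and, when a complete schema is requested, computes dtypes only for the columns it knows nothing about.

import numpy as np
import pandas as pd

class PartialDtypes:
    """Toy stand-in for a partial dtypes cache (not Modin's DtypesDescriptor)."""

    def __init__(self, known_dtypes=None, cols_with_unknown_dtypes=None, remaining_dtype=None):
        self.known_dtypes = dict(known_dtypes or {})
        self.cols_with_unknown_dtypes = list(cols_with_unknown_dtypes or [])
        self.remaining_dtype = remaining_dtype

    def materialize(self, df):
        # If a common dtype for the remaining columns is known, fill the gaps with it...
        if self.remaining_dtype is not None:
            for col in df.columns:
                self.known_dtypes.setdefault(col, self.remaining_dtype)
        # ...otherwise compute dtypes only for the unknown columns (the expensive part).
        elif self.cols_with_unknown_dtypes:
            computed = df[self.cols_with_unknown_dtypes].dtypes
            self.known_dtypes.update(computed.to_dict())
            self.cols_with_unknown_dtypes = []
        return pd.Series(self.known_dtypes)[df.columns]

df = pd.DataFrame({"a": [1], "b": [1.5], "c": [True], "d": [False]})
descriptor = PartialDtypes(
    known_dtypes={"a": np.dtype(int), "b": np.dtype(float)},
    cols_with_unknown_dtypes=["c", "d"],
)
print(descriptor.materialize(df))  # only "c" and "d" had to be computed here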

How to efficiently use them

It turned out that there are a lot of cases where we don't need to know the complete schema of a dataframe. For example, our front-end triggers the .dtypes field a lot just to verify whether a dataframe consists entirely of numerical columns. Obviously, we don't need the full schema for that, only the set of dtypes. For that purpose, a method called DtypesDescriptor.get_list_of_dtypes() was introduced; it helped eliminate ._compute_dtypes() calls in a few places of a workload that we're optimizing right now. A hedged sketch of the idea is shown below.
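As a hedged illustration of that pattern (the dtypes-set value and the exact API are assumed for the example; the diff further below uses get_dtypes_set() on the query compiler):

import numpy as np
from pandas.api.types import is_numeric_dtype

# Knowing only the *set* of dtypes present in the frame is enough to answer
# "is every column numeric?" -- no per-column schema is required.
dtypes_set = {np.dtype("int64"), np.dtype("float64")}  # e.g. what a dtypes-set API could return
all_numeric = all(is_numeric_dtype(dtype) for dtype in dtypes_set)
print(all_numeric)  # True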

There's also a method called DtypesDescriptor.lazy_get(subset) that can take a subset of partially known dtypes; this is mainly used in masking. For example:

>>> df.columns # DtypesDescriptor(known_dtypes={"a": int, "b": float}, cols_with_unknown_dtypes=["c", "d"])
["a", "b", "c", "d"]
>>> subset = df[["a", "b"]] # DtypesDescriptor(known_dtypes={"a": int, "b": float}, all_cols_are_known=True)
>>>
>>> df2.columns # DtypesDescriptor(known_dtypes={"a": int, "b": float}, remaining_dtype=float)
["a", "b", "c", "d"]
>>> subset = df2[["b", "c", "d"]] # DtypesDescriptor(known_dtypes={"b": float, "c": float, "d": float}, all_cols_are_known=True)
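A standalone toy version of the second case above (illustrative names, not the real lazy_get() implementation), assuming a remaining_dtype covers every column that is not listed explicitly:

import numpy as np

known = {"a": np.dtype(int), "b": np.dtype(float)}
remaining_dtype = np.dtype(float)  # dtype shared by all columns not listed in `known`

def toy_lazy_get(subset):
    # Every requested column is either known explicitly or covered by
    # `remaining_dtype`, so the resulting schema is fully known.
    return {col: known.get(col, remaining_dtype) for col in subset}

print(toy_lazy_get(["b", "c", "d"]))
# {'b': dtype('float64'), 'c': dtype('float64'), 'd': dtype('float64')}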

There's also a DtypesDescriptor.concat() method that merges partially known dtypes; it is mainly used in pd.concat(), df.__setitem__(), and df.insert() (a toy illustration follows).
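A toy, standalone illustration of such a merge (not Modin's actual concat logic), for the column-wise case that df.insert() and df.__setitem__() hit:

import numpy as np

left_known, left_unknown = {"a": np.dtype(int)}, ["b"]    # partial schema of the left frame
right_known, right_unknown = {"c": np.dtype(float)}, []   # partial schema of the inserted columns

merged_known = {**left_known, **right_known}    # "a" and "c" stay exactly known
merged_unknown = left_unknown + right_unknown   # "b" stays marked as unknown
# Describing the concatenation result required no dtype computation at all.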

At the current state, the following scenario works completely without triggering ._compute_dtypes(), and the complete schema is known at the end (previously, there were several ._compute_dtypes() calls and the schema was still unknown in the result):

# Imports assumed from the surrounding test module (not part of the original snippet).
from unittest import mock

import modin.pandas as pd
from modin.core.dataframe.pandas.dataframe.dataframe import PandasDataframe


def test_get_dummies_case(self):
    # Patch '_compute_dtypes' so any call to it is recorded and can be asserted on.
    with mock.patch.object(PandasDataframe, "_compute_dtypes") as patch:
        df = pd.DataFrame(
            {"items": [1, 2, 3, 4], "b": [3, 3, 4, 4], "c": [1, 0, 0, 1]}
        )
        res = pd.get_dummies(df, columns=["b", "c"])
        cols = [col for col in res.columns if col != "items"]
        res[cols] = res[cols] / res[cols].mean()
        # The complete schema is available without it ever being computed explicitly.
        assert res._query_compiler._modin_frame.has_materialized_dtypes
    patch.assert_not_called()
  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Introduce dtypes cache that can have certain columns to be unknown #5836
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@dchigarev dchigarev changed the title FEAT-#5836: Introduce 'partial' dtypes cache WIP-FEAT-#5836: Introduce 'partial' dtypes cache Nov 6, 2023
Comment on lines +319 to +321
Returns
-------
pandas.Series, ModinDtypes or callable
dchigarev (Collaborator, Author):
return updated value for convenience

@@ -376,7 +403,13 @@ def dtype_builder(df):
if columns is not None:
# Sorting positions to request columns in the order they're stored (it's more efficient)
numeric_indices = sorted(self.columns.get_indexer_for(columns))
obj = self._take_2d_positional(col_positions=numeric_indices)
dchigarev (Collaborator, Author) commented on Nov 12, 2023:
_take_2d_positional doesn't apply deferred labels, which made some of the new tests fail; added this simple workaround until we figure out whether we need lazy_metadata_decorator there or not (#0000 TODO: raise an issue)

@@ -249,6 +249,22 @@ def __reduce__(self):
},
)

def __getitem__(self, key):
dchigarev (Collaborator, Author):
this method is required to use ModinIndex as a regular index

@@ -4300,13 +4303,13 @@ def map_fn(df): # pragma: no cover
# than it would be to reuse the code for specific columns.
if len(columns) == len(self.columns):
new_modin_frame = self._modin_frame.apply_full_axis(
0, map_fn, new_index=self.index
0, map_fn, new_index=self.index, dtypes=bool
dchigarev (Collaborator, Author):
an example of how we can use the 'remaining_dtype' functionality; later we should find more places where it can be applied

@@ -505,12 +505,12 @@ def _dtypes_for_exprs(self, exprs):
@_inherit_docstrings(PandasDataframe._maybe_update_proxies)
def _maybe_update_proxies(self, dtypes, new_parent=None):
if new_parent is not None:
super()._maybe_update_proxies(dtypes, new_parent)
return super()._maybe_update_proxies(dtypes, new_parent)
dchigarev (Collaborator, Author):
we changed this method in PandasDataframe to return dtypes with an updated parent

@@ -2875,8 +2875,9 @@ def _validate_dtypes(self, numeric_only=False):
# Series.__getitem__ treating keys as positions is deprecated. In a future version,
# integer keys will always be treated as labels (consistent with DataFrame behavior).
# To access a value by position, use `ser.iloc[pos]`
dtype = self.dtypes.iloc[0]
for t in self.dtypes:
dtypes = self._query_compiler.get_dtypes_set()
dchigarev (Collaborator, Author):
an example of how we can use the get_dtypes_set() functionality: instead of materializing the whole schema, we only request a set of dtypes. There are more places like this in functions that do is_numeric_dtype() checks

@@ -1011,7 +1015,7 @@ def test_merge_preserves_metadata(has_cols_metadata, has_dtypes_metadata):

if has_dtypes_metadata:
# Verify that there were initially materialized metadata
assert modin_frame.has_dtypes_cache
assert modin_frame.has_materialized_dtypes
dchigarev (Collaborator, Author):
we now have some dtypes cache quite often, so the old check no longer works correctly; what it actually wants to check is whether we have materialized dtypes

@dchigarev dchigarev changed the title WIP-FEAT-#5836: Introduce 'partial' dtypes cache FEAT-#5836: Introduce 'partial' dtypes cache Nov 12, 2023
Signed-off-by: Dmitry Chigarev <[email protected]>
Signed-off-by: Dmitry Chigarev <[email protected]>
dchigarev (Collaborator, Author) commented:
@anmyachev @YarShev @AndreyPavlenko the PR is now ready for review

anmyachev (Collaborator) left a comment:

@dchigarev amazing changes! I need more time to finish the review, but it’s already clear that more tests are needed, since there are a lot of them :)

Overall LGTM, but I want to take a closer look at some of the implementation details.

ErrorMessage.catch_bugs_and_request_email(
failure_condition=not self.is_materialized
)
return ModinDtypes(self._value.iloc[ids] if numeric_index else self._value[ids])

Check failure (Code scanning / CodeQL): Unhashable object hashed. This instance of list is unhashable.
Collaborator:
@dchigarev False positive?

dchigarev (Collaborator, Author):
I think so, because this code is tested and works properly

Signed-off-by: Dmitry Chigarev <[email protected]>
Comment on lines +164 to +166
if len(self._columns_order) > (
len(self._known_dtypes) + len(self._cols_with_unknown_dtypes)
):
Collaborator:
Why is the "The length of 'columns_order' doesn't match to 'known_dtypes' and 'cols_with_unknown_dtypes'" exception thrown in the constructor when the lengths do not match, but here there is a recalculation?

dchigarev (Collaborator, Author):
Because if you know columns_order beforehand, you can, and IMHO must, complete known_dtypes and cols_with_unk... on your own with the information you have. The DtypesDescriptor constructor's parameter matrix is already quite complicated, and I didn't want to add yet another degree of freedom like "oh, and you can also provide some incomplete argument and we'll magically infer everything from the rest of the arguments". I'd rather put as many limitations as possible in order to simplify the constructor's logic and avoid potentially missed and unprocessed cases caused by a bloated parameter matrix.

Comment on lines +330 to +331
and set(self._cols_with_unknown_dtypes)
== set(other._cols_with_unknown_dtypes)
Collaborator:
Why are you using set here?

dchigarev (Collaborator, Author):
just to ignore the order

Collaborator:
We can ignore it here because there is the following check: self.columns_order == other.columns_order?

dchigarev (Collaborator, Author) commented on Nov 16, 2023:

nope, consider this example:

dt1 = DtypesDescriptor(cols_with_unknown_dtypes=["a", "b"], columns_order={0: "a", 1: "b"})
dt2 = DtypesDescriptor(cols_with_unknown_dtypes=["b", "a"], columns_order={0: "a", 1: "b"})
dt1.equals(dt2) # should be true

dt1 = DtypesDescriptor(cols_with_unknown_dtypes=["a", "b"], columns_order=None)
dt2 = DtypesDescriptor(cols_with_unknown_dtypes=["b", "a"], columns_order=None)
dt1.equals(dt2) # should be true

Collaborator:
I see, thanks

Signed-off-by: Dmitry Chigarev <[email protected]>
Signed-off-by: Dmitry Chigarev <[email protected]>
anmyachev
anmyachev previously approved these changes Nov 16, 2023
anmyachev (Collaborator) left a comment:
LGTM!

dchigarev commented Nov 16, 2023

> LGTM!

@anmyachev just wanted to bring your attention to #6737; I want that one to be merged first, as otherwise the #6663 changes will carry an incorrect dtype through the workload we're optimizing, resulting in a perf degradation since it would then perform object-like math rather than float-like math.

anmyachev (Collaborator) replied:
> LGTM!
>
> @anmyachev just wanted to bring your attention to #6737; I want that one to be merged first, as otherwise the #6663 changes will carry an incorrect dtype through the workload we're optimizing, resulting in a perf degradation since it would then perform object-like math rather than float-like math.

Merged

Signed-off-by: Dmitry Chigarev <[email protected]>
@anmyachev anmyachev merged commit b7bf9b5 into modin-project:master Nov 17, 2023
37 checks passed