
FEAT-#5836: Introduce 'partial' dtypes cache #6663

Merged
merged 10 commits into modin-project:master on Nov 17, 2023

Conversation

dchigarev
Collaborator

@dchigarev dchigarev commented Oct 20, 2023

What do these changes do?

This PR reduces the number of expensive ._compute_dtypes() calls by improving how we store, use, and update the dtypes cache.

How it was before

Previously, the dtypes cache could only store "complete" schemas: if we didn't know the dtype of even one column in the dataframe, we set the whole cache to None.

How it is now

Partially known dtypes

The PR introduces a new class called DtypesDescriptor that can store partially known schemas and provides an API to work with them. The idea is that when the complete schema really has to be materialized, the DtypesDescriptor only computes dtypes for the columns we have no information about (a toy sketch of this idea follows the list below).

There could be several types of partially known schemas:

  1. We know the schema for a subset of columns and don't know any other info about the rest of the columns:
    DtypesDescriptor(known_dtypes={"a": ..., "b": ...}, know_all_names=False)
  2. We know the schema for a subset of columns and also know the names of the columns with unknown dtypes:
    DtypesDescriptor(known_dtypes={"a": ..., "b": ...}, cols_with_unknown_dtypes=["c", "d"])
  3. We know the schema for a subset of columns and also know a dtype for the rest of the columns:
    DtypesDescriptor(known_dtypes={"a": ..., "b": ...}, remaining_dtype=np.dtype(bool))
  4. We only know a common dtype for the whole dataframe:
    DtypesDescriptor(remaining_dtype=np.dtype(bool))
  5. ...
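The following is a minimal, self-contained sketch of the idea (a toy stand-in, not the actual Modin class; the class name and the materialize() helper are illustrative): a partial schema keeps what is already known and, when a complete schema is requested, computes dtypes only for the columns it knows nothing about.

import numpy as np
import pandas as pd

class PartialDtypes:
    """Toy stand-in for a partial dtypes cache (not Modin's DtypesDescriptor)."""

    def __init__(self, known_dtypes=None, cols_with_unknown_dtypes=None, remaining_dtype=None):
        self.known_dtypes = dict(known_dtypes or {})
        self.cols_with_unknown_dtypes = list(cols_with_unknown_dtypes or [])
        self.remaining_dtype = remaining_dtype

    def materialize(self, df):
        # If a common dtype for the remaining columns is known, fill the gaps with it...
        if self.remaining_dtype is not None:
            for col in df.columns:
                self.known_dtypes.setdefault(col, self.remaining_dtype)
        # ...otherwise compute dtypes only for the unknown columns (the expensive part).
        elif self.cols_with_unknown_dtypes:
            computed = df[self.cols_with_unknown_dtypes].dtypes
            self.known_dtypes.update(computed.to_dict())
            self.cols_with_unknown_dtypes = []
        return pd.Series(self.known_dtypes)[df.columns]

df = pd.DataFrame({"a": [1], "b": [1.5], "c": [True], "d": [False]})
descriptor = PartialDtypes(
    known_dtypes={"a": np.dtype(int), "b": np.dtype(float)},
    cols_with_unknown_dtypes=["c", "d"],
)
print(descriptor.materialize(df))  # only "c" and "d" had to be computed here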

How to efficiently use them

It turned out that there are a lot of cases where we don't need to know the complete schema of a dataframe. For example, our front-end triggers the .dtypes field a lot just to verify whether a dataframe consists entirely of numerical columns. Obviously, we don't need the full schema for that, only the set of dtypes. For that purpose, a method called DtypesDescriptor.get_list_of_dtypes() was introduced; it helped eliminate ._compute_dtypes() calls in a few places of a workload that we're optimizing right now. A hedged sketch of the idea is shown below.
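As a hedged illustration of that pattern (the dtypes-set value and the exact API are assumed for the example; the diff further below uses get_dtypes_set() on the query compiler):

import numpy as np
from pandas.api.types import is_numeric_dtype

# Knowing only the *set* of dtypes present in the frame is enough to answer
# "is every column numeric?" -- no per-column schema is required.
dtypes_set = {np.dtype("int64"), np.dtype("float64")}  # e.g. what a dtypes-set API could return
all_numeric = all(is_numeric_dtype(dtype) for dtype in dtypes_set)
print(all_numeric)  # True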

There's also a method called DtypesDescriptor.lazy_get(subset) that can take a subset of partially known dtypes; this is mainly used in masking. For example:

>>> df.columns # DtypesDescriptor(known_dtypes={"a": int, "b": float}, cols_with_unknown_dtypes=["c", "d"])
["a", "b", "c", "d"]
>>> subset = df[["a", "b"]] # DtypesDescriptor(known_dtypes={"a": int, "b": float}, all_cols_are_known=True)
>>>
>>> df2.columns # DtypesDescriptor(known_dtypes={"a": int, "b": float}, remaining_dtype=float)
["a", "b", "c", "d"]
>>> subset = df2[["b", "c", "d"]] # DtypesDescriptor(known_dtypes={"b": float, "c": float, "d": float}, all_cols_are_known=True)
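A standalone toy version of the second case above (illustrative names, not the real lazy_get() implementation), assuming a remaining_dtype covers every column that is not listed explicitly:

import numpy as np

known = {"a": np.dtype(int), "b": np.dtype(float)}
remaining_dtype = np.dtype(float)  # dtype shared by all columns not listed in `known`

def toy_lazy_get(subset):
    # Every requested column is either known explicitly or covered by
    # `remaining_dtype`, so the resulting schema is fully known.
    return {col: known.get(col, remaining_dtype) for col in subset}

print(toy_lazy_get(["b", "c", "d"]))
# {'b': dtype('float64'), 'c': dtype('float64'), 'd': dtype('float64')}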

There's also a DtypesDescriptor.concat() method that merges partially known dtypes; it is mainly used in pd.concat(), df.__setitem__(), and df.insert() (a toy illustration follows).
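A toy, standalone illustration of such a merge (not Modin's actual concat logic), for the column-wise case that df.insert() and df.__setitem__() hit:

import numpy as np

left_known, left_unknown = {"a": np.dtype(int)}, ["b"]    # partial schema of the left frame
right_known, right_unknown = {"c": np.dtype(float)}, []   # partial schema of the inserted columns

merged_known = {**left_known, **right_known}    # "a" and "c" stay exactly known
merged_unknown = left_unknown + right_unknown   # "b" stays marked as unknown
# Describing the concatenation result required no dtype computation at all.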

At the current state, the following scenario works completely without triggering ._compute_dtypes(), and the complete schema is known at the end (previously, there were several ._compute_dtypes() calls and the schema was still unknown in the result):

# Imports assumed from the surrounding test module (not part of the original snippet).
from unittest import mock

import modin.pandas as pd
from modin.core.dataframe.pandas.dataframe.dataframe import PandasDataframe


def test_get_dummies_case(self):
    # Patch '_compute_dtypes' so any call to it is recorded and can be asserted on.
    with mock.patch.object(PandasDataframe, "_compute_dtypes") as patch:
        df = pd.DataFrame(
            {"items": [1, 2, 3, 4], "b": [3, 3, 4, 4], "c": [1, 0, 0, 1]}
        )
        res = pd.get_dummies(df, columns=["b", "c"])
        cols = [col for col in res.columns if col != "items"]
        res[cols] = res[cols] / res[cols].mean()
        # The complete schema is available without it ever being computed explicitly.
        assert res._query_compiler._modin_frame.has_materialized_dtypes
    patch.assert_not_called()
  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Introduce dtypes cache that can have certain columns to be unknown #5836
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@dchigarev dchigarev changed the title FEAT-#5836: Introduce 'partial' dtypes cache WIP-FEAT-#5836: Introduce 'partial' dtypes cache Nov 6, 2023
Comment on lines +319 to +321
Returns
-------
pandas.Series, ModinDtypes or callable
dchigarev (Collaborator, Author):
return updated value for convenience

@@ -376,7 +403,13 @@ def dtype_builder(df):
if columns is not None:
# Sorting positions to request columns in the order they're stored (it's more efficient)
numeric_indices = sorted(self.columns.get_indexer_for(columns))
obj = self._take_2d_positional(col_positions=numeric_indices)
dchigarev (Collaborator, Author) commented on Nov 12, 2023:
_take_2d_positional doesn't apply deferred labels, which made some of the new tests fail; added this simple workaround until we figure out whether we need lazy_metadata_decorator there or not (#0000 TODO: raise an issue)

@@ -249,6 +249,22 @@ def __reduce__(self):
},
)

def __getitem__(self, key):
dchigarev (Collaborator, Author):
this method is required to use ModinIndex as a regular index

@@ -4300,13 +4303,13 @@ def map_fn(df): # pragma: no cover
# than it would be to reuse the code for specific columns.
if len(columns) == len(self.columns):
new_modin_frame = self._modin_frame.apply_full_axis(
0, map_fn, new_index=self.index
0, map_fn, new_index=self.index, dtypes=bool
dchigarev (Collaborator, Author):
an example of how we can use the 'remaining_dtype' functionality; later we should find more places where it can be applied

@@ -505,12 +505,12 @@ def _dtypes_for_exprs(self, exprs):
@_inherit_docstrings(PandasDataframe._maybe_update_proxies)
def _maybe_update_proxies(self, dtypes, new_parent=None):
if new_parent is not None:
super()._maybe_update_proxies(dtypes, new_parent)
return super()._maybe_update_proxies(dtypes, new_parent)
dchigarev (Collaborator, Author):
we changed this method in PandasDataframe to return dtypes with an updated parent

@@ -2875,8 +2875,9 @@ def _validate_dtypes(self, numeric_only=False):
# Series.__getitem__ treating keys as positions is deprecated. In a future version,
# integer keys will always be treated as labels (consistent with DataFrame behavior).
# To access a value by position, use `ser.iloc[pos]`
dtype = self.dtypes.iloc[0]
for t in self.dtypes:
dtypes = self._query_compiler.get_dtypes_set()
dchigarev (Collaborator, Author):
an example of how we can use the get_dtypes_set() functionality: instead of materializing the whole schema, we only request a set of dtypes. There are more places like this in functions that do is_numeric_dtype() checks

@@ -1011,7 +1015,7 @@ def test_merge_preserves_metadata(has_cols_metadata, has_dtypes_metadata):

if has_dtypes_metadata:
# Verify that there were initially materialized metadata
assert modin_frame.has_dtypes_cache
assert modin_frame.has_materialized_dtypes
dchigarev (Collaborator, Author):
we now have some dtypes cache quite often, so the old check no longer works correctly; what it actually wants to check is whether we have materialized dtypes

@dchigarev dchigarev changed the title WIP-FEAT-#5836: Introduce 'partial' dtypes cache FEAT-#5836: Introduce 'partial' dtypes cache Nov 12, 2023
Signed-off-by: Dmitry Chigarev <[email protected]>
Signed-off-by: Dmitry Chigarev <[email protected]>
dchigarev (Collaborator, Author) commented:
@anmyachev @YarShev @AndreyPavlenko the PR is now ready for review

anmyachev (Collaborator) left a comment:

@dchigarev amazing changes! I need more time to finish the review, but it’s already clear that more tests are needed, since there are a lot of them :)

Overall LGTM, but I want to take a closer look at some of the implementation details.

ErrorMessage.catch_bugs_and_request_email(
failure_condition=not self.is_materialized
)
return ModinDtypes(self._value.iloc[ids] if numeric_index else self._value[ids])

Check failure (Code scanning / CodeQL): Unhashable object hashed. This instance of list is unhashable.
Collaborator:
@dchigarev False positive?

dchigarev (Collaborator, Author):
I think so, because this code is tested and works properly

Signed-off-by: Dmitry Chigarev <[email protected]>
Comment on lines +164 to +166
if len(self._columns_order) > (
len(self._known_dtypes) + len(self._cols_with_unknown_dtypes)
):
Collaborator:
Why is the "The length of 'columns_order' doesn't match to 'known_dtypes' and 'cols_with_unknown_dtypes'" exception thrown in the constructor when the lengths do not match, but here there is a recalculation?

dchigarev (Collaborator, Author):
Because if you know columns_order beforehand, you can, and IMHO must, complete known_dtypes and cols_with_unk... on your own with the information you have. The DtypesDescriptor constructor's parameter matrix is already quite complicated, and I didn't want to add yet another degree of freedom like "oh, and you can also provide some incomplete argument and we'll magically infer everything from the rest of the arguments". I'd rather put as many limitations as possible in order to simplify the constructor's logic and avoid potentially missed and unprocessed cases caused by a bloated parameter matrix.

Comment on lines +330 to +331
and set(self._cols_with_unknown_dtypes)
== set(other._cols_with_unknown_dtypes)
Collaborator:
Why are you using set here?

dchigarev (Collaborator, Author):
just to ignore the order

Collaborator:
We can ignore it here because there is the following check: self.columns_order == other.columns_order?

dchigarev (Collaborator, Author) commented on Nov 16, 2023:

nope, consider this example:

dt1 = DtypesDescriptor(cols_with_unknown_dtypes=["a", "b"], columns_order={0: "a", 1: "b"})
dt2 = DtypesDescriptor(cols_with_unknown_dtypes=["b", "a"], columns_order={0: "a", 1: "b"})
dt1.equals(dt2) # should be true

dt1 = DtypesDescriptor(cols_with_unknown_dtypes=["a", "b"], columns_order=None)
dt2 = DtypesDescriptor(cols_with_unknown_dtypes=["b", "a"], columns_order=None)
dt1.equals(dt2) # should be true

Collaborator:
I see, thanks

Signed-off-by: Dmitry Chigarev <[email protected]>
Signed-off-by: Dmitry Chigarev <[email protected]>
anmyachev
anmyachev previously approved these changes Nov 16, 2023
anmyachev (Collaborator) left a comment:
LGTM!

dchigarev commented Nov 16, 2023

> LGTM!

@anmyachev just wanted to bring your attention to #6737; I want that one to be merged first, as otherwise the #6663 changes will carry an incorrect dtype through the workload we're optimizing, resulting in a perf degradation since it would then perform object-like math rather than float-like math.

anmyachev (Collaborator) replied:
> LGTM!
>
> @anmyachev just wanted to bring your attention to #6737; I want that one to be merged first, as otherwise the #6663 changes will carry an incorrect dtype through the workload we're optimizing, resulting in a perf degradation since it would then perform object-like math rather than float-like math.

Merged

Signed-off-by: Dmitry Chigarev <[email protected]>
@anmyachev anmyachev merged commit b7bf9b5 into modin-project:master Nov 17, 2023
37 checks passed