FEAT-#5836: Introduce 'partial' dtypes cache #6663
Conversation
Signed-off-by: Dmitry Chigarev <[email protected]>
Returns
-------
pandas.Series, ModinDtypes or callable

Returns the updated value for convenience.
@@ -376,7 +403,13 @@ def dtype_builder(df):
        if columns is not None:
            # Sorting positions to request columns in the order they're stored (it's more efficient)
            numeric_indices = sorted(self.columns.get_indexer_for(columns))
            obj = self._take_2d_positional(col_positions=numeric_indices)

`_take_2d_positional` doesn't apply deferred labels, which failed some of the new tests; added this simple workaround until we figure out whether we need `lazy_metadata_decorator` there or not (#0000 TODO: raise an issue).
@@ -249,6 +249,22 @@ def __reduce__(self):
            },
        )

    def __getitem__(self, key):

This method is required to use `ModinIndex` as a regular index.
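To illustrate why a lazy index wrapper needs `__getitem__`, here is a toy sketch (class name and behavior are hypothetical, not Modin's actual `ModinIndex`): code that treats the wrapper like a plain index can subscript it, and the underlying values are materialized on first access.

```python
# Hypothetical toy sketch of a lazy index that supports regular subscripting.
class LazyIndex:
    def __init__(self, compute):
        self._compute = compute  # deferred computation of the real index
        self._value = None

    def __getitem__(self, key):
        if self._value is None:
            self._value = self._compute()  # materialize on first use
        return self._value[key]

idx = LazyIndex(lambda: ["a", "b", "c"])
print(idx[1])  # subscripting works like on a plain list: "b"
```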
@@ -4300,13 +4303,13 @@ def map_fn(df):  # pragma: no cover
        # than it would be to reuse the code for specific columns.
        if len(columns) == len(self.columns):
            new_modin_frame = self._modin_frame.apply_full_axis(
-                0, map_fn, new_index=self.index
+                0, map_fn, new_index=self.index, dtypes=bool
            )

An example of how we can use the 'remaining_dtype' functionality; later we should find more places where we can apply this.
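The idea can be sketched in a self-contained way (function name and cache shape below are illustrative assumptions, not Modin's API): when a full-axis map is known to produce a single uniform dtype, such as a comparison yielding `bool`, the caller passes that dtype so the result's schema never needs to be inferred.

```python
# Hypothetical sketch: a scalar `dtypes` argument records one "remaining
# dtype" for every result column, avoiding a later schema computation.
def apply_full_axis_sketch(data, func, dtypes=None):
    result = {col: [func(v) for v in vals] for col, vals in data.items()}
    # dtypes=<scalar> means: every result column has this dtype, no inference needed
    dtypes_cache = {"remaining_dtype": dtypes} if dtypes is not None else None
    return result, dtypes_cache

result, cache = apply_full_axis_sketch({"a": [0, 2]}, lambda v: v > 1, dtypes=bool)
print(cache)   # {'remaining_dtype': <class 'bool'>}
print(result)  # {'a': [False, True]}
```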
@@ -505,12 +505,12 @@ def _dtypes_for_exprs(self, exprs):
    @_inherit_docstrings(PandasDataframe._maybe_update_proxies)
    def _maybe_update_proxies(self, dtypes, new_parent=None):
        if new_parent is not None:
-            super()._maybe_update_proxies(dtypes, new_parent)
+            return super()._maybe_update_proxies(dtypes, new_parent)

We changed this method in `PandasDataframe` to return dtypes with an updated parent.
@@ -2875,8 +2875,9 @@ def _validate_dtypes(self, numeric_only=False):
    # Series.__getitem__ treating keys as positions is deprecated. In a future version,
    # integer keys will always be treated as labels (consistent with DataFrame behavior).
    # To access a value by position, use `ser.iloc[pos]`
-    dtype = self.dtypes.iloc[0]
-    for t in self.dtypes:
+    dtypes = self._query_compiler.get_dtypes_set()

An example of how we can use the `get_dtypes_set()` functionality: instead of materializing the whole schema, we only request the set of dtypes. There are more places like this in functions that do `is_numeric_dtype()` checks.
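The benefit is easy to see in a toy sketch (the helper and dtype strings below are simplified stand-ins, not Modin's or pandas' real API): checking "are all columns numeric?" only needs the set of distinct dtypes, which may be tiny even for a very wide frame.

```python
# Hypothetical sketch: a numeric-only check over a *set* of dtypes instead of
# the full per-column schema.
def all_numeric(dtypes_set):
    numeric = {"int64", "float64"}  # simplified stand-in for is_numeric_dtype()
    return dtypes_set <= numeric

# a frame with thousands of columns may still have only two distinct dtypes
print(all_numeric({"int64", "float64"}))  # True
print(all_numeric({"int64", "object"}))   # False
```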
@@ -1011,7 +1015,7 @@ def test_merge_preserves_metadata(has_cols_metadata, has_dtypes_metadata):

    if has_dtypes_metadata:
        # Verify that there were initially materialized metadata
-        assert modin_frame.has_dtypes_cache
+        assert modin_frame.has_materialized_dtypes

We now have some dtypes cache quite often, so the old check no longer works correctly; what it actually wants to check is whether we have materialized dtypes.
@anmyachev @YarShev @AndreyPavlenko the PR is now ready for review

@dchigarev amazing changes! I need more time to finish the review, but it's already clear that more tests are needed, since there are a lot of changes :)
Overall LGTM, but I want to take a closer look at some implementation details.
ErrorMessage.catch_bugs_and_request_email(
    failure_condition=not self.is_materialized
)
return ModinDtypes(self._value.iloc[ids] if numeric_index else self._value[ids])

Check failure (Code scanning / CodeQL): Unhashable object hashed

@dchigarev False positive?

I think so, because this code is tested and works properly.
if len(self._columns_order) > (
    len(self._known_dtypes) + len(self._cols_with_unknown_dtypes)
):

Why is the "The length of 'columns_order' doesn't match to 'known_dtypes' and 'cols_with_unknown_dtypes'" exception thrown in the constructor when the lengths do not match, but here there is a recalculation?

Because if you know `columns_order` beforehand, you can, and IMHO must, complete `known_dtypes` and `cols_with_unk...` with the information you have on your own. The `DtypesDescriptor` constructor's parameter matrix is already quite complicated, and I didn't want to add yet another degree of freedom like "oh, and you also can provide some incomplete argument and we'll infer everything magically from the rest of the arguments". I'd rather put as many limitations as possible in order to simplify the constructor's logic and avoid potentially missed and unprocessed cases caused by a bloated parameter matrix.
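The invariant described in this thread can be sketched as a standalone helper (hypothetical function; only the error text is taken from the discussion above): when `columns_order` is provided, the known and unknown columns together must cover it exactly.

```python
# Hypothetical sketch of the constructor invariant: no "incomplete" arguments
# are accepted, so the lengths must match up front.
def validate_columns_order(columns_order, known_dtypes, cols_with_unknown_dtypes):
    if columns_order is not None and len(columns_order) != (
        len(known_dtypes) + len(cols_with_unknown_dtypes)
    ):
        raise ValueError(
            "The length of 'columns_order' doesn't match to 'known_dtypes' "
            "and 'cols_with_unknown_dtypes'"
        )

validate_columns_order({0: "a", 1: "b"}, {"a": "int64"}, ["b"])  # passes
```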
and set(self._cols_with_unknown_dtypes)
== set(other._cols_with_unknown_dtypes)

Why are you using `set` here?

Just to ignore the order.

Can we ignore it here because there is the following check: self.columns_order == other.columns_order?

Nope, consider this example:

dt1 = DtypesDescriptor(cols_with_unknown_dtypes=["a", "b"], columns_order={0: "a", 1: "b"})
dt2 = DtypesDescriptor(cols_with_unknown_dtypes=["b", "a"], columns_order={0: "a", 1: "b"})
dt1.equals(dt2)  # should be True

dt1 = DtypesDescriptor(cols_with_unknown_dtypes=["a", "b"], columns_order=None)
dt2 = DtypesDescriptor(cols_with_unknown_dtypes=["b", "a"], columns_order=None)
dt1.equals(dt2)  # should be True

I see, thanks
LGTM!

@anmyachev just wanted to bring your attention to #6737; I want that one to be merged first, as otherwise the #6663 changes will carry an incorrect dtype through the workload we're optimizing, resulting in a perf degradation since it would then perform object-like math rather than float-like math.
Merged
What do these changes do?
This PR decreases the number of expensive `._compute_dtypes()` calls by advancing the mechanisms of how we store, use, and update the dtypes cache.

How it was before
Previously, the dtypes cache could only store "complete" schemas: if we didn't know the dtype for at least one column of the dataframe, the whole cache was set to `None`.

How it is now
Partially known dtypes
The PR introduces a new class called `DtypesDescriptor`, which can store partially known schemas and provides an API to work with them. When it's really required to materialize the complete schema, the `DtypesDescriptor` will only compute dtypes for the columns we have no info about. There can be several types of partially known schemas:
DtypesDescriptor(known_dtypes={"a": ..., "b": ...}, know_all_names=False)
DtypesDescriptor(known_dtypes={"a": ..., "b": ...}, cols_with_unknown_dtypes=["c", "d"])
DtypesDescriptor(known_dtypes={"a": ..., "b": ...}, remaining_dtype=np.dtype(bool))
DtypesDescriptor(remaining_dtype=np.dtype(bool))
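The materialization idea behind these descriptors can be shown with a toy sketch (class name, signatures, and dtype strings are hypothetical simplifications, not Modin's implementation): already-known dtypes are reused from the cache, and only the unknown columns trigger the expensive computation.

```python
# Hypothetical toy descriptor: stores a partially known schema and computes
# only the missing columns on materialization.
class ToyDtypesDescriptor:
    def __init__(self, known_dtypes=None, cols_with_unknown_dtypes=()):
        self.known_dtypes = dict(known_dtypes or {})
        self.cols_with_unknown_dtypes = list(cols_with_unknown_dtypes)

    def materialize(self, compute_dtype):
        # only unknown columns trigger the (expensive) dtype computation
        for col in self.cols_with_unknown_dtypes:
            self.known_dtypes[col] = compute_dtype(col)
        self.cols_with_unknown_dtypes.clear()
        return self.known_dtypes


computed = []

def expensive_compute(col):
    computed.append(col)  # track which columns actually get computed
    return "int64"

desc = ToyDtypesDescriptor(known_dtypes={"a": "float64"}, cols_with_unknown_dtypes=["b"])
schema = desc.materialize(expensive_compute)
print(schema)    # {'a': 'float64', 'b': 'int64'}
print(computed)  # ['b'] -- "a" was reused from the cache
```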
How to efficiently use them
It was found out that in a lot of cases we don't need to know the complete schema of a dataframe. For example, we trigger the `.dtypes` field a lot in our front-end just to verify whether a dataframe fully consists of numerical columns. Obviously, we don't need the full schema for that, only the set of dtypes. For that purpose, a method called `DtypesDescriptor.get_list_of_dtypes()` was introduced; it helped eliminate `._compute_dtypes()` calls in a few places of one of the workloads we're optimizing right now.

There's also a method called `DtypesDescriptor.lazy_get(subset)` that can take a subset of partially known dtypes; this is mainly used in masking.
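A minimal sketch of what taking a lazy subset could look like (the function name and return shape are assumptions for illustration, not the actual `lazy_get` signature): no dtype computation happens, and columns that were unknown simply stay unknown in the result.

```python
# Hypothetical sketch: take a subset of a partially known schema lazily.
def lazy_get_sketch(known_dtypes, subset):
    known = {col: known_dtypes[col] for col in subset if col in known_dtypes}
    unknown = [col for col in subset if col not in known_dtypes]
    return known, unknown

known, unknown = lazy_get_sketch({"a": "int64", "b": "bool"}, subset=["a", "c"])
print(known)    # {'a': 'int64'}
print(unknown)  # ['c'] -- still unknown, still not computed
```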
There's also a `DtypesDescriptor.concat()` method that can merge partially known dtypes; this is mainly used in `pd.concat()`, `df.__setitem__()`, and `df.insert()`.
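The merge idea can be sketched for the simple column-axis case (hypothetical function and data shapes, not Modin's `concat()`): known dtypes from both sides are unioned, and columns unknown on either side remain unknown in the merged descriptor.

```python
# Hypothetical sketch: merge two partially known schemas when frames are
# concatenated along the column axis.
def concat_partial_schemas(left, right):
    left_known, left_unknown = left
    right_known, right_unknown = right
    # known dtypes are unioned; unknown columns stay unknown (and uncomputed)
    return {**left_known, **right_known}, list(left_unknown) + list(right_unknown)

merged_known, merged_unknown = concat_partial_schemas(
    ({"a": "int64"}, ["b"]),
    ({"c": "bool"}, []),
)
print(merged_known)    # {'a': 'int64', 'c': 'bool'}
print(merged_unknown)  # ['b']
```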
At the current state, the following scenario can now work completely without triggering `._compute_dtypes()`, and with the complete schema known at the end (previously, there were several `._compute_dtypes()` calls and the schema was still unknown in the result).

- flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- git commit -s
- docs/development/architecture.rst is up-to-date