Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF-#6696: Use cached dtypes in fillna when possible. #6697

Merged
merged 2 commits into from
Nov 13, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 20 additions & 1 deletion modin/core/storage_formats/pandas/query_compiler.py
Original file line number Diff line number Diff line change
Expand Up @@ -2420,6 +2420,7 @@ def fillna(self, **kwargs):
method = kwargs.get("method", None)
limit = kwargs.get("limit", None)
full_axis = method is not None or limit is not None
new_dtypes = None
if isinstance(value, BaseQueryCompiler):
if squeeze_self:
# Self is a Series type object
Expand Down Expand Up @@ -2487,15 +2488,33 @@ def fillna(df):
}
return df.fillna(value=func_dict, **kwargs)

if self._modin_frame.has_materialized_dtypes:
dtypes = self._modin_frame.dtypes
value_dtypes = pandas.DataFrame(
{k: [v] for (k, v) in value.items()}
).dtypes
if all(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_common_type([dtypes[col], dtype]) == dtypes[col]
for (col, dtype) in value_dtypes.items()
if col in dtypes
):
new_dtypes = dtypes

else:
if self._modin_frame.has_materialized_dtypes:
dtype = pandas.Series(value).dtype
if all(
find_common_type([t, dtype]) == t for t in self._modin_frame.dtypes
):
new_dtypes = self._modin_frame.dtypes
Comment on lines +2505 to +2509
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we simply fill new dtypes with the find_common_type results?

Suggested change
dtype = pandas.Series(value).dtype
if all(
find_common_type([t, dtype]) == t for t in self._modin_frame.dtypes
):
new_dtypes = self._modin_frame.dtypes
dtype = pandas.Series(value).dtype
new_dtypes = pandas.Series({col: find_common_type([t, dtype]) for col, t in self._modin_frame.dtypes.items()})

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the new dtype depends on whether there are NAs in the frame or not. If there no NAs, the dtype is not changed, otherwise, it's changed to the common. We can't analyze the data here, but we can make the assumption, that the dtype will not changed if it's already the common one.


def fillna(df):
return df.fillna(value=value, **kwargs)

if full_axis:
new_modin_frame = self._modin_frame.fold(axis, fillna)
else:
new_modin_frame = self._modin_frame.map(fillna)
new_modin_frame = self._modin_frame.map(fillna, dtypes=new_dtypes)
return self.__constructor__(new_modin_frame)

def quantile_for_list_of_values(self, **kwargs):
Expand Down
Loading