PERF-#6762: Carry dtypes information in lazy indices #6763

dchigarev · 2023-11-21T14:56:15Z

What do these changes do?

This PR adds a ._dtypes property to ModinIndex. It can be passed to the object's constructor to indicate the dtypes of the index levels, even when the index values are not yet known.
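The idea of an index that knows its dtypes before its values can be sketched roughly as follows. This is a toy model, not Modin's code: the class name LazyIndex, its constructor arguments, and its methods are all hypothetical and only illustrate carrying dtype metadata alongside a deferred computation.

```python
from typing import Callable, Optional

import numpy as np
import pandas as pd


class LazyIndex:
    """Hypothetical stand-in for a lazily computed index (not Modin's ModinIndex)."""

    def __init__(
        self,
        compute: Callable[[], pd.Index],
        dtypes: Optional[pd.Series] = None,
    ):
        self._compute = compute  # deferred computation of the index values
        self._value: Optional[pd.Index] = None
        self._dtypes = dtypes  # per-level dtypes, possibly known before the values

    @property
    def is_materialized(self) -> bool:
        return self._value is not None

    @property
    def dtypes(self) -> pd.Series:
        # Answer from the carried metadata when available; only fall back to
        # computing the index values when no dtypes were supplied.
        if self._dtypes is None:
            idx = self.get()
            if isinstance(idx, pd.MultiIndex):
                self._dtypes = idx.dtypes
            else:
                self._dtypes = pd.Series({idx.name: idx.dtype})
        return self._dtypes

    def get(self) -> pd.Index:
        # Materialize the values on first access and cache them.
        if self._value is None:
            self._value = self._compute()
        return self._value


lazy = LazyIndex(
    lambda: pd.Index([1, 2, 3], name="a"),
    dtypes=pd.Series({"a": np.dtype("int64")}),
)
print(lazy.dtypes["a"], lazy.is_materialized)
```

Querying dtypes here does not trigger the deferred computation, which is the property the description relies on.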

This was done mainly to preserve dtypes in the following case:

df.dtypes # known_dtypes: {"a": int, ...}
res = df.groupby("a", as_index=True).sum()
res # 'a' column is now in the index, but its dtype is lost
res = res.reset_index(drop=False)
res.dtypes # cols_with_unknown_dtypes: ["a"]

With the changes in this PR, the scenario above preserves the dtype of the "a" column at the end.
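For reference, the same round trip in plain pandas keeps the dtype of "a" throughout, which is the result the carried metadata is meant to reproduce without recomputation. The data below is made up for illustration and does not come from the PR.

```python
import pandas as pd

# Illustrative frame: "a" is the groupby key, "b" is the aggregated column.
df = pd.DataFrame({"a": [1, 1, 2], "b": [0.5, 1.5, 2.5]})

res = df.groupby("a", as_index=True).sum()
index_dtype = res.index.dtype  # dtype of "a" while it lives in the index

restored = res.reset_index(drop=False)
print(df.dtypes["a"], index_dtype, restored.dtypes["a"])
```

All three printed dtypes agree, so a lazy implementation that remembers the key's dtype when it moves into the index can report it again after reset_index.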

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves #6762 (Support carrying dtypes in an index cache)
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Dmitry Chigarev <[email protected]>

anmyachev · 2023-11-21T17:06:52Z

modin/core/storage_formats/pandas/query_compiler.py

+            and len(not_broadcastable_by) == 0
+            and len(broadcastable_by) == 1


Why such restrictions?

These restrictions describe the simple case where the 'by' keys are columns from the same dataframe. I don't want to handle other cases in this PR, since pandas is always tricky in how it handles groupby with mixed 'by'.

added an in-code comment about this
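A rough sketch of what such a guard amounts to is shown below. Only the two variable names and the length checks come from the diff; the function name, the assumption that each entry of broadcastable_by is a frame-like object exposing .dtypes, and the choice of None for "unknown" are hypothetical.

```python
from typing import Optional, Sequence

import pandas as pd


def index_dtypes_for_groupby(
    broadcastable_by: Sequence[pd.DataFrame],
    not_broadcastable_by: Sequence[object],
) -> Optional[pd.Series]:
    """Return dtypes for the resulting index, or None when they are unknown.

    Hypothetical helper: dtypes are carried over only in the simple case of a
    single frame-like 'by' and no other kinds of keys.
    """
    if len(not_broadcastable_by) == 0 and len(broadcastable_by) == 1:
        return broadcastable_by[0].dtypes
    # Mixed or multiple 'by' sources: leave the dtypes unknown instead of guessing.
    return None


keys = pd.DataFrame({"a": [1, 2, 3]})
print(index_dtypes_for_groupby([keys], []))
print(index_dtypes_for_groupby([keys], ["external_key"]))
```

Returning None for every other combination keeps the optimization conservative: dtypes are reported only when they are certain.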

anmyachev

LGTM!

PERF-modin-project#6762: Carry dtypes information in lazy indices

c084c61

Signed-off-by: Dmitry Chigarev <[email protected]>

dchigarev marked this pull request as ready for review November 21, 2023 15:21

dchigarev requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev and a team as code owners November 21, 2023 15:21

anmyachev reviewed Nov 21, 2023

dchigarev commented Nov 21, 2023

Update modin/core/storage_formats/pandas/query_compiler.py

87363b5

anmyachev approved these changes Nov 21, 2023

anmyachev merged commit b8323b5 into modin-project:master Nov 21, 2023
34 of 35 checks passed
