Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF-#6762: Carry dtypes information in lazy indices #6763

Merged
merged 2 commits into from
Nov 21, 2023

Conversation

dchigarev
Copy link
Collaborator

@dchigarev dchigarev commented Nov 21, 2023

What do these changes do?

This PR adds ._dtypes property to ModinIndex, one can pass it to the object's constructor and indicate dtypes of index levels, even if the index values are yet unknown.

This was done mainly to preserve dtypes in the following case:

df.dtypes # known_dtypes: {"a": int, ...}
res = df.groupby("a", as_index=True).sum()
res # 'a' column is now in the index, but we lost its dtypes
res = res.reset_index(drop=False)
res.dtypes # cols_with_unknown_dtypes: ["a"]

With the changes in this PR, the presented scenario will preserve dtype for "a" column in the end.

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Support carrying dtypes in an index cache #6762
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@dchigarev dchigarev marked this pull request as ready for review November 21, 2023 15:21
Comment on lines +4147 to +4148
and len(not_broadcastable_by) == 0
and len(broadcastable_by) == 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why such restrictions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These restrictions describe a simple case when 'by' are columns from the same dataframe. I don't want to handle other cases in this PR as pandas is always tricky in how it handles groupby with mixed by

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added an in-code comment about this

Copy link
Collaborator

@anmyachev anmyachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@anmyachev anmyachev merged commit b8323b5 into modin-project:master Nov 21, 2023
34 of 35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support carrying dtypes in an index cache
2 participants