PERF-#5533: Improved sort_values by reducing the number of partitions #6589

Merged (5 commits) on Sep 29, 2023

Collaborator

@AndreyPavlenko AndreyPavlenko commented Sep 19, 2023

  1. In groupby_reduce(), num_splits is limited by the number of partitions. We assume here that groupby should not increase the current data size.
  2. Added a heuristic to the text file reader for calculating the number of partitions:
    num_partitions = min((num_rows * num_cols) // 64_000, NPartitions.get())
    An approximate number of rows is estimated by reading the first 10 lines.
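
The two points above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code from this PR: the function names are hypothetical, `max_partitions` stands in for `NPartitions.get()`, the row estimate (file size divided by the average length of the sampled lines) is a guess at how the first 10 lines could be used, and the lower bound of 1 partition is an addition of this sketch.

```python
import os

CELLS_PER_PARTITION = 64_000  # constant from the formula in the description
SAMPLE_LINES = 10             # the description says the first 10 lines are read


def estimate_shape(path, sample_lines=SAMPLE_LINES, sep=b","):
    """Roughly estimate (num_rows, num_cols) of a delimited text file.

    Assumption of this sketch: rows ~= file size / average sampled line length.
    """
    file_size = os.path.getsize(path)
    sample = []
    with open(path, "rb") as f:
        for _ in range(sample_lines):
            line = f.readline()
            if not line:  # file is shorter than the sample
                break
            sample.append(line)
    if not sample:
        return 0, 0
    num_cols = sample[0].count(sep) + 1
    avg_line_len = sum(len(line) for line in sample) / len(sample)
    return int(file_size / avg_line_len), num_cols


def heuristic_num_partitions(num_rows, num_cols, max_partitions):
    """min((num_rows * num_cols) // 64_000, max_partitions), but at least 1.

    The lower bound of 1 is an assumption of this sketch.
    """
    return max(1, min((num_rows * num_cols) // CELLS_PER_PARTITION, max_partitions))


def limit_num_splits(requested_splits, current_num_partitions):
    """Point 1: never split a groupby result into more parts than already exist."""
    return min(requested_splits, current_num_partitions)
```

With a partition limit of 16, a file estimated at 100,000 rows by 2 columns would get 3 partitions under this formula, while one at 1,000,000 rows by 10 columns would be capped at 16.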

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves [PERF] Slow sort_values in value_counts #5533
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@AndreyPavlenko AndreyPavlenko force-pushed the issue-5533 branch 2 times, most recently from e79e325 to ff03c40 Compare September 24, 2023 15:08
@AndreyPavlenko AndreyPavlenko marked this pull request as ready for review September 24, 2023 16:14
@AndreyPavlenko AndreyPavlenko requested a review from a team as a code owner September 24, 2023 16:14
…of partitions

1. In groupby_reduce(), num_splits is limited by the number of partitions.
   We assume here that groupby should not increase the current data size.
2. Added a heuristic to the text file reader for calculating the number of
   partitions:
     num_partitions = min((num_rows * num_cols) // 64_000, NPartitions.get())
   An approximate number of rows is estimated by reading the first 10 lines.

Signed-off-by: Andrey Pavlenko <[email protected]>
@anmyachev
Collaborator

@AndreyPavlenko could you provide perf results for the reproducer from #5533 (comment)?

@AndreyPavlenko
Collaborator Author

> @AndreyPavlenko could you provide perf results for the reproducer from #5533 (comment)?

This reproducer requires one more fix for read_csv(), which was dropped in the last commit (AndreyPavlenko@efd91bf#diff-b73ae9581d0213011834cbe1316a85876e77c2bf00b5d93e9e05be078699f04fR1090). It was decided to implement a separate solution for read_csv().

@anmyachev
Collaborator

> > @AndreyPavlenko could you provide perf results for the reproducer from #5533 (comment)?
>
> This reproducer requires one more fix for read_csv(), which was dropped in the last commit (AndreyPavlenko@efd91bf#diff-b73ae9581d0213011834cbe1316a85876e77c2bf00b5d93e9e05be078699f04fR1090). It was decided to implement a separate solution for read_csv().

Should we create another issue in this case?

@AndreyPavlenko
Collaborator Author

> Should we create another issue in this case?

#6616

Collaborator

@anmyachev anmyachev left a comment

LGTM!

@anmyachev anmyachev merged commit 65ad735 into modin-project:master Sep 29, 2023