-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#7337: Using dynamic partitionning in broadcast_apply #7338
Conversation
3fcff00
to
70975bc
Compare
# FIXME: this is a naive workaround for this problem: https://github.com/modin-project/modin/issues/5394 | ||
# if there are too many partitions then all non-full-axis implementations start acting very badly. | ||
# The here threshold is pretty random though it works fine on simple scenarios | ||
processable_amount_of_partitions = ( | ||
self._modin_frame.num_parts < CpuCount.get() * 32 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can see from this comment, this condition is a naive workaround for a problem that is solved by dynamic partitioning in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was moved back with different message.
# The `broadcast_apply` runtime condition differs from | ||
# the same condition in `map_partitions` because the columnar | ||
# approach for `broadcast_apply` results in a slowdown. | ||
if np.prod(left.shape) <= 1.5 * CpuCount.get(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we have the same condition, haven't we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, because in map_partitions
we use 3 approaches: base_map_partitions
, map_partitions_joined_by_column
and map_axis_partitions
.
In broadcast_apply
we use only 2 approaches: base_broadcast_apply
and broadcast_axis_partitions
because broadcast_joined_by_axis
does not work as well as for map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition changed using a new environment variable DynamicPartitioning
Can you extend the table with measurements in the PR description? dropna, groupby? |
Actual result from this PRfillna
dropna
groupby.sum
Previous results from mainfillna
dropna
groupby.sum
Explanatory noteAs you can see, fillna and groupby results seem expected and this PR shows good speedup, but dropna results seem strange. This is because PR #6472 contained improvements for dropna, but issue#5394 affected it and the new approach only applied to dataframes that have less than 32 column partitions (1024 columns when using the default MinPartitionSize). This PR fixes that problem, but it doesn't help because the old way looks better for data frames with more columns. |
@Retribution98 do you plan to preserve that path?
Is the issue number correct? If you plan to keep the fix, it is better to put it in a separate pull request. |
if parts: | ||
result = pandas.concat(parts, axis=1, copy=False) | ||
else: | ||
result = pandas.DataFrame(columns=result_columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it important to keep the column names in case of empty partitions? How did you come to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do it to get the expectable result, because result columns are known already. Would it bring any problems or slowdowns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to make sure that Modin's behavior matches that of a pandas. Do we have a test for this new code?
# FIXME: this is a naive workaround for this problem: https://github.com/modin-project/modin/issues/5394 | ||
# if there are too many partitions then all non-full-axis implementations start acting very badly. | ||
# The here threshold is pretty random though it works fine on simple scenarios | ||
# The map reduce approach works well for frames with few columnar partitions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# The map reduce approach works well for frames with few columnar partitions | |
# The map reduce approach works well for frames with few row or column partitions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not correct. I got the follows results:
If Dataframe has few rows (less than 10^7) the old way show a better performance regardless of the number of columns.
If Dataframe has plenty of rows (more than 10^7) and few columns the map reduce approach is better.
If Dataframe has plenty of rows (more than 10^7) and plenty of columns the old approach is better again.
But I didn't add any new conditions here, because this is beyond the scope of the current task.
e183124
to
2d73dda
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
fdb165c
to
018e73b
Compare
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-datePerformance results were obtained by using Dataframe with shape (20000, 5000).