Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCS-#7217: Update docs as to when Modin operators work best #7218

Merged
merged 2 commits into from
Apr 26, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 26 additions & 0 deletions docs/flow/modin/core/dataframe/algebra.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,11 @@ Uniformly apply a function argument to each partition in parallel.
.. figure:: /img/map_evaluation.svg
:align: center

This operator performs best when the number of partitions equals to the number of CPUs
so that each single partition gets processed in parallel. When the number of partitions is 1.5x greater than
the number of CPUs, Modin applies a heuristic to join some partitions to get "ideal" partitioning so that
each new partition gets processed in parallel.

Reduce operator
---------------
Applies an argument function that reduces each column or row on the specified axis into a scalar, but requires knowledge about the whole axis.
Expand All @@ -43,13 +48,19 @@ that the reduce function returns a one dimensional frame.
.. figure:: /img/reduce_evaluation.svg
:align: center

This operator performs best when the number of partitions (row or column partitions in depend on the specified axis)
equals to the number of CPUs so that each single axis partition gets processed in parallel.

TreeReduce operator
-------------------
Applies an argument function that reduces specified axis into a scalar. First applies map function to each partition
in parallel, then concatenates resulted partitions along the specified axis and applies reduce
function. In contrast with `Map function` template, here you're allowed to change partition shape
in the map phase. Note that the execution engine expects that the reduce function returns a one dimensional frame.

This operator performs best when the number of partitions (including the initial and intermediate stages)
equals to the number of CPUs so that each single axis partition gets processed in parallel.

Binary operator
---------------
Applies an argument function, that takes exactly two operands (first is always `QueryCompiler`).
Expand All @@ -65,20 +76,35 @@ the right operand to the left.
it automatically but note that this requires repartitioning, which is a much
more expensive operation than the binary function itself.

This operator performs best when both operands have identical partitioning and the number of partitions of an operand
equals to the number of CPUs so that each single partition gets processed in parallel.

Fold operator
-------------
Applies an argument function that requires knowledge of the whole axis. Be aware that providing this knowledge may be
expensive because the execution engine has to concatenate partitions along the specified axis.

This operator performs best when the number of partitions (row or column partitions in depend on the specified axis)
equals to the number of CPUs so that each single axis partition gets processed in parallel.

GroupBy operator
----------------
Evaluates GroupBy aggregation for that type of functions that can be executed via TreeReduce approach.
To be able to form groups engine broadcasts ``by`` partitions to each partition of the source frame.

This operator performs best when the cardinality of ``by`` columns is low (small number of output groups).
At the ``Map`` stage, the operator computes the aggregation for each row partition individually, meaning,
that the ``Reduce`` stage takes a dataframe with the following number of rows:
``num_groups * n_row_parts``. If the number of groups is too high, there's a risk of getting a dataframe
with even bigger than the initial shape at the ``Reduce`` stage.

Default-to-pandas operator
--------------------------
Do :doc:`fallback to pandas </supported_apis/defaulting_to_pandas>` for passed function.

This operator has a performance penalty for going from a partitioned Modin DataFrame to pandas because of
the communication cost and single-threaded nature of pandas.


How to register your own function
'''''''''''''''''''''''''''''''''
Expand Down
Loading