FEAT-#5221: add `execute` to trigger lazy computations and wait for them to complete #6648

anmyachev · 2023-10-13T22:47:12Z

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Add trigger_execution to dataframe and series #5221
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

Egor-Krivov · 2023-10-17T14:27:02Z

modin/utils.py

@@ -606,6 +606,46 @@ def try_cast_to_pandas(obj: Any, squeeze: bool = False) -> Any:
    return obj


+def trigger_import(obj: Any) -> None:


I feel like we need to notify user if this trigger failed (due to old HDK for instance).

Do we guarantee that reading was executed for modin on ray?

> I feel like we need to notify user if this trigger failed (due to old HDK for instance).

I believe HDK will raise an exception in this case. Is that enough?

So this function should only be called with an HDK dataframe?

Your questions got me thinking about whether a separate data-trigger function is needed at all. Shouldn't we always do this whenever we materialize all computations?

It would also be interesting to hear @AndreyPavlenko's opinion on this matter.

No. The data is imported only right before execution on the HDK side. Some operations can be performed with Arrow, and in that case the data is not imported. Arrow execution is used only if the partitions hold an Arrow table. When we force the import, the Arrow table is imported into HDK; after that, we no longer have the Arrow table, so subsequent execution is performed with HDK. In benchmarks, force import is used to separate the data load and the execution time counting.

> In benchmarks, force import is used to separate the data load and the execution time counting.

But how can we make fair measurements in this case? In one case we know the next operation will be performed on HDK, so we import the data up front to avoid measuring data movement time; in the other case we don't.

The import time is measured separately, during the data load stage.

I see, thanks

> Do we guarantee that reading was executed for modin on ray?

@Egor-Krivov this concept does not exist for any engine other than HDK. Therefore, I removed the separate function and kept a dedicated parameter instead, documenting that it can be passed with any engine but only has an effect for HDK.
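The design settled on in this thread (one entry point with an HDK-only parameter that other engines ignore) can be sketched as follows. This is a minimal illustration under stated assumptions, not the merged code: `StubFrame`, `StubQueryCompiler`, and `StubHdkQueryCompiler` are hypothetical stand-ins, and the `trigger_hdk_import` parameter and `force_import()` method names are borrowed from the review comments in this PR rather than from a verified signature.

```python
from typing import Any, List


class StubQueryCompiler:
    """Hypothetical stand-in for a non-HDK query compiler."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def execute(self) -> None:
        # Pretend to wait for all pending lazy computations.
        self.calls.append("execute")


class StubHdkQueryCompiler(StubQueryCompiler):
    """Hypothetical stand-in for an HDK-backed query compiler."""

    def force_import(self) -> None:
        # Pretend to move an Arrow table into HDK.
        self.calls.append("force_import")


class StubFrame:
    """Minimal object exposing `_query_compiler`, as the diff in this PR expects."""

    def __init__(self, query_compiler: StubQueryCompiler) -> None:
        self._query_compiler = query_compiler


def execute(*objs: Any, trigger_hdk_import: bool = False) -> None:
    """Sketch: run pending computations; optionally force the HDK import."""
    for obj in objs:
        if not hasattr(obj, "_query_compiler"):
            continue  # objects without a query compiler are skipped
        query_compiler = obj._query_compiler
        query_compiler.execute()
        # The import step only applies when the compiler supports it (assumed: HDK only).
        if trigger_hdk_import and hasattr(query_compiler, "force_import"):
            query_compiler.force_import()


pandas_like = StubFrame(StubQueryCompiler())
hdk_like = StubFrame(StubHdkQueryCompiler())
execute(pandas_like, hdk_like, 42, trigger_hdk_import=True)
print(pandas_like._query_compiler.calls)
print(hdk_like._query_compiler.calls)
```

With this shape, passing the flag for a non-HDK object is harmless: only the stand-in that defines `force_import()` records an import call.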

Egor-Krivov · 2023-10-17T14:38:10Z

I would suggest a different name for this call: `wait_computations` is too long and not obvious.
I suggest one of:
- `collect` (used in polars)
- `run` (used in pyhdk)
- `execute`
- `compute`


Garra1980 · 2023-10-20T17:42:15Z

I would say we need to explicitly mention new functionality in the docs.

Egor-Krivov · 2023-10-23T12:35:47Z

modin/utils.py

+        if not hasattr(obj, "_query_compiler"):
+            continue
+        query_compiler = obj._query_compiler
+        query_compiler.execute()


Please note that the current implementation in timedf doesn't call `query_compiler.execute()` when `trigger_hdk_import` is True. I don't remember the exact reason, but it was part of our discussion in https://github.com/intel-ai/timedf/pull/460

@AndreyPavlenko Would this unconditional execute be a problem?

If the operations have been executed with HDK, there is nothing to import here. If they were executed with Arrow, we will just import the Arrow table into HDK, which could be redundant.

I think `force_import()` should be a separate operation.

@AndreyPavlenko Can the current implementation negatively affect performance, or is it just harmless redundancy?

It should not have any impact on performance. However, if you want to measure an Arrow-based execution time, for example, you will not get a precise figure, because the import is performed after the execution; that is, you will measure execution plus import time.

Discussed this issue offline with @AndreyPavlenko and @anmyachev; we decided to keep the current implementation.
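The measurement concern raised in this thread can be illustrated with a small simulation. It is only a sketch: `StubArrowBackedCompiler` and `measured_cost` are hypothetical names, and the integer "costs" are made-up units standing in for wall-clock time, not real Modin or HDK timings.

```python
from typing import List, Tuple


class StubArrowBackedCompiler:
    """Hypothetical stand-in: data stays in an Arrow table until imported."""

    def __init__(self, exec_cost: int, import_cost: int) -> None:
        self.exec_cost = exec_cost
        self.import_cost = import_cost
        self.imported = False
        self.log: List[Tuple[str, int]] = []

    def execute(self) -> None:
        # Arrow-based execution: no import happens here.
        self.log.append(("execute", self.exec_cost))

    def force_import(self) -> None:
        # Importing twice would be redundant, so do it at most once.
        if not self.imported:
            self.log.append(("import", self.import_cost))
            self.imported = True


def measured_cost(query_compiler: StubArrowBackedCompiler, trigger_hdk_import: bool) -> int:
    """Return the simulated cost observed by a timer wrapped around one call."""
    start = len(query_compiler.log)
    query_compiler.execute()
    if trigger_hdk_import:
        query_compiler.force_import()
    return sum(cost for _, cost in query_compiler.log[start:])


execution_only = measured_cost(StubArrowBackedCompiler(3, 2), trigger_hdk_import=False)
execution_plus_import = measured_cost(StubArrowBackedCompiler(3, 2), trigger_hdk_import=True)
print(execution_only, execution_plus_import)
```

In this toy model the timed interval grows from the execution cost alone to execution plus import when the import is triggered in the same call, which is the imprecision described above for Arrow-based measurements.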


anmyachev · 2023-10-23T13:23:29Z

> I would say we need to explicitly mention new functionality in the docs.

@Garra1980 new doc page: https://modin--6648.org.readthedocs.build/en/6648/flow/modin/utils.html


anmyachev force-pushed the issue5221 branch 3 times, most recently from dc00c0b to ae8c757 Compare October 16, 2023 13:02

Egor-Krivov reviewed Oct 17, 2023


anmyachev added 2 commits October 19, 2023 21:38

FEAT-modin-project#5221: add 'wait_computations' to trigger lazy comp…

144a4bc

…utations and wait for them to complete Signed-off-by: Anatoly Myachev <[email protected]>

rename to 'execute'

30151f6

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev force-pushed the issue5221 branch from ae8c757 to 30151f6 Compare October 19, 2023 19:38

anmyachev changed the title ~~FEAT-#5221: add wait_computations to trigger lazy computations and wait for them to complete~~ FEAT-#5221: add execute to trigger lazy computations and wait for them to complete Oct 19, 2023

trigger import always if possible; cleanup

8ef1453

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev force-pushed the issue5221 branch from 0a356c4 to 8ef1453 Compare October 19, 2023 21:07

anmyachev added 2 commits October 20, 2023 14:26

fix

76aec65

Signed-off-by: Anatoly Myachev <[email protected]>

allow to pass iterable object

ad2facd

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev force-pushed the issue5221 branch from 483438e to ad2facd Compare October 20, 2023 13:53

fixes

0f1d341

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev force-pushed the issue5221 branch from 13c7c27 to 516b1b9 Compare October 20, 2023 16:06

anmyachev added 2 commits October 20, 2023 18:23

cleanup

d235435

Signed-off-by: Anatoly Myachev <[email protected]>

update test

49c30f4

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev force-pushed the issue5221 branch from 516b1b9 to 49c30f4 Compare October 20, 2023 16:47

anmyachev marked this pull request as ready for review October 20, 2023 17:30

anmyachev requested review from a team as code owners October 20, 2023 17:30

Egor-Krivov reviewed Oct 23, 2023


add docs

404c2ba

Signed-off-by: Anatoly Myachev <[email protected]>

suggestion to use 'execute' instead of 'repr'

55d9a86

Signed-off-by: Anatoly Myachev <[email protected]>

dchigarev approved these changes Oct 25, 2023


anmyachev merged commit e558d9d into modin-project:master Oct 25, 2023
37 checks passed

anmyachev deleted the issue5221 branch October 25, 2023 17:43

anmyachev added a commit to anmyachev/modin that referenced this pull request Oct 25, 2023

FEAT-modin-project#5221: add execute to trigger lazy computations a…

e566f79

…nd wait for them to complete (modin-project#6648) Signed-off-by: Anatoly Myachev <[email protected]>
