Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#5221: add execute to trigger lazy computations and wait for them to complete #6648

Merged
merged 10 commits into from
Oct 25, 2023

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Oct 13, 2023

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Add trigger_execution to dataframe and series #5221
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev anmyachev force-pushed the issue5221 branch 3 times, most recently from dc00c0b to ae8c757 Compare October 16, 2023 13:02
modin/utils.py Outdated
@@ -606,6 +606,46 @@ def try_cast_to_pandas(obj: Any, squeeze: bool = False) -> Any:
return obj


def trigger_import(obj: Any) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we need to notify user if this trigger failed (due to old HDK for instance).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we guarantee that reading was executed for modin on ray?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we need to notify user if this trigger failed (due to old HDK for instance).

I believe there will be an exception from HDK in this case. It's enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this function should only be called with HDK dataframe?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your questions got me thinking about whether data trigger functionality is needed. Shouldn't we do this always when we use the functionality to materialize all computations?

It’s also interesting to know @AndreyPavlenko opinion on this matter.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The data is only imported before the execution on the HDK side. Some operations could be performed with arrow, in this case, the data is not imported. The arrow execution is only performed if we have an arrow table in partitions. When we force import, the arrow table is imported to HDK. After that, we don't have the arrow table anymore and, thus, the subsequent execution to be performed with HDK. In benchmarks, force import is used to separate the data load and the execution time counting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In benchmarks, force import is used to separate the data load and the execution time counting.

But how can we make fair measurements in this case? In one case, we know that the next operation will be performed on HDK, so we import data so as not to measure the time of data movement, but in the other case, not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import time is measured separately, on the load data stage.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we guarantee that reading was executed for modin on ray?

@Egor-Krivov this concept does not exist for any engine other than HDK. Therefore, I removed the separate function and left a specific parameter for this with the mention that it can be used for different engines, but any actions will be performed only for HDK.

@Egor-Krivov
Copy link
Contributor

I would suggest different name for this call. wait_computations is too long and not obvious.
I suggest one of:
collect - used in polars.
run - used in pyhdk
execute
compute

…utations and wait for them to complete

Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev anmyachev changed the title FEAT-#5221: add wait_computations to trigger lazy computations and wait for them to complete FEAT-#5221: add execute to trigger lazy computations and wait for them to complete Oct 19, 2023
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev anmyachev marked this pull request as ready for review October 20, 2023 17:30
@anmyachev anmyachev requested review from a team as code owners October 20, 2023 17:30
@Garra1980
Copy link
Collaborator

I would say we need to explicitly mention new functionality in the docs.

if not hasattr(obj, "_query_compiler"):
continue
query_compiler = obj._query_compiler
query_compiler.execute()
Copy link
Contributor

@Egor-Krivov Egor-Krivov Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note, that current implementation in timedf doesn't perform query_compiler.excute() when trigger_hdk_import is True. I don't remember exact reason, but it was part of our discussion in https://github.com/intel-ai/timedf/pull/460

@AndreyPavlenko Would this unconditional execute be a problem?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the operations have been executed with HDK, then we have nothing to import here, if with arrow - we will just import the arrow table to HDK, that could be redundant.

I think, force_import() should be a separate operation.

Copy link
Contributor

@Egor-Krivov Egor-Krivov Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndreyPavlenko Can current implementation negatively affect performance or it's just harmless redundancy?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not have any impact on performance, but, for example, if you want to measure an arrow-based execution time, you will not get a precise execution time, because the import will be performed after the execution. I.e. you will get the execution + import time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this issue offline with @AndreyPavlenko & @anmyachev , decided to keep the current implementation

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev
Copy link
Collaborator Author

I would say we need to explicitly mention new functionality in the docs.

@Garra1980 new doc page: https://modin--6648.org.readthedocs.build/en/6648/flow/modin/utils.html

@anmyachev anmyachev merged commit e558d9d into modin-project:master Oct 25, 2023
37 checks passed
@anmyachev anmyachev deleted the issue5221 branch October 25, 2023 17:43
anmyachev added a commit to anmyachev/modin that referenced this pull request Oct 25, 2023
…nd wait for them to complete (modin-project#6648)

Signed-off-by: Anatoly Myachev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add trigger_execution to dataframe and series
5 participants