FEAT-#5221: add execute to trigger lazy computations and wait for them to complete #6648

Conversation
Compare dc00c0b to ae8c757
modin/utils.py (Outdated)

@@ -606,6 +606,46 @@ def try_cast_to_pandas(obj: Any, squeeze: bool = False) -> Any:
    return obj


def trigger_import(obj: Any) -> None:
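For reference, a minimal sketch of what such a helper could look like, assuming it only acts on HDK-backed objects; the `_modin_frame` attribute path and the `force_import()` call are taken from the discussion below and are assumptions, not the exact implementation:

```python
from typing import Any


def trigger_import(obj: Any) -> None:
    """Force the import of an HDK-backed object, if applicable (sketch)."""
    if not hasattr(obj, "_query_compiler"):
        return
    frame = obj._query_compiler._modin_frame  # assumed attribute path
    # force_import() is HDK-specific: it moves the Arrow table into HDK so
    # that subsequent operations run on the HDK side; other engines have no
    # such method, so nothing happens for them.
    if hasattr(frame, "force_import"):
        frame.force_import()
```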
I feel like we need to notify the user if this trigger failed (due to an old HDK version, for instance).
Do we guarantee that reading was executed for Modin on Ray?
> I feel like we need to notify the user if this trigger failed (due to an old HDK version, for instance).

I believe there will be an exception from HDK in this case. Is that enough?
So this function should only be called with an HDK dataframe?
Your questions got me thinking about whether the data import trigger functionality is needed at all. Shouldn't we always do this when we use the functionality that materializes all computations?
It would also be interesting to know @AndreyPavlenko's opinion on this matter.
No. The data is only imported before execution on the HDK side. Some operations can be performed with Arrow; in that case the data is not imported. Arrow execution is only performed if we have an Arrow table in the partitions. When we force the import, the Arrow table is imported into HDK. After that we no longer have the Arrow table, and thus the subsequent execution is performed with HDK. In benchmarks, force import is used to separate the data load from the execution time measurement.
> In benchmarks, force import is used to separate the data load from the execution time measurement.

But how can we make fair measurements in this case? In one case we know that the next operation will be performed on HDK, so we import the data in order not to measure the time of data movement, but in the other case we don't.
The import time is measured separately, during the data load stage.
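To make the staging concrete, here is a sketch of the benchmark pattern being described, using the helper as it exists at this point in the review; the file name, the column name, and the use of repr() to materialize the lazy result are illustrative assumptions:

```python
import time

import modin.pandas as pd
from modin.utils import trigger_import  # the helper under review

# Data load stage: the read plus the forced import into HDK are measured
# together, so data movement is not billed to the queries below.
start = time.perf_counter()
df = pd.read_csv("data.csv")  # hypothetical input
trigger_import(df)
load_time = time.perf_counter() - start

# Query stage: with the data already inside HDK, only the computation
# itself is timed; repr() forces the lazy result to materialize.
start = time.perf_counter()
repr(df.groupby("a").sum())
exec_time = time.perf_counter() - start

print(f"load: {load_time:.3f}s, query: {exec_time:.3f}s")
```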
I see, thanks
> Do we guarantee that reading was executed for Modin on Ray?

@Egor-Krivov this concept does not exist for any engine other than HDK. Therefore, I removed the separate function and left a dedicated parameter for this, with a note that it can be used with different engines, but any action will be performed only for HDK.
I would suggest a different name for this call.
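In other words, under the final design the call is engine-agnostic and only the HDK backend acts on the import flag. A hypothetical usage sketch (the name execute comes from the title change below; the exact signature and the no-op behavior on other engines are assumptions based on the comment above):

```python
import modin.pandas as pd
from modin.utils import execute  # the utility added in this PR

df = pd.read_csv("data.csv")  # hypothetical input

# Works on any engine: triggers lazy computations and waits for them.
execute(df)

# The extra flag only has an effect on the HDK backend; elsewhere it is
# expected to be a no-op (assumed from the discussion above).
execute(df, trigger_hdk_import=True)
```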
…utations and wait for them to complete Signed-off-by: Anatoly Myachev <[email protected]>
Title changed from "wait_computations to trigger lazy computations and wait for them to complete" to "execute to trigger lazy computations and wait for them to complete".
I would say we need to explicitly mention the new functionality in the docs.
        if not hasattr(obj, "_query_compiler"):
            continue
        query_compiler = obj._query_compiler
        query_compiler.execute()
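For context, these lines presumably sit inside a loop over the objects passed to the new utility. A rough reconstruction of the surrounding function, where the signature and the HDK import step are assumptions (the trigger_hdk_import name comes from the comment below):

```python
from typing import Any


def execute(*objs: Any, trigger_hdk_import: bool = False) -> None:
    """Trigger lazy computations on the passed objects and wait for them (sketch)."""
    for obj in objs:
        if trigger_hdk_import:
            # hypothetical HDK-specific import step; see the force_import
            # discussion above
            ...
        if not hasattr(obj, "_query_compiler"):
            continue
        query_compiler = obj._query_compiler
        query_compiler.execute()
```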
Please note that the current implementation in timedf doesn't perform query_compiler.execute() when trigger_hdk_import is True. I don't remember the exact reason, but it was part of our discussion in https://github.com/intel-ai/timedf/pull/460
@AndreyPavlenko Would this unconditional execute be a problem?
If the operations have been executed with HDK, then we have nothing to import here; if with Arrow, we will just import the Arrow table into HDK, which could be redundant.
I think force_import() should be a separate operation.
@AndreyPavlenko Can the current implementation negatively affect performance, or is it just a harmless redundancy?
It should not have any impact on performance, but, for example, if you want to measure an Arrow-based execution time, you will not get a precise number, because the import will be performed after the execution, i.e. you will get the execution time plus the import time.
Discussed this issue offline with @AndreyPavlenko and @anmyachev; we decided to keep the current implementation.
@Garra1980 new doc page: https://modin--6648.org.readthedocs.build/en/6648/flow/modin/utils.html
…nd wait for them to complete (modin-project#6648) Signed-off-by: Anatoly Myachev <[email protected]>
What do these changes do?
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- docs/development/architecture.rst is up-to-date