FIX-#6594: fix usage of Modin objects inside UDFs for apply
#6673
6991a97 to 582355a
@Retribution98 could you check why The error:
3a7aabd to 3becdaf
@dchigarev could you approve again?
-------
bool
"""
return True
why would we ever want to recreate modin objects on a worker? wouldn't it make applying any method to them super slow anyway?
why would we ever want to recreate modin objects on a worker?
The main goal is to avoid materializing the dataframe in the main process and transfer this operation to the worker process.
wouldn't it make applying any method to them super slow anyway?
The only operation needed for this is taking the object by reference (a get operation) to perform the conversion to pandas and perform any operations on the pandas object.
could you please also elaborate on how the changes fix the problem?
Instead of creating a Modin object in the worker process and performing operations on it, a Modin object will be created and converted to a pandas object, and the operations will in turn be performed on that pandas object.
    Returns
    -------
    DataFrame
        New ``DataFrame`` based on the `query_compiler`.
    """
    if os.getpid() != source_pid:
        return query_compiler.to_pandas()
correct me if I'm wrong, but my understanding is that we use _inflate_light() only when we unpickle a modin.DataFrame from the plasma storage (if we wanted to read it from the disk, we would use _inflate_full()). So my question is whether it makes sense to transfer the query compiler to workers and only then call .to_pandas(). Isn't calling .to_pandas() several times on every worker more expensive than calling it only once in the main process when serializing? What are the benefits of using ._inflate_light()?
this is the only question that bothers me, otherwise, the PR looks good
Isn't calling .to_pandas() several times on every worker more expensive than calling it only once in the main process when serializing?
It may be more expensive, but this approach lets the calculations run asynchronously, which partially mitigates the problem given that worker processes tend to be under-loaded. On the other hand, memory consumption will be much greater. I believe you are right and we need to make the call in the main process.
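The trade-off discussed here can be illustrated by counting how often the conversion runs under each strategy. This is only a toy model: `CountingFrame`, `convert_in_main` and `convert_per_task` are hypothetical names, the "tasks" run sequentially in one process, and the sketch says nothing about asynchrony or memory use.

```python
class CountingFrame:
    """Hypothetical frame that counts how often it is materialized."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.conversions = 0

    def to_plain(self):
        # Stand-in for an expensive `.to_pandas()` call.
        self.conversions += 1
        return list(self.rows)


def convert_in_main(frame, n_tasks):
    # Convert once up front, then hand the plain object to every task.
    plain = frame.to_plain()
    return [sum(plain) for _ in range(n_tasks)]


def convert_per_task(frame, n_tasks):
    # Each task converts for itself, so the cost is paid n_tasks times.
    return [sum(frame.to_plain()) for _ in range(n_tasks)]
```

Both functions return the same results; the difference is that `convert_in_main` triggers one conversion while `convert_per_task` triggers one per task.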
…pply' Signed-off-by: Anatoly Myachev <[email protected]>
# do the conversion to pandas once on the main process than several times
# on worker processes. Details: https://github.com/modin-project/modin/pull/6673/files#r1391086755
# For Dask, otherwise there may be an error: `coroutine 'Client._gather' was never awaited`
need_update = not PersistentPickle.get() and Engine.get() != "Dask"
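One way such a `need_update` flag might be used is to flip the setting only for the duration of the serialization and restore it afterwards. The sketch below assumes that usage; `Flag` and `persistent_pickle_enabled` are hypothetical names, and only the `get`/`put` style and the `need_update` condition are taken from the diff.

```python
from contextlib import contextmanager


class Flag:
    """Hypothetical config variable with get/put accessors."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value

    def put(self, value):
        self._value = value


@contextmanager
def persistent_pickle_enabled(flag, engine):
    # Flip the flag only when it is off and the engine is not Dask,
    # mirroring the `need_update` condition in the diff.
    need_update = not flag.get() and engine != "Dask"
    if need_update:
        flag.put(True)
    try:
        yield need_update
    finally:
        if need_update:
            # Restore the value the user originally set.
            flag.put(False)
```

Wrapping the toggle in `try`/`finally` keeps the user's setting intact even if serialization raises, which addresses part of the concern about config manipulation leaking into other code paths.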
@dchigarev I suppose we can leave the current implementation, but for those cases where to_pandas is called several times (for example, in different apply calls that go through preprocess), enable a mode in which to_pandas is called once in the main process.
I suppose we can leave the current implementation
You mean ._inflate_light()? Where else do we use it? If we are certain that we always want dataframes to be persistently pickled, then I suppose we shouldn't leave this implementation. There's still a possibility that there are other places in our project where we submit kernels but don't do this config variable manipulation, which will result in the ._inflate_light() implementation being called against our will.
my take is that we should either drop the ._inflate_light() implementation altogether, or remove these config variable manipulations and always execute .__reduce__() in accordance with what the user set in the PersistentPickle variable. At this point I'm ok with both options, it's up to you to decide @anmyachev
You mean ._inflate_light()?
Yes
Where else do we use it?
In a situation like this (it seems to me that this case almost never occurs; if you agree, I'll delete _inflate_light):
other = pickle.loads(pickle.dumps(modin_df))
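The round trip above can be sketched with a toy class whose `__reduce__` chooses between a by-reference and a by-value path. Everything here is an assumption made for illustration: `_OBJECT_STORE`, `SETTINGS`, `ToyFrame`, `inflate_by_reference` and `inflate_by_value` are hypothetical and only loosely echo the light and full inflate paths discussed in this thread.

```python
import pickle

_OBJECT_STORE = {}  # hypothetical in-process stand-in for a shared object store
SETTINGS = {"persistent": False}  # hypothetical stand-in for PersistentPickle


class ToyFrame:
    """Hypothetical frame whose data lives in the toy object store."""

    def __init__(self, rows, key=None):
        self.key = key if key is not None else len(_OBJECT_STORE)
        _OBJECT_STORE.setdefault(self.key, list(rows))

    @property
    def rows(self):
        return _OBJECT_STORE[self.key]

    def __reduce__(self):
        if SETTINGS["persistent"]:
            # By value: the pickle carries the data itself.
            return inflate_by_value, (list(self.rows),)
        # By reference: the pickle carries only the store key.
        return inflate_by_reference, (self.key,)


def inflate_by_reference(key):
    return ToyFrame(_OBJECT_STORE[key], key=key)


def inflate_by_value(rows):
    return ToyFrame(rows)


frame = ToyFrame([1, 2, 3])
other = pickle.loads(pickle.dumps(frame))  # shares the stored data with `frame`
```

The by-reference pickle is only meaningful while the store is reachable, which is why a by-value mode is needed for anything written to disk.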
on second thought, maybe it makes sense to keep inflate_light, let's leave it for now
@dchigarev unidist stuck on
@dchigarev ready to merge
…pply` (modin-project#6673) Signed-off-by: Anatoly Myachev <[email protected]>
What do these changes do?
The PR also fixes #4919
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
apply fails, with unclear error message #6594
docs/development/architecture.rst is up-to-date