Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#6594: fix usage of Modin objects inside UDFs for apply #6673

Merged
merged 9 commits into from
Nov 14, 2023

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Oct 22, 2023

What do these changes do?

The PR also fixes #4919

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: Using Modin objects within an apply fails, with unclear error message #6594
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev anmyachev force-pushed the issue6594 branch 4 times, most recently from 6991a97 to 582355a Compare October 23, 2023 22:33
@anmyachev
Copy link
Collaborator Author

@Retribution98 could you check why test__reduce__ failed on unidist?

The error: KeyError: <weakref at 0x0000025BB1211220; to 'MasterDataID' at 0x0000025BB0F900A0> (that crashed python).

@anmyachev anmyachev marked this pull request as ready for review October 24, 2023 20:17
@anmyachev anmyachev requested review from a team as code owners October 24, 2023 20:17
Garra1980
Garra1980 previously approved these changes Oct 24, 2023
dchigarev
dchigarev previously approved these changes Oct 25, 2023
docs/conf.py Outdated Show resolved Hide resolved
@anmyachev
Copy link
Collaborator Author

@dchigarev could you approve again?

-------
bool
"""
return True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we ever want to recreate modin objects on a worker? wouldn't it make applying any method to them super slow anyway?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we ever want to recreate modin objects on a worker?

The main goal is to avoid materializing the dataframe in the main process and transfer this operation to the worker process.

wouldn't it make applying any method to them super slow anyway?

The only operation needed for this is taking the object by reference (get operation) to perform the conversion to pandas and perform any operations on the pandas object.

@dchigarev
Copy link
Collaborator

could you please also elaborate on how the changes fixes the problem?

@anmyachev
Copy link
Collaborator Author

could you please also elaborate on how the changes fixes the problem?

Instead of creating a Modin object in the worker process and performing operations on it, a Modin object will be created, converted to a pandas, and operations will in turn be performed on the pandas object.

@anmyachev anmyachev mentioned this pull request Nov 9, 2023
7 tasks

Returns
-------
DataFrame
New ``DataFrame`` based on the `query_compiler`.
"""
if os.getpid() != source_pid:
return query_compiler.to_pandas()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct me if I'm wrong, but my understanding is that we use _inflate_light() only when we unpickle a modin.DataFrame from the plasma storage (as if we wanted to read it from the disk, we would use _inflate_full()), so my question is, whether it makes sense to transfer the query compiler to workers and only then call .to_pandas()? Isn't calling .to_pandas() several times on every worker is more expensive than calling it only once in the main process when serializing? What are the benefits of using ._inflate_light()?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the only question that bothers me, otherwise, the PR looks good

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't calling .to_pandas() several times on every worker is more expensive than calling it only once in the main process when serializing?

It may be more expensive, but this method allows calculations to run asynchronously, which mitigates this problem (partially) given that worker processes tend to be under-loaded. On the other hand, the memory consumption on the other hand will be much greater, I believe that you are right and we need to make the call in the main process.

Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
anmyachev and others added 4 commits November 14, 2023 01:49
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
# do the conversion to pandas once on the main process than several times
# on worker processes. Details: https://github.com/modin-project/modin/pull/6673/files#r1391086755
# For Dask, otherwise there may be an error: `coroutine 'Client._gather' was never awaited`
need_update = not PersistentPickle.get() and Engine.get() != "Dask"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchigarev I suppose we can leave the current implementation, but for those cases where to_pandas is called several times (for example, in different apply that go through preprocess), enable a mode in which to_pandas is called once in the main process.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we can leave the current implementation

You mean ._inflate_light()? Where else do we use it? If we are certain that we always want for dataframes to be persistently pickled, then I suppose we shouldn't leave this implementation. There's still a possibility that there are other places in our project where we submit kernels, but don't do this config variables manipulation, which will result into that the ._inflate_light() implementation will be called against our will.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my take is, that we either should drop the ._inflate_light() implementation at all, or remove these config variable manipulations and always execute .__reduce__() in accordance with what a user set in PersistentPickle variable. At this point I'm ok with both of the options, it's up to you to decide @anmyachev

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean ._inflate_light()?

Yes

Where else do we use it?

In a situation like this (it seems to me that this case is almost never seen, if you think so, then I’ll delete _inflate_light):

other = pickle.loads(pickle.dumps(modin_df))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the second thought, it maybe it make sense keeping inflate_light, let's leave it for now

@anmyachev
Copy link
Collaborator Author

@dchigarev unidist stuck on test_io.py again.

@anmyachev
Copy link
Collaborator Author

@dchigarev ready to merge

@dchigarev dchigarev merged commit 7de7b92 into modin-project:master Nov 14, 2023
45 checks passed
@anmyachev anmyachev deleted the issue6594 branch November 14, 2023 19:54
anmyachev added a commit to anmyachev/modin that referenced this pull request Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants