FIX-#6594: fix usage of Modin objects inside UDFs for `apply` #6673

anmyachev · 2023-10-22T23:03:45Z

What do these changes do?

The PR also fixes #4919

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves BUG: Using Modin objects within an apply fails, with unclear error message #6594
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

anmyachev · 2023-10-23T23:06:18Z

@Retribution98 could you check why test__reduce__ failed on unidist?

The error: KeyError: <weakref at 0x0000025BB1211220; to 'MasterDataID' at 0x0000025BB0F900A0> (that crashed python).

modin/pandas/test/dataframe/test_pickle.py

modin/core/execution/dask/implementations/pandas_on_dask/dataframe/dataframe.py

docs/conf.py

anmyachev · 2023-11-06T13:32:14Z

@dchigarev could you approve again?

dchigarev · 2023-11-07T10:23:07Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+        -------
+        bool
+        """
+        return True


why would we ever want to recreate modin objects on a worker? wouldn't it make applying any method to them super slow anyway?

why would we ever want to recreate modin objects on a worker?

The main goal is to avoid materializing the dataframe in the main process and transfer this operation to the worker process.

wouldn't it make applying any method to them super slow anyway?

The only operation needed for this is taking the object by reference (get operation) to perform the conversion to pandas and perform any operations on the pandas object.

dchigarev · 2023-11-07T10:24:42Z

could you please also elaborate on how the changes fix the problem?

anmyachev · 2023-11-08T13:09:37Z

could you please also elaborate on how the changes fix the problem?

Instead of creating a Modin object in the worker process and performing operations on it, a Modin object will be created, converted to a pandas object, and the operations will in turn be performed on that pandas object.
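As a rough illustration of the idea described here, the following standard-library sketch records the pickling process id in `__reduce__` and returns a plain local object when the unpickling happens in a different process. `ToyFrame`, `to_local` and `_inflate` are hypothetical stand-ins invented for this sketch; they are not Modin's classes or its actual implementation.

```python
import os
import pickle


class ToyFrame:
    """Hypothetical stand-in for a distributed dataframe handle."""

    def __init__(self, rows):
        self.rows = rows

    def to_local(self):
        # Stand-in for a `to_pandas()`-style materialization.
        return list(self.rows)

    def __reduce__(self):
        # Remember which process pickled the object.
        return _inflate, (self.rows, os.getpid())


def _inflate(rows, source_pid):
    """Rebuild the object; hand back plain data outside the source process."""
    frame = ToyFrame(rows)
    if os.getpid() != source_pid:
        # Unpickled somewhere else (e.g. a worker): return the local object.
        return frame.to_local()
    return frame


# Round trip inside the same process keeps the wrapper type.
same_process = pickle.loads(pickle.dumps(ToyFrame([1, 2, 3])))

# Simulate a worker by calling the inflate function with a foreign pid.
other_process = _inflate([1, 2, 3], source_pid=-1)
```

In this toy version the UDF on the "worker" side only ever sees the plain list, which mirrors the statement above that the operations end up being performed on the pandas object.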

dchigarev · 2023-11-13T13:01:20Z

modin/pandas/dataframe.py


        Returns
        -------
        DataFrame
            New ``DataFrame`` based on the `query_compiler`.
        """
+        if os.getpid() != source_pid:
+            return query_compiler.to_pandas()


Correct me if I'm wrong, but my understanding is that we use _inflate_light() only when we unpickle a modin.DataFrame from the plasma storage (if we wanted to read it from disk, we would use _inflate_full()). So my question is whether it makes sense to transfer the query compiler to workers and only then call .to_pandas(). Isn't calling .to_pandas() several times on every worker more expensive than calling it only once in the main process when serializing? What are the benefits of using ._inflate_light()?

this is the only question that bothers me, otherwise, the PR looks good

Isn't calling .to_pandas() several times on every worker more expensive than calling it only once in the main process when serializing?

It may be more expensive, but this method allows calculations to run asynchronously, which partially mitigates the problem, given that worker processes tend to be under-loaded. On the other hand, the memory consumption will be much greater. I believe you are right and we need to make the call in the main process.
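The trade-off discussed in this thread can be made concrete with a small counting experiment. The sketch below uses only the standard library; `CountingFrame`, `_identity` and `_convert_on_load` are hypothetical names for this illustration. It pickles one object once and unpickles the payload several times, as if several workers received it, and counts how many conversions happen in the eager (convert while pickling) and lazy (convert on every load) variants.

```python
import pickle


class CountingFrame:
    """Hypothetical handle that counts simulated materializations."""

    conversions = 0  # class-wide counter

    def __init__(self, rows, eager):
        self.rows = rows
        self.eager = eager

    def _materialize(self):
        CountingFrame.conversions += 1
        return list(self.rows)

    def __reduce__(self):
        if self.eager:
            # Convert once in the pickling process and ship plain data.
            return _identity, (self._materialize(),)
        # Ship the handle; every unpickler converts on its own.
        return _convert_on_load, (self.rows,)


def _identity(data):
    return data


def _convert_on_load(rows):
    return CountingFrame(rows, eager=False)._materialize()


def count_conversions(eager, n_workers):
    """Pickle once, unpickle n_workers times, report the conversion count."""
    CountingFrame.conversions = 0
    payload = pickle.dumps(CountingFrame(range(5), eager))
    results = [pickle.loads(payload) for _ in range(n_workers)]
    return CountingFrame.conversions, results


eager_count, _ = count_conversions(eager=True, n_workers=4)
lazy_count, _ = count_conversions(eager=False, n_workers=4)
```

With four simulated workers the eager variant converts once and the lazy variant converts four times. The sketch says nothing about asynchrony or memory use, which are the other two factors weighed in the comment above.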


anmyachev · 2023-11-14T15:18:33Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

+        # do the conversion to pandas once on the main process than several times
+        # on worker processes. Details: https://github.com/modin-project/modin/pull/6673/files#r1391086755
+        # For Dask, otherwise there may be an error: `coroutine 'Client._gather' was never awaited`
+        need_update = not PersistentPickle.get() and Engine.get() != "Dask"
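The `need_update` pattern in the diff above (switch a config flag on only if it was off, skip one engine, restore it afterwards) can be sketched as a context manager. This is an illustration under stated assumptions, not the code from the PR: `ToyFlag` is a hypothetical stand-in for a config variable exposing `get()` and `put()`, and `force_flag` is a name made up for this sketch.

```python
from contextlib import contextmanager


class ToyFlag:
    """Hypothetical stand-in for a config variable with get()/put()."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value

    def put(self, value):
        self._value = value


@contextmanager
def force_flag(flag, engine_name, skipped_engine="Dask"):
    """Temporarily enable `flag` unless it is already on or the engine is skipped."""
    need_update = not flag.get() and engine_name != skipped_engine
    if need_update:
        flag.put(True)
    try:
        yield need_update
    finally:
        # Restore the previous value only if this block changed it.
        if need_update:
            flag.put(False)


persistent_pickle = ToyFlag(False)
with force_flag(persistent_pickle, "Ray") as updated:
    inside = persistent_pickle.get()
after = persistent_pickle.get()

with force_flag(persistent_pickle, "Dask") as updated_on_dask:
    inside_on_dask = persistent_pickle.get()
```

Restoring the flag in `finally` keeps a user-provided setting intact even when the wrapped code raises, which is one way to limit the scope of the kind of config manipulation debated in the comments that follow.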


@dchigarev I suppose we can leave the current implementation, but for those cases where to_pandas is called several times (for example, in different apply calls that go through preprocess), enable a mode in which to_pandas is called once in the main process.

I suppose we can leave the current implementation

You mean ._inflate_light()? Where else do we use it? If we are certain that we always want dataframes to be persistently pickled, then I suppose we shouldn't keep this implementation. There's still a possibility that there are other places in our project where we submit kernels but don't do this config variable manipulation, which would result in the ._inflate_light() implementation being called against our will.

My take is that we should either drop the ._inflate_light() implementation altogether, or remove these config variable manipulations and always execute .__reduce__() in accordance with what the user set in the PersistentPickle variable. At this point I'm OK with both options; it's up to you to decide, @anmyachev.

You mean ._inflate_light()?

Yes

Where else do we use it?

In a situation like this (it seems to me that this case is almost never seen; if you think so too, then I'll delete _inflate_light):

modin/modin/pandas/test/dataframe/test_pickle.py

Line 46 in 41ecc92

other = pickle.loads(pickle.dumps(modin_df))

On second thought, it may make sense to keep _inflate_light; let's leave it for now.

anmyachev · 2023-11-14T17:26:57Z

@dchigarev unidist is stuck on test_io.py again.

anmyachev · 2023-11-14T19:16:58Z

@dchigarev ready to merge

anmyachev force-pushed the issue6594 branch 4 times, most recently from 6991a97 to 582355a Compare October 23, 2023 22:33

anmyachev mentioned this pull request Oct 24, 2023

ValueError: Unknown DataID! or KeyError: <weakref at 0x000002432FB250E0; to 'MasterDataID' at 0x000002432F873F40> modin-project/unidist#374

Open

anmyachev force-pushed the issue6594 branch from dd7101a to 468c6d9 Compare October 24, 2023 16:05

anmyachev commented Oct 24, 2023

modin/pandas/test/dataframe/test_pickle.py (outdated)

anmyachev marked this pull request as ready for review October 24, 2023 20:17

anmyachev requested review from a team as code owners October 24, 2023 20:17

Garra1980 reviewed Oct 24, 2023

modin/core/execution/dask/implementations/pandas_on_dask/dataframe/dataframe.py (outdated)

Garra1980 previously approved these changes Oct 24, 2023

anmyachev commented Oct 25, 2023

modin/core/execution/dask/implementations/pandas_on_dask/dataframe/dataframe.py (outdated)

anmyachev dismissed Garra1980’s stale review via fdbf3b4 October 25, 2023 13:24

dchigarev previously approved these changes Oct 25, 2023

anmyachev dismissed dchigarev’s stale review via e566f79 October 25, 2023 18:24

anmyachev force-pushed the issue6594 branch 2 times, most recently from 3a7aabd to 3becdaf Compare October 25, 2023 18:37

anmyachev commented Oct 25, 2023

docs/conf.py (outdated)

dchigarev reviewed Nov 7, 2023

anmyachev mentioned this pull request Nov 9, 2023

Try commits for Release 0.23.2 #6686

Closed

7 tasks

dchigarev reviewed Nov 13, 2023

anmyachev added 4 commits November 14, 2023 01:49

FIX-modin-project#6594: fix usage of Modin objects inside UDFs for 'a…

96cc494

…pply' Signed-off-by: Anatoly Myachev <[email protected]>

xfail mark

50dde2a

Signed-off-by: Anatoly Myachev <[email protected]>

dask fixes

d223793

Signed-off-by: Anatoly Myachev <[email protected]>

add workaround for unidist case

36bffbf

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev and others added 4 commits November 14, 2023 01:49

cleanup

1cecc04

Signed-off-by: Anatoly Myachev <[email protected]>

fix 'DataFrame.__reduce__'

04d24f9

Signed-off-by: Anatoly Myachev <[email protected]>

Apply suggestions from code review

69943e3

just a try

fb26aa8

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev force-pushed the issue6594 branch from 4dbf7d6 to fb26aa8 Compare November 14, 2023 00:49

anmyachev requested review from aregm, gshimansky, ienkovich, YarShev, vnlitvinov, AndreyPavlenko, devin-petersohn, mvashishtha and RehanSD as code owners November 14, 2023 00:49

workaround for Dask

3cb4b96

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev mentioned this pull request Nov 14, 2023

Dask doesn't work when PersistentPickle==True #6741

Open

anmyachev commented Nov 14, 2023

dchigarev approved these changes Nov 14, 2023

dchigarev merged commit 7de7b92 into modin-project:master Nov 14, 2023
45 checks passed

anmyachev deleted the issue6594 branch November 14, 2023 19:54

anmyachev added a commit to anmyachev/modin that referenced this pull request Nov 14, 2023

FIX-modin-project#6594: fix usage of Modin objects inside UDFs for `a…

271ba82

…pply` (modin-project#6673) Signed-off-by: Anatoly Myachev <[email protected]>
