Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#6767: Provide the ability to use experimental functionality when experimental mode is not enabled globally via an environment variable #6764

Merged
merged 19 commits into from
Dec 8, 2023

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Nov 23, 2023

What do these changes do?

Main changes:

  • New enable_exp_mode context manager
  • New attribute for accessing experimental functionality: modin = CachedAccessor("modin", ExperimentalFunctions)
    • This is done in order to avoid adding new methods to the dataframe when importing module modin.experimental.pandas. When switching between two modes (Experimental and non experimental), it would be quite difficult to remove/add them again.
    • It is not advisable to add experimental functions to the main namespace, so as not to mix them with pandas functions and confuse the users.
  • As a side effect, the factory is now initialized more than once when using enable_exp_mode context manager.

@anmyachev anmyachev changed the title FEAT-#0000: make pickling methods available when experimental mode is activated in runtime FEAT-#6767: Implement 'modin.utils.enable_exp_mode' context manager Nov 23, 2023
modin/utils.py Fixed Show fixed Hide fixed
@anmyachev anmyachev force-pushed the to-pickle-new branch 2 times, most recently from 6e141e8 to 31dfdf1 Compare November 23, 2023 22:24
@anmyachev anmyachev changed the title FEAT-#6767: Implement 'modin.utils.enable_exp_mode' context manager FEAT-#6767: Implement modin.utils.enable_exp_mode context manager Nov 23, 2023
modin/utils.py Outdated
FactoryDispatcher._update_factory_if_already_initialized()
yield
finally:
IsExperimental.put(old_value or False)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old_value can be None

@anmyachev anmyachev marked this pull request as ready for review November 23, 2023 23:52
"""

def __init__(self, data):
if IsExperimental.get() is not True:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of this requirement? I mean, if a user already gone that far by accessing an underscore df._exp field, then he probably understands what he's doing and his intention to use experimental API is certain enough.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could have added a decorator for each of the methods in ExperimentalFunctions that would enable/disable experimental mode in the fly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchigarev done, ready for review

Signed-off-by: Anatoly Myachev <[email protected]>
modin/pandas/accessor.py Fixed Show fixed Hide fixed
modin/utils.py Outdated
old_value = IsExperimental.get()
try:
IsExperimental.put(True)
FactoryDispatcher._update_factory_if_already_initialized()
Copy link
Collaborator

@dchigarev dchigarev Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we hide the factory updating to the IsExperimental.put() method? We can then make this context function generic, so it could work with other config variables that we tend to change often (for example, ExperimentalGroupbyImpl, ExperimentalNumpyAPI, AsyncReading, ...):

with modin.utils.set_config_value(IsExperimental, value=True):
    ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting idea, it seems possible, but let's check it separately.

Comment on lines 51 to 52
setattr(DataFrame, "to_pickle_distributed", to_pickle_distributed) # noqa: F405

Copy link
Collaborator

@dchigarev dchigarev Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? it seems like it breaks the old use-case:

from modin.config import IsExperimental
IsExperimental.put(True)

import modin.experimental.pandas as pd
pd.DataFrame(...).to_pickle_distributed() # fails

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the new behavior via ._exp to be adding a new way of invoking experimental method, but not replacing the previous one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it breaks. Initially, this was done without taking into account the case when the experimental mode should work in conjunction with the regular one. Updating attributes during import makes context switching very difficult.

Since this is experimental code, I thought that such a breaking change would not be a big problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating attributes during import makes context switching very difficult

why? shouldn't enable_exp_mode context do nothing if the experimental mode is already enabled?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable_exp_mode context do nothing if the experimental mode is already enabled?

Yes, but what if import modin.experimental.pandas is made in this context? The expected behavior would be that when exiting the context, no side effects remain.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it still should be fine then. If I understand correctly, after we leave the context, the experimental methods of a DataFrame will still be accessible but not functioning, which seems like a pretty logical behavior

with enable_exp_mode():
    import modin.experimental.pandas as pd
    pd.DataFrame([]).to_pickle_distributed() # works

pd.DataFrame([]).to_pickle_distributed() # raises: not in experimental mode

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I brought back this function, but added a warning about future removal and an alternative to use. I hope this suits you.

Comment on lines 3173 to 3174
# Namespace for experimental functions
_exp = CachedAccessor("_exp", ExperimentalFunctions)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding a private accessor to be used by users so they can access experimental functionality is fine with me, however, would like to hear opinions from others @Garra1980 @YarShev

Signed-off-by: Anatoly Myachev <[email protected]>
@YarShev
Copy link
Collaborator

YarShev commented Nov 27, 2023

Maybe we should initialize both a primary factory and a respective expetimental factory at once and expose experimental pandas API that is slightly different to the original one? Doing so we could avoid an extra accessor _exp and a new context manager enable_exp_mode, which seems to me not user-friendly a bit.

@anmyachev
Copy link
Collaborator Author

and expose experimental pandas API that is slightly different to the original one

Do you mean to provide access to experimental functions in the same namespace as regular functions, but which will only work when experimental mode is enabled?

Maybe we should initialize both a primary factory and a respective expetimental factory at once

Moreover, we could simply combine them into one.

@YarShev
Copy link
Collaborator

YarShev commented Nov 28, 2023

Do you mean to provide access to experimental functions in the same namespace as regular functions, but which will only work when experimental mode is enabled?

Through modin.experimental.pandas and the experimental API shouldn't be depending on MODIN_EXPERIMENTAL.

Moreover, we could simply combine them into one.

We could probably combine some of them into one, and, I think since we have the executions such as ExpetimentalPyarrowOnRay and ExperimentalHdkOnNative, we should provide them when MODIN_EXPERIMENTAL is enabled, i.e. as is now.

@anmyachev
Copy link
Collaborator Author

Through modin.experimental.pandas and the experimental API shouldn't be depending on MODIN_EXPERIMENTAL.

@YarShev I don’t really understand how in this case we will be able to provide access to experimental dataframe methods. Could you provide some more details?

Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Comment on lines +95 to +97
read_sql_distributed = __make_read(
ExperimentalSQLDispatcher, build_args={**build_args, "base_read": read_sql}
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The experimental implementation of reading sql on the master is called a non-experimental implementation of the same function. Due to the fact that classes are built from several other classes at runtime, I have not found a more beautiful way to do the same thing, given the new class architecture (in the PR).

Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev anmyachev changed the title FEAT-#6767: Implement modin.utils.enable_exp_mode context manager FEAT-#6767: Provide the ability to use experimental functionality when experimental mode is not enabled globally via an environment variable Dec 5, 2023
Signed-off-by: Anatoly Myachev <[email protected]>
@@ -115,7 +115,7 @@ def read_sql(

assert IsExperimental.get(), "This only works in experimental mode"

result = FactoryDispatcher.read_sql(**kwargs)
result = FactoryDispatcher.read_sql_distributed(**kwargs)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new name is for internal use only.

docs/usage_guide/advanced_usage/index.rst Outdated Show resolved Hide resolved
modin/core/execution/dispatching/factories/factories.py Outdated Show resolved Hide resolved
modin/core/execution/dispatching/factories/factories.py Outdated Show resolved Hide resolved
modin/experimental/pandas/__init__.py Outdated Show resolved Hide resolved
modin/experimental/pandas/io.py Show resolved Hide resolved
modin/experimental/pandas/test/test_io_exp.py Show resolved Hide resolved
modin/pandas/accessor.py Outdated Show resolved Hide resolved
modin/pandas/dataframe.py Show resolved Hide resolved
modin/pandas/test/test_api.py Show resolved Hide resolved
)
def _read_sql_distributed(cls, **kwargs):
current_execution = get_current_execution()
if current_execution not in supported_executions:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this check into existing is_distributed method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires adding the following attribute to all IO classes:

    read_sql_distributed = __make_read(
        ExperimentalSQLDispatcher, build_args={**build_args, "base_read": read_sql}
    )

Looks like a complication.

@YarShev
Copy link
Collaborator

YarShev commented Dec 7, 2023

Left a few comments + there are some not addressed ones. Other than that, LGTM.

Signed-off-by: Anatoly Myachev <[email protected]>
@@ -13,4 +13,4 @@ Experimental API Reference
.. autofunction:: read_csv_glob
.. autofunction:: read_custom_text
.. autofunction:: read_pickle_distributed
.. automethod:: modin.experimental.pandas.DataFrame.to_pickle_distributed
.. automethod:: modin.pandas.DataFrame.modin::to_pickle_distributed
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a hint to autodoc extension, you can put a :: separator in between module name and object name to let autodoc know the correct module name if it is ambiguous. from https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-automethod

@YarShev
Copy link
Collaborator

YarShev commented Dec 7, 2023

@dchigarev, @Garra1980, any other comments?

@YarShev YarShev merged commit c3a4f78 into modin-project:master Dec 8, 2023
53 checks passed
@anmyachev anmyachev deleted the to-pickle-new branch December 8, 2023 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants