FEAT-modin-project#6908: Remove the warning regarding engine initialization

Signed-off-by: Igoshev, Iaroslav <[email protected]>
YarShev committed Feb 6, 2024
1 parent 807298d commit 0cec706
Showing 20 changed files with 369 additions and 4,841 deletions.
1 change: 1 addition & 0 deletions docs/getting_started/quickstart.rst
Original file line number Diff line number Diff line change
@@ -61,6 +61,7 @@ For the purpose of demonstration, we will load in modin as ``pd`` and pandas as
#############################################
import time
import ray
# See the Ray documentation to choose the Ray configuration that suits you best.
ray.init()
#############################################
5 changes: 4 additions & 1 deletion docs/getting_started/troubleshooting.rst
@@ -215,6 +215,7 @@ once Python interpreter is started in them so that to avoid a race condition in
import modin.pandas as pd
import modin.config as cfg
# See the Ray documentation to choose the Ray configuration that suits you best.
ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})
pandas_df = pandas.DataFrame(
@@ -357,7 +358,9 @@ or
cfg.Engine.put("dask")
if __name__ == "__main__":
client = Client() # Explicit Dask Client creation.
# Explicit Dask Client creation.
# See the Dask Distributed documentation to choose the Client configuration that suits you best.
client = Client()
df = pd.DataFrame([0, 1, 2, 3])
print(df)
49 changes: 0 additions & 49 deletions docs/getting_started/using_modin/using_modin_locally.rst
@@ -23,55 +23,6 @@ just like you would pandas, since the API is identical to pandas.
**That's it. You're ready to use Modin on your previous pandas workflows!**

Optional Configurations
-----------------------

When using Modin locally on a single machine or laptop (without a cluster), Modin will
automatically create and manage a local Dask or Ray cluster for executing your
code. So when you run an operation for the first time with Modin, you will see a
message like this, indicating that Modin has automatically initialized a local
cluster for you:

.. code-block:: python

    df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

.. code-block:: text

    UserWarning: Ray execution environment not yet initialized. Initializing...
    To remove this warning, run the following python code before doing dataframe operations:

        import ray
        ray.init()

If you prefer to use Dask over Ray as your execution backend, you can use the
following code to modify the default configuration:

.. code-block:: python

    import modin
    modin.config.Engine.put("Dask")

.. code-block:: python

    df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

.. code-block:: text

    UserWarning: Dask execution environment not yet initialized. Initializing...
    To remove this warning, run the following python code before doing dataframe operations:

        from distributed import Client
        client = Client()

Finally, if you already have a Ray or Dask engine initialized, Modin will
automatically attach to whichever engine is available. If you are interested in using
Modin with the HDK engine, please refer to :doc:`these instructions </development/using_hdk>`.
For additional information on other settings you can configure, see the
:doc:`Modin's config </flow/modin/config>` page.

Advanced: Configuring the resources Modin uses
----------------------------------------------

15 changes: 15 additions & 0 deletions docs/usage_guide/advanced_usage/index.rst
@@ -12,6 +12,7 @@ Advanced Usage
modin_xgboost
modin_logging
batch
modin_engines

.. meta::
:description lang=en:
@@ -22,6 +23,16 @@ integrated toolkit for data scientists. We are actively developing data science
such as DataFrame spreadsheet integration, DataFrame algebra, progress bars, SQL queries
on DataFrames, and more. Join us on `Slack`_ and `Discourse`_ for the latest updates!

Modin engines
-------------

Modin supports several execution engines, such as Ray_, Dask_, `MPI through unidist`_, and `HDK`_,
each of which may be the better choice for a particular scenario. On the first operation,
Modin automatically initializes one of these engines to perform distributed/parallel computation.
If you are familiar with a particular execution engine, you can initialize it yourself and
Modin will automatically attach to it. Refer to the :doc:`Modin engines </usage_guide/advanced_usage/modin_engines>` page
for more details.
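
As a rough illustration of choosing an engine up front, the sketch below checks which engine packages are importable and exports the choice through the ``MODIN_ENGINE`` environment variable, the environment counterpart of ``modin.config.Engine``. The helper names (``ENGINE_PACKAGES``, ``installed_engines``, ``pick_engine``) and the preference order are assumptions made for this example and are not part of Modin. HDK is left out because it is selected through a different setting.

```python
import importlib.util
import os

# Illustrative mapping: Modin engine name -> package that must be importable.
ENGINE_PACKAGES = {"ray": "ray", "dask": "distributed", "unidist": "unidist"}


def installed_engines():
    """Return the engine names whose backing package can be found."""
    return [
        name
        for name, package in ENGINE_PACKAGES.items()
        if importlib.util.find_spec(package) is not None
    ]


def pick_engine(preferred, installed):
    """Return the first preferred engine that is installed, or None."""
    for name in preferred:
        if name in installed:
            return name
    return None


engine = pick_engine(["ray", "dask", "unidist"], installed_engines())
if engine is not None:
    # Set this before Modin is imported so the choice is picked up.
    os.environ["MODIN_ENGINE"] = engine
```

If no engine package is found, the sketch leaves the environment untouched and Modin falls back to its own default selection.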

Experimental APIs
-----------------

@@ -118,3 +129,7 @@ downloaded as an artifact from the GitHub Actions tab for further inspection. Se
.. _`tqdm`: https://github.com/tqdm/tqdm
.. _`distributed XGBoost`: https://medium.com/intel-analytics-software/distributed-xgboost-with-modin-on-ray-fc17edef7720
.. _`fuzzydata`: https://github.com/suhailrehman/fuzzydata
.. _Ray: https://github.com/ray-project/ray
.. _Dask: https://github.com/dask/distributed
.. _`MPI through unidist`: https://github.com/modin-project/unidist
.. _HDK: https://github.com/intel-ai/hdk
73 changes: 73 additions & 0 deletions docs/usage_guide/advanced_usage/modin_engines.rst
@@ -0,0 +1,73 @@
Modin engines
=============

As a rule, you don't have to worry about initializing an execution engine, because
Modin automatically initializes one when it performs the first operation.
Modin also has a broad range of :doc:`configuration settings </flow/modin/config>` that
you can use to configure an execution engine. If you have a reason to initialize an execution engine
yourself and you know what you are doing, Modin will automatically attach to whichever engine is available.
Below are some examples of how to initialize a specific execution engine yourself.

Ray
---

You can initialize the Ray engine with a specific number of CPUs (worker processes) to perform computation.

.. code-block:: python

    import ray
    import modin.config as modin_cfg

    ray.init(num_cpus=<N>)
    modin_cfg.Engine.put("ray") # Modin will use Ray engine

For more details on all possible initialization parameters, refer to the `Ray documentation`_.
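
The ``<N>`` placeholder above has to be replaced with a concrete value. One possible way to pick it, shown here only as a sketch, is to derive it from the machine's CPU count while leaving a core free for the main process. The helpers ``choose_num_cpus`` and ``init_ray_for_modin`` are hypothetical names introduced for this example, and reserving one core is an assumption rather than a Modin or Ray recommendation.

```python
import os


def choose_num_cpus(total=None, reserve=1):
    """Return a CPU count for Ray, keeping `reserve` cores free (at least 1)."""
    if total is None:
        total = os.cpu_count() or 1
    return max(1, total - reserve)


def init_ray_for_modin(num_cpus):
    """Start a local Ray runtime and tell Modin to use it.

    Imports are local so that the helper above works without Ray installed.
    """
    import ray
    import modin.config as modin_cfg

    ray.init(num_cpus=num_cpus)
    modin_cfg.Engine.put("ray")  # Modin will use Ray engine
```

Calling ``init_ray_for_modin(choose_num_cpus())`` before the first Modin operation would then start Ray with the computed number of CPUs.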

Dask
----

You can initialize the Dask engine with a specific number of worker processes and threads per worker to perform computation.

.. code-block:: python

    from distributed import Client
    import modin.config as modin_cfg

    client = Client(n_workers=<N1>, threads_per_worker=<N2>)
    modin_cfg.Engine.put("dask") # Modin will use Dask engine

For more details on all possible initialization parameters, refer to the `Dask Distributed documentation`_.
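
The two placeholders ``<N1>`` and ``<N2>`` interact: their product is the total number of threads Dask will run. The sketch below shows one way to split the available CPUs between worker processes and threads per worker. ``split_cpus`` and ``init_dask_for_modin`` are hypothetical helpers written for this example, and the default of two threads per worker is an arbitrary assumption.

```python
import os


def split_cpus(total=None, threads_per_worker=2):
    """Return (n_workers, threads_per_worker) with a product not exceeding `total`."""
    total = total or os.cpu_count() or 1
    threads = max(1, min(threads_per_worker, total))
    return max(1, total // threads), threads


def init_dask_for_modin(n_workers, threads_per_worker):
    """Create a local Dask client and tell Modin to use it.

    Imports are local so that the helper above works without Dask installed.
    """
    from distributed import Client
    import modin.config as modin_cfg

    client = Client(n_workers=n_workers, threads_per_worker=threads_per_worker)
    modin_cfg.Engine.put("dask")  # Modin will use Dask engine
    return client
```

When used in a script, the call ``init_dask_for_modin(*split_cpus())`` would normally go under an ``if __name__ == "__main__":`` guard, since the Dask client starts worker processes.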

MPI through unidist
-------------------

You can initialize the MPI through unidist engine with a specific number of CPUs (worker processes) to perform computation.

.. code-block:: python

    import unidist
    import unidist.config as unidist_cfg
    import modin.config as modin_cfg

    unidist_cfg.Backend.put("mpi")
    unidist_cfg.CpuCount.put(<N>)
    unidist.init()
    modin_cfg.Engine.put("unidist") # Modin will use MPI through unidist engine

For more details on all possible initialization parameters, refer to the `unidist documentation`_.

HDK
---

For now, it is not possible to initialize HDK beforehand. Modin itself initializes it with the required configuration.

.. code-block:: python

    import modin.config as modin_cfg

    modin_cfg.StorageFormat.put("hdk") # Modin will use HDK engine

.. _`Ray documentation`: https://docs.ray.io/en/latest
.. _`Dask Distributed documentation`: https://distributed.dask.org/en/latest
.. _`unidist documentation`: https://unidist.readthedocs.io/en/latest
2 changes: 2 additions & 0 deletions docs/usage_guide/advanced_usage/modin_xgboost.rst
@@ -55,6 +55,7 @@ To start the Ray runtime on a single node:
.. code-block:: python
import ray
# See the Ray documentation to choose the Ray configuration that suits you best.
ray.init()
If you already have a Ray cluster, you can connect to it as follows:
@@ -78,6 +79,7 @@ All processing will be in a `single node` mode.
from sklearn import datasets
import ray
# See the Ray documentation to choose the Ray configuration that suits you best.
ray.init() # Start the Ray runtime for single-node
import modin.pandas as pd
2 changes: 2 additions & 0 deletions docs/usage_guide/benchmarking.rst
@@ -35,6 +35,7 @@ Consider the following ipython script:
import time
import ray
# See the Ray documentation to choose the Ray configuration that suits you best.
ray.init()
df = pd.DataFrame(list(range(MinPartitionSize.get() * 2)))
%time result = df.map(lambda x: time.sleep(0.1) or x)
@@ -146,6 +147,7 @@ That will typically block on any asynchronous computation:
time.sleep(10)
return x + 1
# See the Ray documentation to choose the Ray configuration that suits you best.
ray.init()
df1 = pd.DataFrame(list(range(10_000)), columns=['A'])
result = df1.map(slow_add_one)