# docs: added send_recv docs #62

Merged (1 commit) · Aug 1, 2024
`README.md` (2 additions & 54 deletions)

@@ -64,60 +64,8 @@ m8d-post-setup

## Running Examples

* [`m8d.py`](/examples/send_recv/m8d.py) contains a simple example of using the `multiworld` package to send and receive tensors across different processes.
In the example, a leader process is part of multiple worlds and receives tensors from the worker processes.
The script can be run with the following commands.

This example requires running each worker (0, 1, and 2) in a separate terminal window.
The lead worker must be launched as a member of two worlds, 1 and 2, with rank 0 in each.
Each child worker must match one of the lead worker's world indices and take rank 1 in that world.
The `--worldinfo` argument is composed of a world index and the worker's rank in that world
(e.g., `--worldinfo 1,0` means the worker has rank `0` in the world with index `1`); a minimal parsing sketch appears after this example.
The script can be executed on a single host or across hosts.
To run processes on different hosts, use the `--addr` argument.
For example, run the following commands, replacing the IP address (10.20.1.50) with the correct one for your setup.

```bash
# on terminal window 1
python m8d.py --backend nccl --worldinfo 1,0 --worldinfo 2,0 --addr 10.20.1.50
# on terminal window 2
python m8d.py --backend nccl --worldinfo 1,1 --addr 10.20.1.50
# on terminal window 3
python m8d.py --backend nccl --worldinfo 2,1 --addr 10.20.1.50
```

Here, the IP address is that of rank 0. We assume at least 3 GPUs are available, either on a single host or across hosts.
If the scripts are executed on a single host, `--addr` can be omitted.

While running the above example, one can terminate a worker (e.g., rank 2), and the leader (rank 0) continues to receive tensors from the remaining worker.

`MultiWorld` provides fault management at the worker level, meaning it can detect, tolerate, and recover from faults occurring at a worker in a host.
Hence, the above example can run on a single host or across hosts. For cross-host execution, the IP address must be that of rank 0.
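
To make the `--worldinfo` format concrete, here is a minimal parsing sketch (illustrative only; the argument names match the example commands above, but this is not the actual code in `m8d.py`):

```python
# Hypothetical sketch of parsing the documented "--worldinfo INDEX,RANK"
# format; m8d.py's real implementation may differ.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--worldinfo",
    action="append",  # the flag can repeat, one entry per world
    required=True,
    help="world index and rank, e.g. '1,0'",
)
args = parser.parse_args()

# "--worldinfo 1,0 --worldinfo 2,0" -> {1: 0, 2: 0}: this process is
# rank 0 in worlds 1 and 2, i.e., the leader in the example above.
rank_in_world = {}
for info in args.worldinfo:
    world_index, rank = (int(part) for part in info.split(","))
    rank_in_world[world_index] = rank
```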

* [`single_world.py`](/examples/single_world.py) contains a simple example using native PyTorch, in which all the processes belong to the same world. The script can be run with the following commands.

To run all processes on the same host, run:

```bash
python single_world.py --backend nccl --worldsize 3
```

Running processes on different hosts requires at least two hosts.
For example, run the following commands in a two-host setting:

```bash
# on host 1
python single_world.py --backend nccl --addr 10.20.1.50 --multihost --worldsize 3 --rank 0
# on host 2
python single_world.py --backend nccl --addr 10.20.1.50 --multihost --worldsize 3 --rank 1
# on host 2
python single_world.py --backend nccl --addr 10.20.1.50 --multihost --worldsize 3 --rank 2
```

In this example, terminating one worker (e.g., rank 2) terminates all the workers in the process group.
There is an option, `--nccl_async_error_handle_cleanup`, that sets the `TORCH_NCCL_ASYNC_ERROR_HANDLING` OS environment variable to `2` (CleanUpOnly mode).
Enabling that option does not solve the fault-tolerance issue either:
it leaves error handling to the main program but does not prevent the other ranks (i.e., 0 and 1) from aborting NCCL's communicator.
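
For reference, the option amounts to setting a standard PyTorch environment variable before process-group initialization; a minimal sketch:

```python
# Sketch: the effect of --nccl_async_error_handle_cleanup. The variable must
# be set before torch.distributed.init_process_group() so NCCL picks it up.
import os

os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "2"  # 2 = CleanUpOnly mode
```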
The list of all available examples can be found in the [`examples`](/examples) folder.
We recommend starting with the [`send_recv`](/examples/send_recv) example.

## Generating Documentation

`examples/send_recv/README.md` (51 additions & 0 deletions)

@@ -0,0 +1,51 @@
# Send and Receive

* [`m8d.py`](/examples/send_recv/m8d.py) contains a simple example of using the `multiworld` package to send and receive tensors across different processes.
In the example, a leader process is part of multiple worlds and receives tensors from the worker processes.
The script can be run with the following commands.

This example requires running each worker (0, 1, and 2) in a separate terminal window.
The lead worker must be launched as a member of two worlds, 1 and 2, with rank 0 in each.
Each child worker must match one of the lead worker's world indices and take rank 1 in that world.
The `--worldinfo` argument is composed of a world index and the worker's rank in that world
(e.g., `--worldinfo 1,0` means the worker has rank `0` in the world with index `1`).
The script can be executed on a single host or across hosts.
To run processes on different hosts, use the `--addr` argument.
For example, run the following commands, replacing the IP address (10.20.1.50) with the correct one for your setup.

```bash
# on terminal window 1
python m8d.py --backend nccl --worldinfo 1,0 --worldinfo 2,0 --addr 10.20.1.50
# on terminal window 2
python m8d.py --backend nccl --worldinfo 1,1 --addr 10.20.1.50
# on terminal window 3
python m8d.py --backend nccl --worldinfo 2,1 --addr 10.20.1.50
```

Here, the IP address is that of rank 0. We assume at least 3 GPUs are available, either on a single host or across hosts.
If the scripts are executed on a single host, `--addr` can be omitted.

While running the above example, one can terminate a worker (e.g., rank 2), and the leader (rank 0) continues to receive tensors from the remaining worker.

`MultiWorld` provides fault management at the worker level, meaning it can detect, tolerate, and recover from faults occurring at a worker in a host.
Hence, the above example can run on a single host or across hosts. For cross-host execution, the IP address must be that of rank 0.
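
A conceptual sketch of this behavior (plain Python, not `multiworld`'s actual API; `recv_fns` is a hypothetical mapping from world index to a blocking receive callable):

```python
# Conceptual sketch only: a leader that keeps receiving from surviving worlds
# after one worker dies; a fault in one world does not stop the others.
def leader_loop(recv_fns):
    alive = set(recv_fns)
    while alive:
        for idx in sorted(alive):
            try:
                tensor = recv_fns[idx]()  # blocks until rank 1 of world idx sends
                print(f"received from world {idx}: {tensor}")
            except RuntimeError:
                alive.discard(idx)  # drop the failed world, keep serving the rest
```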

* [`single_world.py`](/examples/single_world.py) contains a simple example using native PyTorch, in which all the processes belong to the same world. The script can be run with the following commands.

To run all processes on the same host, run:

```bash
python single_world.py --backend nccl --worldsize 3
```

Running processes on different hosts requires at least two hosts.
For example, run the following commands in a two-host setting:

```bash
# on host 1
python single_world.py --backend nccl --addr 10.20.1.50 --multihost --worldsize 3 --rank 0
# on host 2
python single_world.py --backend nccl --addr 10.20.1.50 --multihost --worldsize 3 --rank 1
# on host 2
python single_world.py --backend nccl --addr 10.20.1.50 --multihost --worldsize 3 --rank 2
```
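
For comparison, here is a minimal native-PyTorch sketch of the single-world pattern this script demonstrates (illustrative only, not `single_world.py` itself; the `gloo` backend is used so it also runs without GPUs):

```python
# Minimal single-world sketch with native torch.distributed: rank 0 receives
# one tensor from every other rank. Launch one process per rank.
import os

import torch
import torch.distributed as dist


def run(rank: int, world_size: int, addr: str = "127.0.0.1") -> None:
    os.environ["MASTER_ADDR"] = addr
    os.environ["MASTER_PORT"] = "29500"  # any free port, same on all ranks
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        buf = torch.zeros(1)
        for src in range(1, world_size):
            dist.recv(buf, src=src)  # blocks until rank `src` sends
            print(f"rank 0 received {buf.item()} from rank {src}")
    else:
        dist.send(torch.tensor([float(rank)]), dst=0)

    dist.destroy_process_group()
```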