
Merge pull request #244 from roscisz/develop
r0.3.3
roscisz authored Mar 10, 2020
2 parents a100701 + a08951d commit 9ba4349
Showing 18 changed files with 286 additions and 57 deletions.
48 changes: 25 additions & 23 deletions README.md
@@ -1,7 +1,7 @@
TensorHive
===
![](https://img.shields.io/badge/release-v0.3.3-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/pypi-v0.3.3-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square)
![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square)
![](https://img.shields.io/badge/hardware-Nvidia-green.svg?style=popout-square)
@@ -90,7 +90,7 @@
```
tensorhive test
```

(optional) If you want to allow your UNIX users to set up their TensorHive accounts on their own and run distributed
programs through `Task execution` plugin, use the `key` command to generate the SSH key for TensorHive:
```
tensorhive key
```
@@ -135,15 +135,22 @@ Terminal warning | Email warning

![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/admin_warning_screenshot.png)

#### Task execution

Thanks to the `Task execution` module, you can define commands for tasks you want to run on any configured nodes.
You can manage them manually or set spawn/terminate date.
Commands are run within a `screen` session, so attaching to it while they are running is a piece of cake.
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/task_nursery_screenshot1.png)

It provides a simple but flexible (**framework-agnostic**) command templating mechanism that will help you automate multi-node trainings.
Additionally, specialized templates help you conveniently set the proper parameters for chosen well-known frameworks:

![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/multi_process.png)

In the [examples](https://github.com/roscisz/TensorHive/tree/master/examples)
directory, you will find sample scenarios of using the `Task execution` module for various
frameworks and computing environments.

Users who want to use this feature must append TensorHive's public key to their `~/.ssh/authorized_keys` on all nodes they want to connect to.
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/task_nursery_screenshot2.png)
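Appending the key can be sketched as follows. This is a minimal, idempotent helper, not TensorHive's own tooling; the key path and contents are hypothetical examples (`tensorhive key` prints the real location):

```shell
# Append TensorHive's public key to a user's authorized_keys, idempotently.
# home_dir and key are placeholders; the real key comes from `tensorhive key`.
append_key() {
  home_dir="$1"; key="$2"
  auth="$home_dir/.ssh/authorized_keys"
  mkdir -p "$home_dir/.ssh"
  touch "$auth"
  # Add the key only if that exact line is not already present.
  grep -qxF "$key" "$auth" || printf '%s\n' "$key" >> "$auth"
  chmod 700 "$home_dir/.ssh" && chmod 600 "$auth"
}
```

On a remote node the same effect can be achieved with `ssh-copy-id` or by piping the key file through `ssh`.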

Features
----------------------
@@ -156,7 +163,7 @@ Features
- [x] :warning: Send warning messages to terminal of users who violate the rules
- [x] :mailbox_with_no_mail: Send e-mail warnings
- [ ] :bomb: Kill unwanted processes
- [X] :rocket: Task execution and scheduling
- [x] :old_key: Execute any command in the name of a user
- [x] :alarm_clock: Schedule spawn and termination
- [x] :repeat: Synchronize process status
@@ -178,7 +185,7 @@ Features
- [x] Edit reservations
- [x] Cancel reservations
- [x] Attach jobs to reservation
- [x] :baby_symbol: Task execution
- [x] Create parametrized tasks and assign to hosts, automatically set `CUDA_VISIBLE_DEVICES`
- [x] Buttons for task spawning/scheduling/termination/killing actions
- [x] Fetch log produced by running task
@@ -204,16 +211,11 @@ TensorHive is currently being used in production in the following environments:

| Organization | Hardware | No. users |
| ------ | -------- | --------- |
| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100) + NVIDIA DGX-1 (8x Tesla V100) | 30+ |
| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 20 machines with GTX 1060 each | 20+ |
| <img src="http://gradient.eti.pg.gda.pl/assets/logo.png" width=15>[Gradient PG](http://gradient.eti.pg.gda.pl/en/) | A server with two GPUs shared by the Gradient science club at GUT. | 30+ |
| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - Conversational Intelligence](https://www.voicelab.ai) | 30+ GTX and RTX GPUs | 10+ |

<hr/>

TensorHive architecture (simplified)
-----------------------
@@ -223,13 +225,13 @@ This diagram will help you to grasp the rough concept of the system.
![TensorHive_diagram _final](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/architecture.png)


Contribution and feedback
------------------------
We'd :heart: to collect your observations, issues and pull requests!

Feel free to **report any configuration problems, we will help you**.

Currently we are working on user groups for differentiated GPU access control,
grouping tasks into jobs, and a process-killing reservation violation handler
(deadline: July 2020 :shipit:), so stay tuned!

@@ -246,10 +248,10 @@ for parallelization of neural network training using multiple GPUs".

Project created and maintained by:
- Paweł Rościszewski [(@roscisz)](https://github.com/roscisz)
- ![](https://avatars2.githubusercontent.com/u/12485656?s=22&v=4) [Michał Martyniak (@micmarty)](https://micmarty.github.io)
- Filip Schodowski [(@filschod)](https://github.com/filschod)

Top contributors:
- Tomasz Menet [(@tomenet)](https://github.com/tomenet)
- Dariusz Piotrowski [(@PiotrowskiD)](https://github.com/PiotrowskiD)
- Karol Draszawka [(@szarakawka)](https://github.com/szarakawka)
Expand Down
95 changes: 95 additions & 0 deletions examples/PyTorch/README.md
@@ -0,0 +1,95 @@
# Using TensorHive for running distributed trainings in PyTorch

## Detailed example description

In this example we show how the TensorHive `task execution` module can be
used for convenient configuration and execution of distributed trainings
implemented in PyTorch. For this purpose, we run
[this PyTorch DCGAN sample application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/PyTorch_dcgan_lsun/README.md)
in a distributed setup consisting of an NVIDIA DGX Station server `ai` and an NVIDIA DGX-1 server `dl`,
equipped with 4 and 8 NVIDIA Tesla V100 GPUs respectively.

In the presented scenario, the servers were shared by a group of users through TensorHive,
and at the moment we were granted reservations for GPUs 1 and 2 on `ai` and GPUs 1 and 7 on `dl`.
The Python environment and training code were available on both nodes,
and a fake training dataset was used.


## Running without TensorHive

In order to enable networking, we had to set the `GLOO_SOCKET_IFNAME`
environment variable to the proper network interface name on each node.
We selected TCP port 20011 for communication.

For our 4-GPU scenario, the following 4 processes had to be executed,
setting consecutive `rank` parameters starting from 0 and the `world-size`
parameter to 4:

worker 0 on `ai`:
```
export CUDA_VISIBLE_DEVICES=1
export GLOO_SOCKET_IFNAME=enp2s0f1
./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=0 --world-size=4 --dataset fake --cuda
```

worker 1 on `ai`:
```
export CUDA_VISIBLE_DEVICES=2
export GLOO_SOCKET_IFNAME=enp2s0f1
./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=1 --world-size=4 --dataset fake --cuda
```

worker 2 on `dl`:
```
export CUDA_VISIBLE_DEVICES=1
export GLOO_SOCKET_IFNAME=enp1s0f0
./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=2 --world-size=4 --dataset fake --cuda
```

worker 3 on `dl`:
```
export CUDA_VISIBLE_DEVICES=7
export GLOO_SOCKET_IFNAME=enp1s0f0
./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=3 --world-size=4 --dataset fake --cuda
```
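The four command lines above differ only in host, GPU, network interface and rank. As a sketch, they can be generated from a small table (paths, hostnames and the rendezvous address mirror this example; adapt them to your cluster):

```shell
#!/bin/sh
# Generate the four per-worker command lines from a host/GPU/interface table.
# Values mirror the example setup above; this only prints the commands.
PY=./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python
MAIN=dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py
INIT=tcp://ai.eti.pg.gda.pl:20011
rank=0
for spec in "ai 1 enp2s0f1" "ai 2 enp2s0f1" "dl 1 enp1s0f0" "dl 7 enp1s0f0"; do
  set -- $spec  # $1=host $2=GPU $3=network interface
  echo "worker $rank on $1: CUDA_VISIBLE_DEVICES=$2 GLOO_SOCKET_IFNAME=$3 $PY $MAIN --init-method $INIT --backend=gloo --rank=$rank --world-size=4 --dataset fake --cuda"
  rank=$((rank+1))
done
```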


## Running with TensorHive

Because running the distributed training in our scenario means
logging into multiple nodes, configuring environments and running processes
with multiple similar parameters that differ only slightly, it is a good
use case for the TensorHive `task execution` module.

To use it, first head to `Task Overview` and click on `CREATE TASKS FROM TEMPLATE`. Choose PyTorch from the drop-down list:

![choose_template](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/choose_template.png)

Fill in the PyTorch process template with your specific Python command, command-line
parameters and environment variables.
You don't need to fill in the `rank` or `world-size` parameters, as TensorHive will set them automatically for you:

![parameters](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/parameters.png)

Using the `ADD TASK` button, add as many tasks as there are resources you wish the code to run on. Every parameter you have filled in is copied to the newly created tasks to save time. Adjust hostnames and resources on the added tasks as needed.

![full_conf](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/full_conf.png)


Click the `CREATE ALL TASKS` button in the bottom right corner to create the tasks.
Then select them in the process table and use the `Spawn selected tasks` button
to run them on the appropriate nodes:

![running](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/running.png)

After that, the tasks can be controlled from `Task Overview`.
The following actions are currently available:
- Schedule (Choose when to run the task)
- Spawn (Run the task now)
- Terminate (Send terminate command to the task)
- Kill (Send kill command)
- Show log (Read output of the task)
- Edit
- Remove
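The difference between Terminate and Kill presumably mirrors the standard UNIX signals; the sketch below illustrates that semantics under this assumption (it is not TensorHive's own code):

```shell
# SIGTERM (Terminate) asks a process to exit and lets it clean up;
# SIGKILL (Kill) stops it immediately and cannot be trapped.
sleep 60 &
PID=$!
kill -TERM "$PID"            # polite request: signal handlers may run first
wait "$PID" 2>/dev/null || true   # reap the terminated process
# Check whether the process is still alive:
kill -0 "$PID" 2>/dev/null && ALIVE=yes || ALIVE=no
```

A process that ignores SIGTERM would survive this and require Kill (SIGKILL) instead.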

Having such high-level control over all of the tasks from a single place can be extremely time-saving!
Binary file added examples/PyTorch/img/choose_template.png
Binary file added examples/PyTorch/img/full_conf.png
Binary file added examples/PyTorch/img/parameters.png
Binary file added examples/PyTorch/img/running.png
25 changes: 10 additions & 15 deletions examples/README.md
@@ -1,16 +1,11 @@
# Examples
This directory contains usage examples of the TensorHive `Task execution`
module for chosen DNN training scenarios:

Directory | Description
--- | ---
TF_CONFIG | Using the default cluster configuration method in TensorFlowV2 - the TF_CONFIG environment variable.
TensorFlow_ClusterSpec | Using the standard ClusterSpec parameters, often used in TensorFlowV1 implementations.
PyTorch | Using standard parameters used in PyTorch implementations.
deepspeech | Redirection to the DeepSpeech test application, kept for link maintenance.
t2t_transformer | Redirection to the T2T Transformer test application, kept for link maintenance.
30 changes: 16 additions & 14 deletions examples/TF_CONFIG/README.md
@@ -1,9 +1,9 @@
# Running distributed trainings through TensorHive using TF_CONFIG

This example shows how to use the TensorHive `task execution` module to
conveniently configure and execute distributed trainings using
the TF_CONFIG environment variable.
[This MSG-GAN training application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV2_MSG-GAN_Fashion-MNIST)
was used for the example.

## Running the training without TensorHive
@@ -31,7 +31,9 @@ TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "wo
**Other environment variables**

Depending on the environment, some other environment variables may have to be configured.
For example, on multi-GPU nodes, setting a proper value of CUDA_VISIBLE_DEVICES is useful
to prevent the process from needlessly utilizing GPU memory. In this example,
because the utilized TensorFlow compilation uses a custom MPI library, the LD_LIBRARY_PATH environment
variable has to be set for each process to /usr/mpi/gcc/openmpi-4.0.0rc5/lib/.
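As a sketch, the full per-process environment for worker 0 on gl01 combines these variables (values taken from this example; the CUDA_VISIBLE_DEVICES value is an assumption for a single-GPU reservation — adjust it, the interfaces and the paths for your cluster):

```shell
# Environment for worker 0 on gl01 (worker 1 on gl02 would use "index": 1).
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
export CUDA_VISIBLE_DEVICES=0   # expose only the GPU reserved for this process (assumed value)
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
```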

**Choosing the appropriate Python version**
@@ -46,7 +48,7 @@ is used, so the python binary has to be defined as follows:

**Summary**

Finally, the full commands required to launch the training in our exemplary environment will be as follows:

gl01:

@@ -67,12 +69,12 @@ export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'

## Running the training with TensorHive

The TensorHive `task execution` module allows convenient orchestration of distributed trainings.
It is available in the Tasks Overview view. The `CREATE TASKS FROM TEMPLATE` button allows you to
conveniently configure tasks supporting a specific framework or distribution method. In this
example we choose the TensorFlow - TF_CONFIG template and click `GO TO TASK CREATOR`:

![choose_template](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/choose_template.png)

In the task creator, we set the Command to
```
…
```

@@ -82,7 +84,7 @@ In order to add the LD_LIBRARY_PATH environment variable, we enter the parameter name,
In order to add the LD_LIBRARY_PATH environment variable, we enter the parameter name,
select Static (the same value for all processes) and click `ADD AS ENV VARIABLE TO ALL TASKS`:

![env_var](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/env_var.png)

Then, set the appropriate value of the environment variable (`/usr/mpi/gcc/openmpi-4.0.0rc5/lib/`).

@@ -93,20 +95,20 @@ to specify batch size, we enter parameter name --batch_size, again select Static
Select the required hostname and resource (CPU/GPU_N) for the specified training process. The resultant
command that will be executed by TensorHive on the selected node will be displayed above the process specification:

![single_process](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/single_process.png)

Note that the TF_CONFIG and CUDA_VISIBLE_DEVICES variables are configured automatically. Now, use
the `ADD TASK` button to duplicate the processes and modify the required target hosts to create
your training processes. For example, this screenshot shows the configuration for training on 4
hosts: gl01, gl02, gl03, gl04:

![multi_process](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/multi_process.png)

After clicking the `CREATE ALL TASKS` button, the processes will be available on the process list for future actions.
To run the processes, select them and use the `Spawn selected tasks` button. If TensorHive is configured properly,
the task status should change to `running`:

![running](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/running.png)

Note that the appropriate process PID will be displayed in the `pid` column. The Task overview can
be used to schedule, spawn, stop, kill and edit the tasks, and to see logs from their execution.