
Merge pull request #244 from roscisz/develop
r0.3.3
roscisz authored Mar 10, 2020
2 parents a100701 + a08951d commit 9ba4349
Showing 18 changed files with 286 additions and 57 deletions.
48 changes: 25 additions & 23 deletions README.md
@@ -1,7 +1,7 @@
TensorHive
===
![](https://img.shields.io/badge/release-v0.3.3-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/pypi-v0.3.3-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square)
![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square)
![](https://img.shields.io/badge/hardware-Nvidia-green.svg?style=popout-square)
@@ -90,7 +90,7 @@
```
tensorhive test
```

(optional) If you want to allow your UNIX users to set up their TensorHive accounts on their own and run distributed
programs through `Task execution` plugin, use the `key` command to generate the SSH key for TensorHive:
```
tensorhive key
```
@@ -135,15 +135,22 @@ Terminal warning | Email warning

![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/admin_warning_screenshot.png)

#### Task execution

Thanks to the `Task execution` module, you can define commands for tasks you want to run on any configured nodes.
You can manage them manually or set spawn/terminate date.
Commands are run within a `screen` session, so attaching to it while they are running is a piece of cake.
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/task_nursery_screenshot1.png)

It provides a simple but flexible (**framework-agnostic**) command templating mechanism that will help you automate multi-node trainings.
Additionally, specialized templates help you conveniently set the proper parameters for chosen well-known frameworks:

![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/multi_process.png)

In the [examples](https://github.com/roscisz/TensorHive/tree/master/examples)
directory, you will find sample scenarios of using the `Task execution` module for various
frameworks and computing environments.

Users who want to use this feature must append TensorHive's public key to their `~/.ssh/authorized_keys` on all nodes they want to connect to.
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/task_nursery_screenshot2.png)
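Appending the key can be sketched as follows. This is a minimal, idempotent helper, not TensorHive's own tooling; the key path and contents are hypothetical examples (`tensorhive key` prints the real location):

```shell
# Append TensorHive's public key to a user's authorized_keys, idempotently.
# home_dir and key are placeholders; the real key comes from `tensorhive key`.
append_key() {
  home_dir="$1"; key="$2"
  auth="$home_dir/.ssh/authorized_keys"
  mkdir -p "$home_dir/.ssh"
  touch "$auth"
  # Add the key only if that exact line is not already present.
  grep -qxF "$key" "$auth" || printf '%s\n' "$key" >> "$auth"
  chmod 700 "$home_dir/.ssh" && chmod 600 "$auth"
}
```

On a remote node the same effect can be achieved with `ssh-copy-id` or by piping the key file through `ssh`.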

Features
----------------------
@@ -156,7 +163,7 @@ Features
- [x] :warning: Send warning messages to terminal of users who violate the rules
- [x] :mailbox_with_no_mail: Send e-mail warnings
- [ ] :bomb: Kill unwanted processes
- [X] :rocket: Task execution and scheduling
- [x] :old_key: Execute any command in the name of a user
- [x] :alarm_clock: Schedule spawn and termination
- [x] :repeat: Synchronize process status
@@ -178,7 +185,7 @@ Features
- [x] Edit reservations
- [x] Cancel reservations
- [x] Attach jobs to reservation
- [x] :baby_symbol: Task execution
- [x] Create parametrized tasks and assign to hosts, automatically set `CUDA_VISIBLE_DEVICES`
- [x] Buttons for task spawning/scheduling/termination/killing actions
- [x] Fetch log produced by running task
@@ -204,16 +211,11 @@ TensorHive is currently being used in production in the following environments:

| Organization | Hardware | No. users |
| ------ | -------- | --------- |
| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100) + NVIDIA DGX-1 (8x Tesla V100) | 30+ |
| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 20 machines with GTX 1060 each | 20+ |
| <img src="http://gradient.eti.pg.gda.pl/assets/logo.png" width=15>[Gradient PG](http://gradient.eti.pg.gda.pl/en/) | A server with two GPUs shared by the Gradient science club at GUT. | 30+ |
| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - Conversational Intelligence](https://www.voicelab.ai) | 30+ GTX and RTX GPUs | 10+ |

<hr/>

TensorHive architecture (simplified)
-----------------------
@@ -223,13 +225,13 @@ This diagram will help you to grasp the rough concept of the system.
![TensorHive_diagram _final](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/architecture.png)


Contribution and feedback
------------------------
We'd :heart: to collect your observations, issues and pull requests!

Feel free to **report any configuration problems, we will help you**.

Currently we are working on user groups for differentiated GPU access control,
grouping tasks into jobs, and a process-killing reservation violation handler
(deadline: July 2020 :shipit:), so stay tuned!

@@ -246,10 +248,10 @@ for parallelization of neural network training using multiple GPUs".

Project created and maintained by:
- Paweł Rościszewski [(@roscisz)](https://github.com/roscisz)
- ![](https://avatars2.githubusercontent.com/u/12485656?s=22&v=4) [Michał Martyniak (@micmarty)](https://micmarty.github.io)
- Filip Schodowski [(@filschod)](https://github.com/filschod)

Top contributors:
- Tomasz Menet [(@tomenet)](https://github.com/tomenet)
- Dariusz Piotrowski [(@PiotrowskiD)](https://github.com/PiotrowskiD)
- Karol Draszawka [(@szarakawka)](https://github.com/szarakawka)
Expand Down
95 changes: 95 additions & 0 deletions examples/PyTorch/README.md
@@ -0,0 +1,95 @@
# Using TensorHive for running distributed trainings in PyTorch

## Detailed example description

In this example we show how the TensorHive `task execution` module can be
used for convenient configuration and execution of distributed trainings
implemented in PyTorch. For this purpose, we run
[this PyTorch DCGAN sample application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/PyTorch_dcgan_lsun/README.md)
in a distributed setup consisting of an NVIDIA DGX Station server `ai` and an NVIDIA DGX-1 server `dl`,
equipped with 4 and 8 NVIDIA Tesla V100 GPUs respectively.

In the presented scenario, the servers were shared by a group of users through TensorHive,
and at the moment we were granted reservations for GPUs 1 and 2 on `ai` and GPUs 1 and 7 on `dl`.
The Python environment and training code were available on both nodes,
and a fake training dataset was used.


## Running without TensorHive

In order to enable networking, we had to set the `GLOO_SOCKET_IFNAME`
environment variable to the proper network interface name on each node.
We selected TCP port 20011 for communication.

For our 4-GPU scenario, the following 4 processes had to be executed,
setting consecutive `rank` parameters starting from 0 and the `world-size`
parameter to 4:

worker 0 on `ai`:
```
export CUDA_VISIBLE_DEVICES=1
export GLOO_SOCKET_IFNAME=enp2s0f1
./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=0 --world-size=4 --dataset fake --cuda
```

worker 1 on `ai`:
```
export CUDA_VISIBLE_DEVICES=2
export GLOO_SOCKET_IFNAME=enp2s0f1
./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=1 --world-size=4 --dataset fake --cuda
```

worker 2 on `dl`:
```
export CUDA_VISIBLE_DEVICES=1
export GLOO_SOCKET_IFNAME=enp1s0f0
./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=2 --world-size=4 --dataset fake --cuda
```

worker 3 on `dl`:
```
export CUDA_VISIBLE_DEVICES=7
export GLOO_SOCKET_IFNAME=enp1s0f0
./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=3 --world-size=4 --dataset fake --cuda
```
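The four command lines above differ only in host, GPU, network interface and rank. As a sketch, they can be generated from a small table (paths, hostnames and the rendezvous address mirror this example; adapt them to your cluster):

```shell
#!/bin/sh
# Generate the four per-worker command lines from a host/GPU/interface table.
# Values mirror the example setup above; this only prints the commands.
PY=./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python
MAIN=dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py
INIT=tcp://ai.eti.pg.gda.pl:20011
rank=0
for spec in "ai 1 enp2s0f1" "ai 2 enp2s0f1" "dl 1 enp1s0f0" "dl 7 enp1s0f0"; do
  set -- $spec  # $1=host $2=GPU $3=network interface
  echo "worker $rank on $1: CUDA_VISIBLE_DEVICES=$2 GLOO_SOCKET_IFNAME=$3 $PY $MAIN --init-method $INIT --backend=gloo --rank=$rank --world-size=4 --dataset fake --cuda"
  rank=$((rank+1))
done
```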


## Running with TensorHive

Because running the distributed training in our scenario means
logging into multiple nodes, configuring environments and running processes
with multiple similar parameters that differ only slightly, it is a good
use case for the TensorHive `task execution` module.

To use it, first head to `Task Overview` and click on `CREATE TASKS FROM TEMPLATE`. Choose PyTorch from the drop-down list:

![choose_template](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/choose_template.png)

Fill in the PyTorch process template with your specific Python command, command-line
parameters and environment variables.
You don't need to fill in the `rank` or `world-size` parameters, as TensorHive will set them automatically for you:

![parameters](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/parameters.png)

Using the `ADD TASK` button, add as many tasks as there are resources you wish the code to run on. Every parameter you have filled in is copied to the newly created tasks to save time. Adjust hostnames and resources on the added tasks as needed.

![full_conf](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/full_conf.png)


Click the `CREATE ALL TASKS` button in the bottom right corner to create the tasks.
Then select them in the process table and use the `Spawn selected tasks` button
to run them on the appropriate nodes:

![running](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/running.png)

After that, the tasks can be controlled from `Task Overview`.
The following actions are currently available:
- Schedule (Choose when to run the task)
- Spawn (Run the task now)
- Terminate (Send terminate command to the task)
- Kill (Send kill command)
- Show log (Read output of the task)
- Edit
- Remove
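The difference between Terminate and Kill presumably mirrors the standard UNIX signals; the sketch below illustrates that semantics under this assumption (it is not TensorHive's own code):

```shell
# SIGTERM (Terminate) asks a process to exit and lets it clean up;
# SIGKILL (Kill) stops it immediately and cannot be trapped.
sleep 60 &
PID=$!
kill -TERM "$PID"            # polite request: signal handlers may run first
wait "$PID" 2>/dev/null || true   # reap the terminated process
# Check whether the process is still alive:
kill -0 "$PID" 2>/dev/null && ALIVE=yes || ALIVE=no
```

A process that ignores SIGTERM would survive this and require Kill (SIGKILL) instead.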

Having such high-level control over all of the tasks from a single place can be extremely time-saving!
Binary file added examples/PyTorch/img/choose_template.png
Binary file added examples/PyTorch/img/full_conf.png
Binary file added examples/PyTorch/img/parameters.png
Binary file added examples/PyTorch/img/running.png
25 changes: 10 additions & 15 deletions examples/README.md
@@ -1,16 +1,11 @@
# Examples
This directory contains usage examples of the TensorHive `Task execution`
module for chosen DNN training scenarios:

Directory | Description
--- | ---
TF_CONFIG | Using the default cluster configuration method in TensorFlowV2 - the TF_CONFIG environment variable.
TensorFlow_ClusterSpec | Using the standard ClusterSpec parameters, often used in TensorFlowV1 implementations.
PyTorch | Using standard parameters used in PyTorch implementations.
deepspeech | Redirection to the DeepSpeech test application, kept for link maintenance.
t2t_transformer | Redirection to the T2T Transformer test application, kept for link maintenance.
30 changes: 16 additions & 14 deletions examples/TF_CONFIG/README.md
@@ -1,9 +1,9 @@
# Running distributed trainings through TensorHive using TF_CONFIG

This example shows how to use the TensorHive `task execution` module to
conveniently configure and execute distributed trainings using
the TF_CONFIG environment variable.
[This MSG-GAN training application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV2_MSG-GAN_Fashion-MNIST)
was used for the example.

## Running the training without TensorHive
@@ -31,7 +31,9 @@ TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "wo
**Other environment variables**

Depending on the environment, some other environment variables may have to be configured.
For example, on multi-GPU nodes, setting a proper value of CUDA_VISIBLE_DEVICES is useful
to prevent the process from needlessly utilizing GPU memory. In this example,
because the utilized TensorFlow compilation uses a custom MPI library, the LD_LIBRARY_PATH environment
variable has to be set for each process to /usr/mpi/gcc/openmpi-4.0.0rc5/lib/.
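As a sketch, the full per-process environment for worker 0 on gl01 combines these variables (values taken from this example; the CUDA_VISIBLE_DEVICES value is an assumption for a single-GPU reservation — adjust it, the interfaces and the paths for your cluster):

```shell
# Environment for worker 0 on gl01 (worker 1 on gl02 would use "index": 1).
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
export CUDA_VISIBLE_DEVICES=0   # expose only the GPU reserved for this process (assumed value)
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
```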

**Choosing the appropriate Python version**
@@ -46,7 +48,7 @@ is used, so the python binary has to be defined as follows:

**Summary**

Finally, the full commands required to launch the training in our exemplary environment will be as follows:

gl01:

@@ -67,12 +69,12 @@ export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'

## Running the training with TensorHive

The TensorHive `task execution` module allows convenient orchestration of distributed trainings.
It is available in the Tasks Overview view. The `CREATE TASKS FROM TEMPLATE` button allows you to
conveniently configure tasks supporting a specific framework or distribution method. In this
example we choose the TensorFlow - TF_CONFIG template and click `GO TO TASK CREATOR`:

![choose_template](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/choose_template.png)

In the task creator, we set the Command to
```
…
```

@@ -82,7 +84,7 @@ In order to add the LD_LIBRARY_PATH environment variable, we enter the parameter name,
In order to add the LD_LIBRARY_PATH environment variable, we enter the parameter name,
select Static (the same value for all processes) and click `ADD AS ENV VARIABLE TO ALL TASKS`:

![env_var](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/env_var.png)

Then, set the appropriate value of the environment variable (`/usr/mpi/gcc/openmpi-4.0.0rc5/lib/`).

@@ -93,20 +95,20 @@ to specify batch size, we enter parameter name --batch_size, again select Static
Select the required hostname and resource (CPU/GPU_N) for the specified training process. The resultant
command that will be executed by TensorHive on the selected node will be displayed above the process specification:

![single_process](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/single_process.png)

Note that the TF_CONFIG and CUDA_VISIBLE_DEVICES variables are configured automatically. Now, use
the `ADD TASK` button to duplicate the processes and modify the required target hosts to create
your training processes. For example, this screenshot shows the configuration for training on 4
hosts: gl01, gl02, gl03, gl04:

![multi_process](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/multi_process.png)

After clicking the `CREATE ALL TASKS` button, the processes will be available on the process list for future actions.
To run the processes, select them and use the `Spawn selected tasks` button. If TensorHive is configured properly,
the task status should change to `running`:

![running](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/running.png)

Note that the appropriate process PID will be displayed in the `pid` column. The Task overview can
be used to schedule, spawn, stop, kill and edit the tasks, and to see logs from their execution.