diff --git a/README.md b/README.md index 32303b4a..b2b018c9 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ TensorHive === -![](https://img.shields.io/badge/release-v0.3.2-brightgreen.svg?style=popout-square) -![](https://img.shields.io/badge/pypi-v0.3.2-brightgreen.svg?style=popout-square) +![](https://img.shields.io/badge/release-v0.3.3-brightgreen.svg?style=popout-square) +![](https://img.shields.io/badge/pypi-v0.3.3-brightgreen.svg?style=popout-square) ![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square) ![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square) ![](https://img.shields.io/badge/hardware-Nvidia-green.svg?style=popout-square) @@ -90,7 +90,7 @@ tensorhive test ``` (optional) If you want to allow your UNIX users to set up their TensorHive accounts on their own and run distributed -programs through `Task nursery` plugin, use the `key` command to generate the SSH key for TensorHive: +programs through `Task execution` plugin, use the `key` command to generate the SSH key for TensorHive: ``` tensorhive key ``` @@ -135,15 +135,22 @@ Terminal warning | Email warning ![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/admin_warning_screenshot.png) -#### Task nursery +#### Task execution -Here you can define commands for tasks you want to run on any configured nodes. You can manage them manually or set spawn/terminate date. +Thanks to the `Task execution` module, you can define commands for tasks you want to run on any configured nodes. +You can manage them manually or set spawn/terminate date. Commands are run within `screen` session, so attaching to it while they are running is a piece of cake. -![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/task_nursery_screenshot1.png) -It provides quite simple, but flexible (**framework-agnostic**) command templating mechanism that will help you automate multi-node trainings. 
+It provides a simple, but flexible (**framework-agnostic**) command templating mechanism that will help you automate multi-node trainings. +Additionally, specialized templates help to conveniently set proper parameters for chosen well known frameworks: + +![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/multi_process.png) + +In the [examples](https://github.com/roscisz/TensorHive/tree/master/examples) +directory, you will find sample scenarios of using the `Task execution` module for various +frameworks and computing environments. + TensorHive requires that users who want to use this feature must append TensorHive's public key to their `~/.ssh/authorized_keys` on all nodes they want to connect to. -![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/task_nursery_screenshot2.png) Features ---------------------- @@ -156,7 +163,7 @@ Features - [x] :warning: Send warning messages to terminal of users who violate the rules - [x] :mailbox_with_no_mail: Send e-mail warnings - [ ] :bomb: Kill unwanted processes -- [X] :rocket: Task nursery and scheduling +- [X] :rocket: Task execution and scheduling - [x] :old_key: Execute any command in the name of a user - [x] :alarm_clock: Schedule spawn and termination - [x] :repeat: Synchronize process status @@ -178,7 +185,7 @@ Features - [x] Edit reservations - [x] Cancel reservations - [x] Attach jobs to reservation -- [x] :baby_symbol: Task nursery +- [x] :baby_symbol: Task execution - [x] Create parametrized tasks and assign to hosts, automatically set `CUDA_VISIBLE_DEVICES` - [x] Buttons for task spawning/scheduling/termination/killing actions - [x] Fetch log produced by running task @@ -204,16 +211,11 @@ TensorHive is currently being used in production in the following environments: | Organization | Hardware | No. 
users | | ------ | -------- | --------- | -| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | 30+ | -| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | 20+ | -| ![](http://martyniak.tech/images/gradient_logo_small-628ed211.png)[Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | 10+ | -| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - Conversational Intelligence](https://www.voicelab.ai) | 30+ GTX and RTX cards | 10+ - -Application examples and benchmarks --------- -Along with TensorHive, we are developing a set of [**sample deep neural network training applications**](https://github.com/roscisz/TensorHive/tree/master/examples) in Distributed TensorFlow which will be used as test applications for the system. They can also serve as benchmarks for single GPU, distributed multi-GPU and distributed multi-node architectures. For each example, a full set of instructions to reproduce is provided. +| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100) + NVIDIA DGX-1 (8x Tesla V100) | 30+ | +| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 20 machines with GTX 1060 each | 20+ | +| [Gradient PG](http://gradient.eti.pg.gda.pl/en/) | A server with two GPUs shared by the Gradient science club at GUT. 
| 30+ | +| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - Conversational Intelligence](https://www.voicelab.ai) | 30+ GTX and RTX GPUs | 10+ | -
TensorHive architecture (simplified) ----------------------- @@ -223,13 +225,13 @@ This diagram will help you to grasp the rough concept of the system. ![TensorHive_diagram _final](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/architecture.png) -Contibution and feedback +Contribution and feedback ------------------------ We'd :heart: to collect your observations, issues and pull requests! Feel free to **report any configuration problems, we will help you**. -We are working on user groups for differentiated GPU access control, +Currently we are working on user groups for differentiated GPU access control, grouping tasks into jobs and process-killing reservation violation handler, deadline - July 2020 :shipit:, so stay tuned! @@ -246,10 +248,10 @@ for parallelization of neural network training using multiple GPUs". Project created and maintained by: - Paweł Rościszewski [(@roscisz)](https://github.com/roscisz) -- ![](https://avatars2.githubusercontent.com/u/12485656?s=22&v=4) [Michał Martyniak (@micmarty)](http://martyniak.me) +- ![](https://avatars2.githubusercontent.com/u/12485656?s=22&v=4) [Michał Martyniak (@micmarty)](https://micmarty.github.io) - Filip Schodowski [(@filschod)](https://github.com/filschod) - Recent contributions: + Top contributors: - Tomasz Menet [(@tomenet)](https://github.com/tomenet) - Dariusz Piotrowski [(@PiotrowskiD)](https://github.com/PiotrowskiD) - Karol Draszawka [(@szarakawka)](https://github.com/szarakawka) diff --git a/examples/PyTorch/README.md b/examples/PyTorch/README.md new file mode 100644 index 00000000..2b216e89 --- /dev/null +++ b/examples/PyTorch/README.md @@ -0,0 +1,95 @@ +# Using TensorHive for running distributed trainings in PyTorch + +## Detailed example description + +In this example we show how the TensorHive `task execution` module can be +used for convenient configuration and execution of distributed trainings +implemented in PyTorch. 
For this purpose, we run +[this PyTorch DCGAN sample application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/PyTorch_dcgan_lsun/README.md) +in a distributed setup consisting of an NVIDIA DGX Station server `ai` and an NVIDIA DGX-1 server `dl`, +equipped with 4 and 8 NVIDIA Tesla V100 GPUs respectively. + +In the presented scenario, the servers were shared by a group of users using TensorHive +and at that moment we had been granted reservations for GPUs 1 and 2 on `ai` and GPUs 1 and 7 on `dl`. +The Python environment and training code were available on both nodes and +a fake training dataset was used. + + +## Running without TensorHive + +In order to enable networking, we had to set the `GLOO_SOCKET_IFNAME` +environment variable to the proper network interface name on each node. +We selected the 20011 TCP port for communication. + +For our 4 GPU scenario, the following 4 processes had to be executed, +with consecutive `rank` parameters starting from 0 and the `world-size` +parameter set to 4: + +worker 0 on `ai`: +``` +export CUDA_VISIBLE_DEVICES=1 +export GLOO_SOCKET_IFNAME=enp2s0f1 +./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=0 --world-size=4 --dataset fake --cuda +``` + +worker 1 on `ai`: +``` +export CUDA_VISIBLE_DEVICES=2 +export GLOO_SOCKET_IFNAME=enp2s0f1 +./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=1 --world-size=4 --dataset fake --cuda +``` + +worker 2 on `dl`: +``` +export CUDA_VISIBLE_DEVICES=1 +export GLOO_SOCKET_IFNAME=enp1s0f0 +./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method 
tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=2 --world-size=4 --dataset fake --cuda +``` + +worker 3 on `dl`: +``` +export CUDA_VISIBLE_DEVICES=7 +export GLOO_SOCKET_IFNAME=enp1s0f0 +./dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/venv/bin/python dnn_training_benchmarks/PyTorch_dcgan_lsun/examples/dcgan/main.py --init-method tcp://ai.eti.pg.gda.pl:20011 --backend=gloo --rank=3 --world-size=4 --dataset fake --cuda +``` + + +## Running with TensorHive + +Because running the distributed training in our scenario requires +logging into multiple nodes, configuring environments and running processes +with many similar parameters that differ only slightly, it is a good +use case for the TensorHive `task execution` module. + +To use it, first head to `Task Overview` and click on `CREATE TASKS FROM TEMPLATE`. Choose PyTorch from the drop-down list: + +![choose_template](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/choose_template.png) + +Fill in the PyTorch process template with your specific python command, command-line +parameters and environment variables. +You don't need to fill in rank or world-size parameters as TensorHive will do that automatically for you: + +![parameters](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/parameters.png) + +Add as many tasks as resources you wish the code to run on using the `ADD TASK` button. You can see that every parameter filled is copied to newly created tasks to save time. Adjust hostnames and resources on the added tasks as needed. + +![full_conf](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/full_conf.png) + + +Click the `CREATE ALL TASKS` button in the bottom right corner to create the tasks. 
+Then, select them in the process table and use the `Spawn selected tasks` button, +to run them on the appropriate nodes: + +![running](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/PyTorch/img/running.png) + +After that, the tasks can be controlled from `Task Overview`. +The following actions are currently available: +- Schedule (Choose when to run the task) +- Spawn (Run the task now) +- Terminate (Send terminate command to the task) +- Kill (Send kill command) +- Show log (Read output of the task) +- Edit +- Remove + +Having that high level control over all of the tasks from a single place can be extremely time-saving! diff --git a/examples/PyTorch/img/choose_template.png b/examples/PyTorch/img/choose_template.png new file mode 100644 index 00000000..a863218b Binary files /dev/null and b/examples/PyTorch/img/choose_template.png differ diff --git a/examples/PyTorch/img/full_conf.png b/examples/PyTorch/img/full_conf.png new file mode 100644 index 00000000..9f417c0e Binary files /dev/null and b/examples/PyTorch/img/full_conf.png differ diff --git a/examples/PyTorch/img/parameters.png b/examples/PyTorch/img/parameters.png new file mode 100644 index 00000000..788d0dbb Binary files /dev/null and b/examples/PyTorch/img/parameters.png differ diff --git a/examples/PyTorch/img/running.png b/examples/PyTorch/img/running.png new file mode 100644 index 00000000..c463fa37 Binary files /dev/null and b/examples/PyTorch/img/running.png differ diff --git a/examples/README.md b/examples/README.md index e45e8d6f..96ea3ea6 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,16 +1,11 @@ # Examples -This directory contains examples of deep neural network training applications that serve as -requirement providers towards TensorHive. We are testing their performance and ease of use depending -on computing resource management software used. 
This allows us to learn about various features of -existing resource management systems and base our design decisions on the experiences with approaches -such as native application execution, application-specific scripts, Docker and -[Kubernetes](https://gist.github.com/PiotrowskiD/1e3f659a8ac7db1c2ca02ba0ae5fcfaf). - -The applications can be also useful for others as benchmarks. Our benchmark results, useful resources -and more info about configuration and running examples can be found in corresponding folders. -We plan to add TensorHive usage examples to the individual directories when distributed training deployment -is supported by TensorHive - - - - +This directory contains usage examples of the TensorHive task execution +module for chosen DNN training scenarios: + +Directory | Description +--- | --- +TF_CONFIG | Using the default cluster configuration method in TensorFlowV2 - the TF_CONFIG environment variable. +TensorFlow_ClusterSpec | Using the standard ClusterSpec parameters, often used in TensorFlowV1 implementations. +PyTorch | Using standard parameters used in PyTorch implementations. +deepspeech | Redirection to the DeepSpeech test application, kept for link maintenance. +t2t_transformer | Redirection to the T2T Transformer test application, kept for link maintenance. \ No newline at end of file diff --git a/examples/TF_CONFIG/README.md b/examples/TF_CONFIG/README.md index f640d6cc..092f8e64 100644 --- a/examples/TF_CONFIG/README.md +++ b/examples/TF_CONFIG/README.md @@ -1,9 +1,9 @@ -# Using TensorHive for running distributed trainings using TF_CONFIG +# Running distributed trainings through TensorHive using TF_CONFIG -This example shows how to use the TensorHive `task nursery` module to -conveniently orchestrate distributed trainings configured using -the TF_CONFIG environment variable. 
This -[MSG-GAN training application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV2_MSG-GAN_Fashion-MNIST) +This example shows how to use the TensorHive `task execution` module to +conveniently configure and execute distributed trainings using +the TF_CONFIG environment variable. +[This MSG-GAN training application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV2_MSG-GAN_Fashion-MNIST) was used for the example. ## Running the training without TensorHive @@ -31,7 +31,9 @@ TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "wo **Other environment variables** Depending on the environment, some other environment variables may have to be configured. -For example, because our TensorFlow compilation uses a custom MPI library, the LD_LIBRARY_PATH environment +For example, in multi-GPU nodes, setting proper value of the CUDA_VISIBLE_DEVICES is useful +to prevent the process from needlessly utilizing GPU memory. In this example, +because the utilized TensorFlow compilation uses a custom MPI library, the LD_LIBRARY_PATH environment variable has to be set for each process to /usr/mpi/gcc/openmpi-4.0.0rc5/lib/. **Choosing the appropriate Python version** @@ -46,7 +48,7 @@ is used, so the python binary has to be defined as follows: **Summary** -Finally, full commands required to start in the exemplary setup our environment, will be as follows: +Finally, full commands required to launch the training in our exemplary environment will be as follows: gl01: @@ -67,12 +69,12 @@ export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/' ## Running the training with TensorHive -The TensorHive `task nursery` module allows convenient orchestration of distributed trainings. +The TensorHive `task execution` module allows convenient orchestration of distributed trainings. It is available in the Tasks Overview view. 
The `CREATE TASKS FROM TEMPLATE` button allows to conveniently configure tasks supporting a specific framework or distribution method. In this example we choose the Tensorflow - TF_CONFIG template, and click `GO TO TASK CREATOR`: -![choose_template](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/choose_template.png) +![choose_template](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/choose_template.png) In the task creator, we set the Command to ``` @@ -82,7 +84,7 @@ In the task creator, we set the Command to In order to add the LD_LIBRARY_PATH environment variable, we enter the parameter name, select Static (the same value for all processes) and click `ADD AS ENV VARIABLE TO ALL TASKS`: -![env_var](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/env_var.png) +![env_var](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/env_var.png) Then, set the appropriate value of the environment variable (/usr/mpi/gcc/openmpi-4.0.0rc5/lib/). @@ -93,20 +95,20 @@ to specify batch size, we enter parameter name --batch_size, again select Static Select the required hostname and resource (CPU/GPU_N) for the specified training process. The resultant command that will be executed by TensorHive on the selected node will be displayed above the process specification: -![single_process](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/single_process.png) +![single_process](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/single_process.png) Note that the TF_CONFIG and CUDA_VISIBLE_DEVICES variables are configured automatically. Now, use the `ADD TASK` button to duplicate the processes and modify the required target hosts to create your training processes. 
For example, this screenshot shows the configuration for training on 4 hosts: gl01, gl02, gl03, gl04: -![multi_process](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/multi_process.png) +![multi_process](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/multi_process.png) -After clicking the `CREATE ALL TASKS` button, the processes will be available in the process list for future actions. +After clicking the `CREATE ALL TASKS` button, the processes will be available on the process list for future actions. To run the processes, select them and use the `Spawn selected tasks` button. If TensorHive is configured properly, the task status should change to `running`: -![running](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/multi_process.png) +![running](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TF_CONFIG/img/running.png) Note, that the appropriate process PID will be displayed in the `pid` column. Task overview can be used to schedule, spawn, stop, kill, edit the tasks, and see logs from their execution. diff --git a/examples/TensorFlow_ClusterSpec/README.md b/examples/TensorFlow_ClusterSpec/README.md new file mode 100644 index 00000000..863cda66 --- /dev/null +++ b/examples/TensorFlow_ClusterSpec/README.md @@ -0,0 +1,133 @@ +# Running distributed trainings through TensorHive using ClusterSpec + +This example shows how to use the TensorHive `task execution` module to +conveniently configure and execute distributed trainings configured using +standard TensorFlowV1 parameters for the [ClusterSpec](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec) +cluster configuration class. +[This DeepSpeech training application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV1_DeepSpeech_ldc93s1) +was used for the example. 
+ +## Running the training without TensorHive + +In order to run the training manually, a separate worker process `python DeepSpeech.py` +on each node and a separate parameter server process have to be run +with the appropriate parameter values set as follows: + +**Application-specific parameters** + +The training application used in this example requires specifying train, dev and +test data sets through the `train_files`, `dev_files` and `test_files` parameters. + +**ClusterSpec parameters** + +The `ps_hosts`, `worker_hosts`, `job_name` and `task_index` parameters +have to be appropriately configured depending +on the set of nodes taking part in the computations. +For example, a training on two nodes 172.17.0.3 and 172.17.0.4 would require the following +parameter values: + +worker on 172.17.0.3: +```bash +--ps_hosts=172.17.0.3:2224 +--worker_hosts=172.17.0.3:2222,172.17.0.4:2223 +--job_name=worker +--task_index=0 +``` + +worker on 172.17.0.4: +```bash +--ps_hosts=172.17.0.3:2224 +--worker_hosts=172.17.0.3:2222,172.17.0.4:2223 +--job_name=worker +--task_index=1 +``` +parameter server on 172.17.0.3: +```bash +--ps_hosts=172.17.0.3:2224 +--worker_hosts=172.17.0.3:2222,172.17.0.4:2223 +--job_name=ps +--task_index=0 +``` + +**Other environment variables** + +Depending on the environment, some other environment variables may have to be configured. +For example, in multi-GPU nodes, setting proper value of the CUDA_VISIBLE_DEVICES is useful +to prevent the process from needlessly utilizing GPU memory. +In this example, because the Mozilla DeepSpeech native client libraries are used, the +LD_LIBRARY_PATH environment variable has to be set for each process to `native_client/`. + +**Choosing the appropriate Python version** + +In some cases, a specific Python binary has to be used for the training. 
+For example, in our environment, a python binary from a virtual environment +is used, so the python binary has to be defined as follows: + +``` +./venv/bin/python +``` + +**Summary** + +Finally, full commands required to launch the training in our exemplary environment will be as follows: + +worker on 172.17.0.3: +```bash +export LD_LIBRARY_PATH=native_client/ +./venv/bin/python ./DeepSpeech.py --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv --ps_hosts=172.17.0.3:2224 --worker_hosts=172.17.0.3:2222,172.17.0.4:2223 --job_name=worker --task_index=0 +``` + +worker on 172.17.0.4: +```bash +export LD_LIBRARY_PATH=native_client/ +./venv/bin/python ./DeepSpeech.py --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv --ps_hosts=172.17.0.3:2224 --worker_hosts=172.17.0.3:2222,172.17.0.4:2223 --job_name=worker --task_index=1 +``` + +parameter server on 172.17.0.3: +```bash +export LD_LIBRARY_PATH=native_client/ +./venv/bin/python ./DeepSpeech.py --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv --ps_hosts=172.17.0.3:2224 --worker_hosts=172.17.0.3:2222,172.17.0.4:2223 --job_name=ps --task_index=0 +``` + +## Running the training with TensorHive + +The TensorHive `task execution` module allows convenient orchestration of distributed trainings. +It is available in the Tasks Overview view. The `CREATE TASKS FROM TEMPLATE` button allows to +conveniently configure tasks supporting a specific framework or distribution method. 
In this +example we choose the Tensorflow - cluster parameters template, and click `GO TO TASK CREATOR`: + +![choose_template](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TensorFlow_ClusterSpec/img/choose_template.png) + +In the task creator, we set the Command to +``` +./venv/bin/python ./DeepSpeech.py --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv +``` + +In order to add the LD_LIBRARY_PATH environment variable, we enter the parameter name, +select Static (the same value for all processes) and click `ADD AS ENV VARIABLE TO ALL TASKS`. +Then, set the appropriate value of the environment variable (native_client/): + + +![env_var](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TensorFlow_ClusterSpec/img/env_var.png) + + +The task creator allows also to conveniently specify other command-line arguments. For example, +should the `train_files`, `dev_files` and `test_files` parameters change throughout the training +processes, they could be handled by `ADD AS PARAMETER TO ALL TASKS`. + +Use `ADD TASK` to create as many copies of the defined process as required and +select the appropriate hostname and resource (CPU/GPU_N) for the specified training process. Change +`job_name` to `ps` for the parameter server process. The resultant +command that will be executed by TensorHive on the selected node will be displayed above the process specification. +Note that the cluster parameters and CUDA_VISIBLE_DEVICES variable are configured automatically: + +![ready](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TensorFlow_ClusterSpec/img/ready.png) + +After clicking the `CREATE ALL TASKS` button, the processes will be available on the process list for future actions. +To run the processes, select them and use the `Spawn selected tasks` button. 
If TensorHive is configured properly, +the task status should change to `running`: + +![running](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/TensorFlow_ClusterSpec/img/running.png) + +Note, that the appropriate process PID will be displayed in the `pid` column. Task overview can +be used to schedule, spawn, stop, kill, edit the tasks, and see logs from their execution. diff --git a/examples/TensorFlow_ClusterSpec/img/choose_template.png b/examples/TensorFlow_ClusterSpec/img/choose_template.png new file mode 100644 index 00000000..76aae07f Binary files /dev/null and b/examples/TensorFlow_ClusterSpec/img/choose_template.png differ diff --git a/examples/TensorFlow_ClusterSpec/img/env_var.png b/examples/TensorFlow_ClusterSpec/img/env_var.png new file mode 100644 index 00000000..6a44ebd0 Binary files /dev/null and b/examples/TensorFlow_ClusterSpec/img/env_var.png differ diff --git a/examples/TensorFlow_ClusterSpec/img/ready.png b/examples/TensorFlow_ClusterSpec/img/ready.png new file mode 100644 index 00000000..d2f16a64 Binary files /dev/null and b/examples/TensorFlow_ClusterSpec/img/ready.png differ diff --git a/examples/TensorFlow_ClusterSpec/img/running.png b/examples/TensorFlow_ClusterSpec/img/running.png new file mode 100644 index 00000000..cf3b19e3 Binary files /dev/null and b/examples/TensorFlow_ClusterSpec/img/running.png differ diff --git a/examples/deepspeech/README.md b/examples/deepspeech/README.md index e331182f..bc3779e9 100644 --- a/examples/deepspeech/README.md +++ b/examples/deepspeech/README.md @@ -4,5 +4,7 @@ The distributed DeepSpeech training application that had been used as a test application and requirement provider towards TensorHive is now in a [separate repository](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV1_DeepSpeech_ldc93s1). 
-See the [TensorHive examples directory](https://github.com/roscisz/TensorHive/tree/master/examples) for -examples how TensorHive task execution module can be used for various training applications. \ No newline at end of file +The application has also been used in the TensorHive task execution +module example for the [TensorFlow ClusterSpec](https://github.com/roscisz/TensorHive/tree/master/examples/TensorFlow_ClusterSpec) +scenario. See also the [TensorHive examples directory](https://github.com/roscisz/TensorHive/tree/master/examples) for other +examples of how the TensorHive task execution module can be used for various training scenarios. \ No newline at end of file diff --git a/examples/t2t_transformer/README.md b/examples/t2t_transformer/README.md index 516d1a85..23fa5e9e 100644 --- a/examples/t2t_transformer/README.md +++ b/examples/t2t_transformer/README.md @@ -5,4 +5,4 @@ application and requirement provider towards TensorHive is now in a [separate repository](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV1_T2T-Transformer_English-German). See the [TensorHive examples directory](https://github.com/roscisz/TensorHive/tree/master/examples) for -examples how TensorHive task execution module can be used for various training applications. +examples of how the TensorHive task execution module can be used for various training scenarios. 
diff --git a/images/task_nursery_screenshot1.png b/images/task_nursery_screenshot1.png deleted file mode 100644 index dbad8558..00000000 Binary files a/images/task_nursery_screenshot1.png and /dev/null differ diff --git a/tensorhive/__init__.py b/tensorhive/__init__.py index 73e3bb4f..80eb7f98 100644 --- a/tensorhive/__init__.py +++ b/tensorhive/__init__.py @@ -1 +1 @@ -__version__ = '0.3.2' +__version__ = '0.3.3' diff --git a/tensorhive/app/web/dist/static/config.json b/tensorhive/app/web/dist/static/config.json index 0dcf5b12..4e82c4d5 100644 --- a/tensorhive/app/web/dist/static/config.json +++ b/tensorhive/app/web/dist/static/config.json @@ -1 +1 @@ -{"apiPath": "http://localhost:1111/api/0.3.1", "version": "0.3.2", "apiVersion": "0.3.1"} +{"apiPath": "http://localhost:1111/api/0.3.1", "version": "0.3.3", "apiVersion": "0.3.1"}
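
The manual PyTorch commands in `examples/PyTorch/README.md` differ only in host, GPU index, network interface and `rank`. The sketch below illustrates the kind of per-process parameter derivation that the PyTorch template is described as automating. It is not TensorHive's implementation: the `Slot` class and the `pytorch_worker_settings` function are hypothetical names introduced here, and only the hosts, GPU indices, interface names, port and flags are taken from the example.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """One reserved GPU (hypothetical helper type, not a TensorHive class)."""
    host: str    # node the process runs on
    gpu: int     # GPU index exposed through CUDA_VISIBLE_DEVICES
    ifname: str  # network interface used by the gloo backend on that node


def pytorch_worker_settings(slots, master_host, port):
    """Derive per-process env variables and distributed flags for each slot.

    Ranks are consecutive, starting from 0, and world-size equals the
    number of slots, as in the manual commands of the example.
    """
    world_size = len(slots)
    settings = []
    for rank, slot in enumerate(slots):
        settings.append({
            "host": slot.host,
            "env": {
                "CUDA_VISIBLE_DEVICES": str(slot.gpu),
                "GLOO_SOCKET_IFNAME": slot.ifname,
            },
            "args": [
                "--init-method", f"tcp://{master_host}:{port}",
                "--backend=gloo",
                f"--rank={rank}",
                f"--world-size={world_size}",
            ],
        })
    return settings


# Reservations from the example: GPUs 1 and 2 on `ai`, GPUs 1 and 7 on `dl`.
slots = [
    Slot("ai", 1, "enp2s0f1"),
    Slot("ai", 2, "enp2s0f1"),
    Slot("dl", 1, "enp1s0f0"),
    Slot("dl", 7, "enp1s0f0"),
]
workers = pytorch_worker_settings(slots, "ai.eti.pg.gda.pl", 20011)
for w in workers:
    print(w["host"], w["env"], " ".join(w["args"]))
```

Deriving `rank` from the position in the list keeps the ranks consecutive no matter how the slots are spread over the nodes, which is the part that is easy to get wrong when the four commands are typed by hand.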