Merge pull request #233 from roscisz/develop
r0.3.1
roscisz authored Nov 27, 2019
2 parents 69212d8 + 2fa039e commit 3c2a4cf
Showing 61 changed files with 1,720 additions and 1,347 deletions.
65 changes: 34 additions & 31 deletions README.md
@@ -1,7 +1,7 @@
TensorHive
===
![](https://img.shields.io/badge/release-v0.3-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/pypi-v0.3-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/release-v0.3.1-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/pypi-v0.3.1-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square)
![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square)
![](https://img.shields.io/badge/hardware-Nvidia-green.svg?style=popout-square)
@@ -22,7 +22,7 @@ Our goal is to provide solutions for painful problems that ML engineers often ha
#### You should really consider using TensorHive if anything described in profiles below matches you:
1. You're an **admin**, who is responsible for managing a cluster (or multiple servers) with powerful GPUs installed.
- :angry: There are more users than resources, so they have to compete for it, but you don't know how to deal with that chaos
- :ocean: Other popular tools are simply an overkill, have different purpose or require a lot of time to spend on reading documentation, installation and configuration (Graphana, Kubernetes, Slurm)
- :ocean: Other popular tools are simply an overkill, have different purpose or require a lot of time to spend on reading documentation, installation and configuration (Grafana, Kubernetes, Slurm)
- :penguin: People using your infrastructure expect only one interface for all the things related to training models (besides terminal): monitoring, reservation calendar and scheduling distributed jobs
- :collision: Can't risk messing up sensitive configuration by installing software on each individual machine, prefering centralized solution which can be managed from one place

@@ -54,54 +54,56 @@ For more details, check out the [full list of features](#features).
Getting started
---------------
### Prerequisites
* All nodes must be accessible via SSH, without password, using SSH Key-Based Authentication ([How to set up SSH keys](https://www.shellhacks.com/ssh-login-without-password/) - explained in [Quickstart section](#basic-usage)
* All nodes must be accessible via SSH, without password, using SSH Key-Based Authentication ([How to set up SSH keys](https://www.shellhacks.com/ssh-login-without-password/) - explained in [Quickstart section](#basic-usage))
* Only NVIDIA GPUs are supported (relying on ```nvidia-smi``` command)
* Currently TensorHive assumes that all users who want to register into the system must have identical UNIX usernames on all nodes configured by TensorHive administrator (not relevant for standalone developers)

### Installation

#### via pip (not updated yet)
#### via pip
```shell
pip install tensorhive
```

#### via conda (not updated yet)
```shell
conda install tensorhive
```

#### From source (recommended)
(optional) For development purposes we encourage separation from your current python packages using e.g. [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
`conda create --name th_env python=3.5 pip; activate th_env`
#### From source
(optional) For development purposes we encourage separation from your current python packages using e.g. virtualenv, Anaconda.

```shell
git clone https://github.com/roscisz/TensorHive.git && cd TensorHive
git checkout fixes/voicelab
pip install -e .
```

TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app` (currently on `master` branch). For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/tensorhive/Makefile).
Build tested with `Node v11.14.0` and `npm 6.7.0`
Build tested with `Node v10.15.2` and `npm 5.8.0`

Basic usage
-----
#### Quickstart
Each command will guide you through basic configuration process:
The `init` command will guide you through basic configuration process:
```
tensorhive init
tensorhive key
```

You can check connectivity with the configured hosts using the `test` command.
```
tensorhive test
```

(optional) If you want to allow your UNIX users to set up their TensorHive accounts on their own and run distributed
programs through `Task nursery` plugin, use the `key` command to generate the SSH key for TensorHive:
```
tensorhive key
```

Now you should be ready to launch a TensorHive instance:
```
tensorhive
```

Web application and API Documentation can be accessed via URLs highlighted in green (Ctrl + click to open in browser)
Web application and API Documentation can be accessed via URLs highlighted in green (Ctrl + click to open in browser).

#### Advanced configuration
You can fully customize TensorHive behaviours via INI configuration (which will be created automatically after `tensorhive init`
You can fully customize TensorHive behaviours via INI configuration files (which will be created automatically after `tensorhive init`):
```
~/.config/TensorHive/main_config.ini
~/.config/TensorHive/mailbot_config.ini
@@ -113,37 +115,35 @@ You can fully customize TensorHive behaviours via INI configuration (which will
Accessible infrastructure can be monitored in the Nodes overview tab. Sample screenshot:
Here you can add new watches, select metrics and monitor ongoing GPU processes and its' owners.

![image](https://user-images.githubusercontent.com/12485656/61520152-d963bd80-aa0d-11e9-9caa-1f7203cc6b42.png)
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/nodes_overview_screenshot.png)

#### GPU Reservation calendar

Each column represents all reservation events for a GPU on a given day.
In order to make a new reservation simply click and drag with your mouse, select GPU(s), add some meaningful title, optionally adjust time range.

If there are many hosts and GPUs in our infrastructure, you can use our simplified, horizontal calendar to quickly identify empty time slots and filter out already reserved GPUs.

(UI prototype: redesign is coming)
![image](https://user-images.githubusercontent.com/12485656/61517527-bb935a00-aa07-11e9-8ea3-9db4a1529e24.png)
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/reservations_overview_screenshot.png)

From now on, **only your processes are eligible to run on reserved GPU(s)**. TensorHive periodically checks if some other user has violated it. He will be spammed with warnings on all his PTYs, emailed every once in a while, additionally admin will also be notified (it all depends on the configuration).

Terminal warning | Email warning
:-------------------------:|:-------------------------:
![image](https://user-images.githubusercontent.com/12485656/61520488-99e9a100-aa0e-11e9-8f35-b02c2e7de9ce.png) | ![image](https://user-images.githubusercontent.com/12485656/61520956-85f26f00-aa0f-11e9-8342-09023c93275b.png)
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/terminal_warning_screenshot.png) | ![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/email_warning_screenshot.png)

#### What admin sees:
#### What admin is e-mailed:

![image](https://user-images.githubusercontent.com/12485656/61520807-4a57a500-aa0f-11e9-8a52-cb87208d6c71.png)
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/admin_warning_screenshot.png)

#### Task nursery

Here you can define commands for tasks you want to run on any configured nodes. You can manage them manually or set spawn/terminate date.
Commands are run within `screen` session, so attaching to it while they are running is a piece of cake.
![image](https://user-images.githubusercontent.com/12485656/61518173-4163d500-aa09-11e9-9916-59c907c1590c.png)
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/task_nursery_screenshot1.png)

It provides quite simple, but flexible (**framework-agnostic**) command templating mechanism that will help you automate multi-node trainings.
TensorHive requires that users who want to use this feature must append TensorHive's public key to their `~/.ssh/authorized_keys` on all nodes they want to connect to.
![image](https://user-images.githubusercontent.com/12485656/61518418-bcc58680-aa09-11e9-8943-88bddc964417.png)
![image](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/task_nursery_screenshot2.png)

Features
----------------------
@@ -219,7 +219,7 @@ TensorHive architecture (simplified)

This diagram will help you to grasp the rough concept of the system.

![TensorHive_diagram _final](https://user-images.githubusercontent.com/12485656/59147556-7853cd80-89fd-11e9-80bc-5848e95c7574.png)
![TensorHive_diagram _final](https://raw.githubusercontent.com/roscisz/TensorHive/master/images/architecture.png)


Contibution and feedback
@@ -229,12 +229,15 @@ Contibution and feedback
We'd :heart: to collect your observations, issues and pull requests!

Feel free to **report any configuration problems, we will help you**.
We plan to redesign the UI/UX side as well as improve reliability of the system until September 2019 :shipit:, so stay tuned!

We plan to develop examples of running distributed DNN training applications
in `Task nursery` along with templates for TF_CONFIG and PyTorch, deadline - March 2020 :shipit:, so stay tuned!

Credits
-------

TensorHive has been created within a joint project between [**VoiceLab.ai**](https://voicelab.ai) and

TensorHive has been greatly supported within a joint project between [**VoiceLab.ai**](https://voicelab.ai) and
[**Gdańsk University of Technology**](https://pg.edu.pl/) titled: "Exploration and selection of methods
for parallelization of neural network training using multiple GPUs".

Binary file added images/admin_warning_screenshot.png
Binary file added images/architecture.png
Binary file added images/email_warning_screenshot.png
Binary file modified images/nodes_overview_screenshot.png
Binary file modified images/reservations_overview_screenshot.png
Binary file added images/task_nursery_screenshot1.png
Binary file added images/terminal_warning_screenshot.png
2 changes: 1 addition & 1 deletion requirements-dev.txt
@@ -1,4 +1,4 @@
pytest==4.0.1
pytest==5.3.1
pytest-faker==2.0.0
pytest-env==0.6.2
alembic==1.0.3
4 changes: 2 additions & 2 deletions setup.py
@@ -23,7 +23,7 @@
install_requires=[
'parallel-ssh==1.9.1',
'passlib==1.7.1',
'sqlalchemy==1.2.14',
'sqlalchemy==1.3.0',
'sqlalchemy-utils==0.33.8',
'click==7.0',
'connexion==1.5.3',
@@ -32,7 +32,7 @@
'gunicorn==19.9.0',
'coloredlogs==10.0',
'Safe==0.4',
'python-usernames==0.2.2'
'python-usernames==0.2.3'
],
zip_safe=False
)
2 changes: 1 addition & 1 deletion tensorhive/__init__.py
@@ -1 +1 @@
__version__ = '0.3'
__version__ = '0.3.1'
2 changes: 1 addition & 1 deletion tensorhive/api/APIServer.py
@@ -34,7 +34,7 @@ def shutdown_session(exception=None):
CORS(app.app)
log.info('[⚙] Starting API server with {} backend'.format(API_SERVER.BACKEND))
URL = 'http://{host}:{port}/{url_prefix}/ui/'.format(
host=API_SERVER.HOST,
host=API.URL_HOSTNAME,
port=API_SERVER.PORT,
url_prefix=API.URL_PREFIX)
log.info(green('[✔] API documentation (Swagger UI) available at: {}'.format(URL)))
24 changes: 22 additions & 2 deletions tensorhive/api/api_specification.yml
@@ -50,7 +50,13 @@ paths:
200:
description: {{RESPONSES['users']['get']['success']}}
schema:
$ref: '#/definitions/UserToDisplay'
type: object
properties:
msg:
type: string
example: {{RESPONSES['user']['get']['success']}}
user:
$ref: '#/definitions/UserToDisplay'
401:
description: {{RESPONSES['general']['unauthorized']}}
403:
@@ -61,8 +67,18 @@ paths:
msg:
type: string
example: {{RESPONSES['general']['unpriviliged']}}
404:
description: {{RESPONSES['user']['not_found']}}
schema:
type: object
properties:
msg:
type: string
example: {{RESPONSES['user']['not_found']}}
422:
description: {{RESPONSES['general']['auth_error']}}
500:
description: {{RESPONSES['general']['internal_error']}}
security:
- Bearer: []
/user/create:
@@ -867,7 +883,7 @@ paths:
path:
type: string
example: ~/TensorHiveLogs/task_99.log
stdout_lines:
output_lines:
type: array
items:
type: string
@@ -1261,6 +1277,7 @@ definitions:
- description
- resourceId
- userId
- userName
- gpuUtilAvg
- memUtilAvg
- start
@@ -1287,6 +1304,9 @@ definitions:
userId:
type: integer
example: 1
userName:
type: string
example: Example owner's username
gpuUtilAvg:
type: integer
example: 99
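
Reading aid for the first two hunks of this spec diff: the 200 response of the user lookup changes from a bare `UserToDisplay` to an object with `msg` and `user` keys, and a 404 with its own `msg` is added. A client might unwrap such a response roughly as sketched below. This is only an illustration under stated assumptions, not TensorHive code: `unwrap_user_response`, `UserLookupError` and the fields inside the sample `user` object are hypothetical; only the `msg` and `user` keys and the status codes are taken from the diff.

```python
class UserLookupError(Exception):
    """Raised for any non-200 answer (hypothetical helper type)."""

    def __init__(self, status_code, message):
        super().__init__('{}: {}'.format(status_code, message))
        self.status_code = status_code
        self.message = message


def unwrap_user_response(status_code, body):
    """Return the `user` object from an already decoded JSON body.

    Assumes the post-change shape {"msg": ..., "user": {...}} for 200.
    Error bodies may or may not carry `msg` (the diff only defines a
    schema for 403 and 404), so a fallback text is used when it is missing.
    """
    if status_code == 200:
        return body['user']
    message = 'no message provided'
    if isinstance(body, dict):
        message = body.get('msg', message)
    raise UserLookupError(status_code, message)


# Usage with made-up payloads:
user = unwrap_user_response(200, {'msg': 'ok', 'user': {'id': 1}})
```

Code written against the old shape, which read user fields from the top level of the body, would need to be updated to read them from `user`.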
4 changes: 2 additions & 2 deletions tensorhive/app/web/AppServer.py
@@ -51,7 +51,7 @@ def _inject_api_endpoint_to_app():
web_app_json_config_path = PosixPath(__file__).parent / 'dist/static/config.json'
data = {
'apiPath': 'http://{}:{}/{}'.format(
API_SERVER.HOST,
API.URL_HOSTNAME,
API_SERVER.PORT,
API.URL_PREFIX),
'version': tensorhive.__version__,
@@ -79,7 +79,7 @@ def start_server():
'workers': APP_SERVER.WORKERS,
'loglevel': APP_SERVER.LOG_LEVEL
}
log.info(green('[✔] Web App avaliable at: http://' + options['bind']))
log.info(green('[✔] Web App available at: http://{}:{}'.format(API.URL_HOSTNAME, APP_SERVER.PORT)))
GunicornStandaloneApplication(app, options).run()
else:
raise NotImplementedError('Selected backend is not supported yet.')
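
Both `APIServer.py` and `AppServer.py` now build the URLs they log and inject from `API.URL_HOSTNAME` instead of `API_SERVER.HOST`. The diff does not state the motivation; one plausible reading (an assumption) is that the server's bind address, such as `0.0.0.0`, is not an address a browser can open, while a separately configured hostname is. The sketch below reproduces the new string formatting with stand-in config objects. The attribute names come from the diff; the namedtuples, the helper function names and every value are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-ins for TensorHive's config classes; values are made up.
ApiConfig = namedtuple('ApiConfig', ['URL_HOSTNAME', 'URL_PREFIX'])
ServerConfig = namedtuple('ServerConfig', ['HOST', 'PORT'])

API = ApiConfig(URL_HOSTNAME='gpu-cluster.example.org', URL_PREFIX='api')
API_SERVER = ServerConfig(HOST='0.0.0.0', PORT=1111)
APP_SERVER = ServerConfig(HOST='0.0.0.0', PORT=5000)


def swagger_ui_url():
    """Swagger UI URL, formatted as in the patched APIServer.py."""
    return 'http://{host}:{port}/{url_prefix}/ui/'.format(
        host=API.URL_HOSTNAME,
        port=API_SERVER.PORT,
        url_prefix=API.URL_PREFIX)


def injected_api_path():
    """`apiPath` value, formatted as in the patched AppServer.py."""
    return 'http://{}:{}/{}'.format(
        API.URL_HOSTNAME, API_SERVER.PORT, API.URL_PREFIX)


def web_app_url():
    """Web app URL, formatted as in the patched start_server() log line."""
    return 'http://{}:{}'.format(API.URL_HOSTNAME, APP_SERVER.PORT)


print(swagger_ui_url())
print(injected_api_path())
print(web_app_url())
```

With these sample values, none of the three URLs contains the bind address, which is the visible effect of the change.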