Add dist training support for UAI Training system.
UAI Train now supports TensorFlow and MXNet distributed training.

uaitrain/arch/tensorflow/uai_dist.py provides the TensorFlow dist config parser.
Also adds examples of distributed training:
1. cifar
2. slim
3. wide-deep
宋翔 committed Apr 16, 2018
1 parent 7379d94 commit 3301852
Showing 9 changed files with 1,882 additions and 19 deletions.
7 changes: 7 additions & 0 deletions README.md
@@ -26,12 +26,19 @@
- Tensorflow(1.2.0 tested)
- Tensorflow(1.3.0 tested)
- Tensorflow(1.4.0 tested)
- Tensorflow(1.5.0 tested)
- Tensorflow(1.6.0 tested)
- MXNet(0.9.5 tested)
- MXNet(1.0.0 tested)
- Keras(1.2.0 tested)
- Caffe(1.0.0 tested)
- Caffe2(Detectron)
- PyTorch(0.2.0)

### UAI Train Supporting Distributed Training
- Tensorflow Distributed Training (examples include slim/cifar/wide-deep)
- MXNet Distributed Training

## How to install
1. Install your deep learning python package, such as Tensorflow, MXNet, Keras, Caffe (tested version preferred)
2. Install UCloud Ufile SDK (https://docs.ucloud.cn/storage_cdn/ufile/tools)
30 changes: 27 additions & 3 deletions examples/tensorflow/train/cifar/README.md
@@ -8,13 +8,14 @@ Code here is an example of how to run tensorflow cifar example on UAI Train plat

The original code can be seen here: https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator

## UAI Example

We make the following modifications to cifar10_main.py:

1. Add UAI SDK related arguments: --data\_dir, --output\_dir, --work\_dir, --log\_dir, --num\_gpus. These arguments are auto-generated by the UAI Train Platform; see https://github.com/ucloud/uai-sdk/blob/master/uaitrain/arch/tensorflow/uflag.py for more details
2. Modify the code to use the UAI arguments: use data_dir as the input dir and output_dir as the model output dir
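The two modifications above can be sketched as follows. This is a minimal, hypothetical sketch of the flag definitions described in step 1; the real definitions live in uflag.py, and the default paths used here are illustrative:

```python
# Hypothetical sketch of the UAI Train flags listed above.
# The actual definitions are in uaitrain/arch/tensorflow/uflag.py.
import argparse

def add_uai_arguments(parser):
    # These values are injected by the UAI Train Platform at run time.
    parser.add_argument("--data_dir", type=str, default="/data/data",
                        help="input data directory")
    parser.add_argument("--output_dir", type=str, default="/data/output",
                        help="model output directory")
    parser.add_argument("--work_dir", type=str, default="/data",
                        help="working directory")
    parser.add_argument("--log_dir", type=str, default="/data/output",
                        help="log directory")
    parser.add_argument("--num_gpus", type=int, default=1,
                        help="number of GPUs available to the job")
    return parser

parser = add_uai_arguments(argparse.ArgumentParser())
# Step 2: the training script then reads args.data_dir / args.output_dir
# instead of its own hard-coded paths.
args = parser.parse_args(["--data_dir=/data/data", "--num_gpus=2"])
print(args.data_dir, args.num_gpus)
```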

### How to run
We assume you fully understand how the UAI Train docker image works and have already read the docs here: https://docs.ucloud.cn/ai/uai-train/guide/tensorflow

1. Follow https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10\_estimator to generate tfrecords of cifar10
@@ -40,4 +41,27 @@
--train_params="--batch_size=128"

Note:
The tfrecords should be stored in LOCAL\_PATH\_TO\_CIFAR10\_DATA\_FOR\_TEST in this example.

### Run Distributed Training
As the cifar10 example code uses the tf.estimator.Estimator API, it can run distributed training directly. You only need a distributed training environment and a dist-config.

A standard TF\_CONFIG (compatible with the tf.estimator.Estimator API) looks like this:

    TF_CONFIG = {
        "cluster": {
            "master": ["ip0:2222"],
            "ps": ["ip0:2223", "ip1:2223"],
            "worker": ["ip1:2222"]
        },
        "task": {"type": "worker", "index": 0},
        "environment": "cloud"
    }

You can generate TF\_CONFIG for each node in your cluster and run the training.
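One way to generate that per-node config is sketched below. This is a minimal sketch under the assumption of the two-machine cluster shown above; the addresses and the `make_tf_config` helper are illustrative, not part of the UAI SDK:

```python
# Sketch: building the TF_CONFIG JSON for each node of a hypothetical
# two-machine cluster (addresses are placeholders).
import json
import os

cluster = {
    "master": ["ip0:2222"],
    "ps": ["ip0:2223", "ip1:2223"],
    "worker": ["ip1:2222"],
}

def make_tf_config(cluster, task_type, task_index):
    # tf.estimator reads this JSON from the TF_CONFIG environment variable.
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
        "environment": "cloud",
    })

# On the first worker node, for example, you would export:
os.environ["TF_CONFIG"] = make_tf_config(cluster, "worker", 0)
print(os.environ["TF_CONFIG"])
```

Each node gets the same "cluster" map but its own "task" entry, which is how every process learns its role before training starts.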

### Run Distributed Training On UAI Platform
UAI Train Platform can dynamically deploy the training cluster and generate the TF\_CONFIG for each training node. You only need to run the training cmd as:

/data/cifar10_main.py --train-batch-size=16

For more details please see https://docs.ucloud.cn/ai/uai-train.
115 changes: 115 additions & 0 deletions examples/tensorflow/train/slim/README.md
@@ -0,0 +1,115 @@
# TF Slim Example
TF-slim is a lightweight high-level API of TensorFlow (tensorflow.contrib.slim) for defining, training and evaluating complex models. Here we show how to run TF-slim example code on the UAI Platform. This example is based on https://github.com/tensorflow/models/tree/master/research/slim

## Setup
You should follow the "Preparing the datasets" section of https://github.com/tensorflow/models/tree/master/research/slim to download and prepare the tfrecord format data. For the imagenet dataset it will look like:

$ ls /data/imagenet/tf-record/
train-00000-of-01024.tfrecord
train-00001-of-01024.tfrecord
...
train-01023-of-01024.tfrecord
validation-00000-of-00128.tfrecord
...
validation-00127-of-00128.tfrecord
labels.txt
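The shard files in the listing above follow a fixed `<split>-%05d-of-%05d` naming pattern, so a complete download can be checked by generating the expected names. A small sketch (the `shard_names` helper is ours, not part of the slim tooling; shard counts match the ImageNet listing above):

```python
# Sketch: compute the expected tfrecord shard file names so a downloaded
# dataset directory can be checked for completeness.
def shard_names(split, num_shards):
    # e.g. train-00000-of-01024.tfrecord ... train-01023-of-01024.tfrecord
    return ["%s-%05d-of-%05d.tfrecord" % (split, i, num_shards)
            for i in range(num_shards)]

train_files = shard_names("train", 1024)
val_files = shard_names("validation", 128)
print(train_files[0], val_files[-1])
```

Comparing these lists against `os.listdir("/data/imagenet/tf-record/")` would reveal any missing shard before a long training run starts.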

## Intro
You can directly use the code from https://github.com/tensorflow/models/tree/master/research/slim with some appropriate modifications. We have provided the modified code for you. You can find it under examples/tensorflow/train/slim/code/

## UAI Example
We make the following modifications to train\_image\_classifier.py:

Add the UAI SDK related argument parser: --data\_dir, --output\_dir, --work\_dir, --log\_dir, --num\_gpus. These arguments are auto-generated by the UAI Train Platform and are defined in https://github.com/ucloud/uai-sdk/blob/master/uaitrain/arch/tensorflow/uflag.py.

# L16 in code/train_image_classifier.py
# import uflag from UAI-SDK
from preprocessing import preprocessing_factory
from uaitrain.arch.tensorflow import uflag
from uaitrain.arch.tensorflow import uai_dist

# L486 in code/train_image_classifier.py
# Translate UAI Args to Slim Args:
# FLAGS.data_dir for data input path
# FLAGS.output_dir for checkpoints and event logs
# FLAGS.num_gpus for num of gpus
def main(_):
FLAGS.dataset_dir = FLAGS.data_dir
FLAGS.train_dir = FLAGS.output_dir
FLAGS.num_clones = FLAGS.num_gpus

### Packing TF-Slim code
After preparing the code (by replacing train_image_classifier.py), you can pack the tf-slim docker image. We provide slim.Dockerfile:

FROM uhub.service.ucloud.cn/uaishare/gpu_uaitrain_ubuntu-16.04_python-2.7.6_tensorflow_models:v1.8.0

COPY ./slim/ /data/

It uses the base image provided by UAI (tensorflow 1.6 with tf-models 1.8.0) and copies the code into /data/.

Once you have the slim code, you can do the following steps to build your own UAI slim image:

$ cd ${SLIM_CODE_DIR}
$ ls slim/
BUILD datasets/ deployment/ ... train_image_classifier.py WORKSPACE
$ cp ${UAI-SLIM_EXAMPLE}/slim.Dockerfile .
$ cp ${UAI-SLIM_EXAMPLE}/train_image_classifier.py ./slim/
$ sudo docker build -t slim:test -f slim.Dockerfile .

### Run Single Node Training
You can use the docker image slim:test to run slim training locally with the following cmd:

$ sudo nvidia-docker run -v /data/imagenet/tf-record/:/data/data/ -v /data/output/slim/:/data/output -it slim:test /bin/bash -c "python /data/train_image_classifier.py --data_dir=/data/data/ --output_dir=/data/output/ --num_gpus=1 --model_name=vgg_19"

You can also run this docker image on UAI Train Platform. For more details please see https://docs.ucloud.cn/ai/uai-train.

### Run Distributed Training
The Slim code also supports distributed training. It requires a distributed training environment and a dist-config. We provide the dist-training example code in code/train\_image\_classifier.py. It accepts the standard Tensorflow Dist Config from TF\_CONFIG. (To learn more about distributed training please refer to https://www.tensorflow.org/deploy/distributed).

A standard TF\_CONFIG (compatible with the tf.estimator.Estimator API) looks like this:

    TF_CONFIG = {
        "cluster": {
            "master": ["ip0:2222"],
            "ps": ["ip0:2223", "ip1:2223"],
            "worker": ["ip1:2222"]
        },
        "task": {"type": "worker", "index": 0},
        "environment": "cloud"
    }

We add a function \_get\_variables\_to\_train to code/train\_image\_classifier.py that parses TF\_CONFIG and starts the ps and worker servers.
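The parsing half of that step can be sketched in plain Python. This is a simplified, hypothetical sketch, not the actual code in code/train\_image\_classifier.py or uai\_dist.py; the `parse_tf_config` helper and its defaults are ours:

```python
# Simplified sketch of parsing TF_CONFIG to decide each node's role.
# The real logic lives in the modified train_image_classifier.py and in
# uaitrain/arch/tensorflow/uai_dist.py.
import json
import os

def parse_tf_config(env=os.environ):
    config = json.loads(env.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {"type": "master", "index": 0})
    return cluster, task["type"], task["index"]

# With TensorFlow, the cluster dict would then feed tf.train.Server, e.g.:
#   server = tf.train.Server(tf.train.ClusterSpec(cluster),
#                            job_name=task_type, task_index=task_index)
# A ps node blocks in server.join(); master/worker nodes run training.
env = {"TF_CONFIG": json.dumps({
    "cluster": {"master": ["ip0:2222"], "ps": ["ip0:2223"],
                "worker": ["ip1:2222"]},
    "task": {"type": "ps", "index": 0}})}
cluster, task_type, task_index = parse_tf_config(env)
print(task_type, task_index)
```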

We also modify the deployment/model\_deploy.py to control the device assignment func for clone devices of DeploymentConfig:

# Line 578 in code/deployment/model\_deploy.py
# add task id for each replica
if self._num_ps_tasks > 0:
device = '%s/task:%d' % (device, self._replica_id)

return device
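The effect of that two-line change can be seen in isolation. A small sketch (the `clone_device` wrapper is ours; in the real code the logic sits inside DeploymentConfig in deployment/model\_deploy.py):

```python
# Sketch of the device-assignment change above: when ps tasks exist,
# append the replica's task id so each clone is pinned to its own worker.
def clone_device(device, num_ps_tasks, replica_id):
    if num_ps_tasks > 0:
        device = "%s/task:%d" % (device, replica_id)
    return device

# In single-node training (no ps tasks) the device string is unchanged;
# in dist training each replica gets a distinct /task:N suffix.
print(clone_device("/job:worker", num_ps_tasks=2, replica_id=1))
```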

To pack the slim-dist docker image, you should do the following steps:

$ cd ${SLIM_CODE_DIR}
$ ls slim/
BUILD datasets/ deployment/ ... train_image_classifier.py WORKSPACE
$ cp ${UAI-SLIM_EXAMPLE}/slim.Dockerfile .
$ cp ${UAI-SLIM_EXAMPLE}/train_image_classifier.py ./slim/
$ cp ${UAI-SLIM_EXAMPLE}/deployment/model_deploy.py ./slim/deployment
$ sudo docker build -t slim-dist:test -f slim.Dockerfile .

Now you can run the dist train with the same cmd as local training:

python /data/train_image_classifier.py --data_dir=/data/data/ --output_dir=/data/output/ --num_gpus=1 --model_name=vgg_19

Note: you should generate TF\_CONFIG for each node in the cluster.

Note: This example runs the training in async-replica mode.

#### Run Distributed Training On UAI Platform
UAI Train Platform can dynamically deploy the training cluster and generate the TF\_CONFIG for each training node. You only need to run the training cmd as:

/data/train_image_classifier.py --batch_size=64 --model_name=vgg_19

For more details please see https://docs.ucloud.cn/ai/uai-train.
