Add dist training support for UAI Training system.
UAI Train now supports TensorFlow and MXNet distributed training.

uaitrain/arch/tensorflow/uai_dist.py provides the TensorFlow dist config parser.
Also adds examples of distributed training:
1. cifar
2. slim
3. wide-deep
宋翔 committed Apr 16, 2018
1 parent 7379d94 commit 3301852
Showing 9 changed files with 1,882 additions and 19 deletions.
7 changes: 7 additions & 0 deletions README.md
@@ -26,12 +26,19 @@
- Tensorflow(1.2.0 tested)
- Tensorflow(1.3.0 tested)
- Tensorflow(1.4.0 tested)
- Tensorflow(1.5.0 tested)
- Tensorflow(1.6.0 tested)
- MXNet(0.9.5 tested)
- MXNet(1.0.0 tested)
- Keras(1.2.0 tested)
- Caffe(1.0.0 tested)
- Caffe2(Detectron)
- PyTorch(0.2.0)

### UAI Train Supporting Distributed Training
- Tensorflow Distributed Training (examples include slim/cifar/wide-deep)
- MXNet Distributed Training

## How to install
1. Install your deep learning python package, such as Tensorflow, MXNet, Keras, Caffe (tested version preferred)
2. Install UCloud Ufile SDK (https://docs.ucloud.cn/storage_cdn/ufile/tools)
30 changes: 27 additions & 3 deletions examples/tensorflow/train/cifar/README.md
@@ -8,13 +8,14 @@ Code here is an example of how to run tensorflow cifar example on UAI Train plat

The original code can be seen here: https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator

## UAI Example

We make the following modifications to cifar10_main.py:

1. Add UAI SDK related arguments: --data\_dir, --output\_dir, --work\_dir, --log\_dir, --num\_gpus. These arguments are auto-generated by the UAI Train Platform; see https://github.com/ucloud/uai-sdk/blob/master/uaitrain/arch/tensorflow/uflag.py for more details
2. Modify the code to use the UAI arguments: use data_dir as the input dir and output_dir as the model output dir
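The two modifications above can be sketched as follows. This is a minimal, hypothetical sketch of the flag definitions described in step 1; the real definitions live in uflag.py, and the default paths used here are illustrative:

```python
# Hypothetical sketch of the UAI Train flags listed above.
# The actual definitions are in uaitrain/arch/tensorflow/uflag.py.
import argparse

def add_uai_arguments(parser):
    # These values are injected by the UAI Train Platform at run time.
    parser.add_argument("--data_dir", type=str, default="/data/data",
                        help="input data directory")
    parser.add_argument("--output_dir", type=str, default="/data/output",
                        help="model output directory")
    parser.add_argument("--work_dir", type=str, default="/data",
                        help="working directory")
    parser.add_argument("--log_dir", type=str, default="/data/output",
                        help="log directory")
    parser.add_argument("--num_gpus", type=int, default=1,
                        help="number of GPUs available to the job")
    return parser

parser = add_uai_arguments(argparse.ArgumentParser())
# Step 2: the training script then reads args.data_dir / args.output_dir
# instead of its own hard-coded paths.
args = parser.parse_args(["--data_dir=/data/data", "--num_gpus=2"])
print(args.data_dir, args.num_gpus)
```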

### How to run
We assume you fully understand how the UAI Train docker image works and have already read the docs here: https://docs.ucloud.cn/ai/uai-train/guide/tensorflow

1. Follow https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10\_estimator to generate tfrecords of cifar10
@@ -40,4 +41,27 @@
--train_params="--batch_size=128"

Note:
The tfrecords should be stored in LOCAL\_PATH\_TO\_CIFAR10\_DATA\_FOR\_TEST in this example.

### Run Distributed Training
As the cifar10 example code uses the tf.estimator.Estimator API, it can run distributed training directly. You only need a distributed training environment and a dist-config.

A standard TF\_CONFIG (compatible with the tf.estimator.Estimator API) looks like this:

    TF_CONFIG = {
        "cluster": {
            "master": ["ip0:2222"],
            "ps": ["ip0:2223", "ip1:2223"],
            "worker": ["ip1:2222"]
        },
        "task": {"type": "worker", "index": 0},
        "environment": "cloud"
    }

You can generate TF\_CONFIG for each node in your cluster and run the training.
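One way to generate that per-node config is sketched below. This is a minimal sketch under the assumption of the two-machine cluster shown above; the addresses and the `make_tf_config` helper are illustrative, not part of the UAI SDK:

```python
# Sketch: building the TF_CONFIG JSON for each node of a hypothetical
# two-machine cluster (addresses are placeholders).
import json
import os

cluster = {
    "master": ["ip0:2222"],
    "ps": ["ip0:2223", "ip1:2223"],
    "worker": ["ip1:2222"],
}

def make_tf_config(cluster, task_type, task_index):
    # tf.estimator reads this JSON from the TF_CONFIG environment variable.
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
        "environment": "cloud",
    })

# On the first worker node, for example, you would export:
os.environ["TF_CONFIG"] = make_tf_config(cluster, "worker", 0)
print(os.environ["TF_CONFIG"])
```

Each node gets the same "cluster" map but its own "task" entry, which is how every process learns its role before training starts.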

### Run Distributed Training On UAI Platform
UAI Train Platform can dynamically deploy the training cluster and generate the TF\_CONFIG for each training node. You only need to run the training cmd as:

/data/cifar10_main.py --train-batch-size=16

For more details please see https://docs.ucloud.cn/ai/uai-train.
115 changes: 115 additions & 0 deletions examples/tensorflow/train/slim/README.md
@@ -0,0 +1,115 @@
# TF Slim Example
TF-slim is a lightweight high-level API of TensorFlow (tensorflow.contrib.slim) for defining, training and evaluating complex models. Here we show how to run TF-slim example code on the UAI Platform. This example is based on https://github.com/tensorflow/models/tree/master/research/slim

## Setup
You should follow the "Preparing the datasets" section of https://github.com/tensorflow/models/tree/master/research/slim to download and prepare the tfrecord format data. For the imagenet dataset it will look like:

$ ls /data/imagenet/tf-record/
train-00000-of-01024.tfrecord
train-00001-of-01024.tfrecord
...
train-01023-of-01024.tfrecord
validation-00000-of-00128.tfrecord
...
validation-00127-of-00128.tfrecord
labels.txt
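The shard files in the listing above follow a fixed `<split>-%05d-of-%05d` naming pattern, so a complete download can be checked by generating the expected names. A small sketch (the `shard_names` helper is ours, not part of the slim tooling; shard counts match the ImageNet listing above):

```python
# Sketch: compute the expected tfrecord shard file names so a downloaded
# dataset directory can be checked for completeness.
def shard_names(split, num_shards):
    # e.g. train-00000-of-01024.tfrecord ... train-01023-of-01024.tfrecord
    return ["%s-%05d-of-%05d.tfrecord" % (split, i, num_shards)
            for i in range(num_shards)]

train_files = shard_names("train", 1024)
val_files = shard_names("validation", 128)
print(train_files[0], val_files[-1])
```

Comparing these lists against `os.listdir("/data/imagenet/tf-record/")` would reveal any missing shard before a long training run starts.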

## Intro
You can directly use the code from https://github.com/tensorflow/models/tree/master/research/slim with some appropriate modifications. We have provided the modified code for you. You can find it under examples/tensorflow/train/slim/code/

## UAI Example
We make the following modifications to train\_image\_classifier.py:

Add the UAI SDK related argument parser: --data\_dir, --output\_dir, --work\_dir, --log\_dir, --num\_gpus. These arguments are auto-generated by the UAI Train Platform and are defined in https://github.com/ucloud/uai-sdk/blob/master/uaitrain/arch/tensorflow/uflag.py.

# L16 in code/train_image_classifier.py
# import uflag from UAI-SDK
from preprocessing import preprocessing_factory
from uaitrain.arch.tensorflow import uflag
from uaitrain.arch.tensorflow import uai_dist

# L486 in code/train_image_classifier.py
# Translate UAI Args to Slim Args:
# FLAGS.data_dir for data input path
# FLAGS.output_dir for checkpoints and event logs
# FLAGS.num_gpus for num of gpus
def main(_):
FLAGS.dataset_dir = FLAGS.data_dir
FLAGS.train_dir = FLAGS.output_dir
FLAGS.num_clones = FLAGS.num_gpus

### Packing TF-Slim code
After preparing the code (by replacing train_image_classifier.py), you can pack the tf-slim docker image. We provide slim.Dockerfile:

FROM uhub.service.ucloud.cn/uaishare/gpu_uaitrain_ubuntu-16.04_python-2.7.6_tensorflow_models:v1.8.0

COPY ./slim/ /data/

It uses the base image provided by UAI (tensorflow 1.6 with tf-models 1.8.0) and copies the code into /data/.

Once you have the slim code, you can do the following steps to build your own UAI slim image:

$ cd ${SLIM_CODE_DIR}
$ ls slim/
BUILD datasets/ deployment/ ... train_image_classifier.py WORKSPACE
$ cp ${UAI-SLIM_EXAMPLE}/slim.Dockerfile .
$ cp ${UAI-SLIM_EXAMPLE}/train_image_classifier.py ./slim/
$ sudo docker build -t slim:test -f slim.Dockerfile .

### Run Single Node Training
You can use the docker image slim:test to run slim training locally with the following cmd:

$ sudo nvidia-docker run -v /data/imagenet/tf-record/:/data/data/ -v /data/output/slim/:/data/output -it slim:test /bin/bash -c "python /data/train_image_classifier.py --data_dir=/data/data/ --output_dir=/data/output/ --num_gpus=1 --model_name=vgg_19"

You can also run this docker image on UAI Train Platform. For more details please see https://docs.ucloud.cn/ai/uai-train.

### Run Distributed Training
The Slim code also supports distributed training. It requires a distributed training environment and a dist-config. We provide the dist-training example code in code/train\_image\_classifier.py. It accepts the standard Tensorflow Dist Config from TF\_CONFIG. (To learn more about distributed training please refer to https://www.tensorflow.org/deploy/distributed).

A standard TF\_CONFIG (compatible with the tf.estimator.Estimator API) looks like this:

    TF_CONFIG = {
        "cluster": {
            "master": ["ip0:2222"],
            "ps": ["ip0:2223", "ip1:2223"],
            "worker": ["ip1:2222"]
        },
        "task": {"type": "worker", "index": 0},
        "environment": "cloud"
    }

We add a function \_get\_variables\_to\_train to code/train\_image\_classifier.py that parses TF\_CONFIG and starts the ps and worker servers.
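The parsing half of that step can be sketched in plain Python. This is a simplified, hypothetical sketch, not the actual code in code/train\_image\_classifier.py or uai\_dist.py; the `parse_tf_config` helper and its defaults are ours:

```python
# Simplified sketch of parsing TF_CONFIG to decide each node's role.
# The real logic lives in the modified train_image_classifier.py and in
# uaitrain/arch/tensorflow/uai_dist.py.
import json
import os

def parse_tf_config(env=os.environ):
    config = json.loads(env.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {"type": "master", "index": 0})
    return cluster, task["type"], task["index"]

# With TensorFlow, the cluster dict would then feed tf.train.Server, e.g.:
#   server = tf.train.Server(tf.train.ClusterSpec(cluster),
#                            job_name=task_type, task_index=task_index)
# A ps node blocks in server.join(); master/worker nodes run training.
env = {"TF_CONFIG": json.dumps({
    "cluster": {"master": ["ip0:2222"], "ps": ["ip0:2223"],
                "worker": ["ip1:2222"]},
    "task": {"type": "ps", "index": 0}})}
cluster, task_type, task_index = parse_tf_config(env)
print(task_type, task_index)
```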

We also modify the deployment/model\_deploy.py to control the device assignment func for clone devices of DeploymentConfig:

# Line 578 in code/deployment/model\_deploy.py
# add task id for each replica
if self._num_ps_tasks > 0:
device = '%s/task:%d' % (device, self._replica_id)

return device
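The effect of that two-line change can be seen in isolation. A small sketch (the `clone_device` wrapper is ours; in the real code the logic sits inside DeploymentConfig in deployment/model\_deploy.py):

```python
# Sketch of the device-assignment change above: when ps tasks exist,
# append the replica's task id so each clone is pinned to its own worker.
def clone_device(device, num_ps_tasks, replica_id):
    if num_ps_tasks > 0:
        device = "%s/task:%d" % (device, replica_id)
    return device

# In single-node training (no ps tasks) the device string is unchanged;
# in dist training each replica gets a distinct /task:N suffix.
print(clone_device("/job:worker", num_ps_tasks=2, replica_id=1))
```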

To pack the slim-dist docker image, you should do the following steps:

$ cd ${SLIM_CODE_DIR}
$ ls slim/
BUILD datasets/ deployment/ ... train_image_classifier.py WORKSPACE
$ cp ${UAI-SLIM_EXAMPLE}/slim.Dockerfile .
$ cp ${UAI-SLIM_EXAMPLE}/train_image_classifier.py ./slim/
$ cp ${UAI-SLIM_EXAMPLE}/deployment/model_deploy.py ./slim/deployment
$ sudo docker build -t slim-dist:test -f slim.Dockerfile .

Now you can run the dist train with the same cmd as local training:

python /data/train_image_classifier.py --data_dir=/data/data/ --output_dir=/data/output/ --num_gpus=1 --model_name=vgg_19

Note: you should generate TF\_CONFIG for each node in the cluster.

Note: This example runs the training in async-replica mode.

#### Run Distributed Training On UAI Platform
UAI Train Platform can dynamically deploy the training cluster and generate the TF\_CONFIG for each training node. You only need to run the training cmd as:

/data/train_image_classifier.py --batch_size=64 --model_name=vgg_19

For more details please see https://docs.ucloud.cn/ai/uai-train.
