Add dist training support for UAI Training system.

UAI Train now supports TensorFlow and MXNet distributed training. uaitrain/arch/tensorflow/uai_dist.py provides the TensorFlow dist-config parser. Also adds examples of distributed training: 1. cifar 2. slim 3. wide-deep
宋翔 committed Apr 16, 2018 · 1 parent 7379d94 · commit 3301852 · 9 changed files with 1,882 additions and 19 deletions
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
# TF Slim Example

TF-slim is a lightweight high-level API of TensorFlow (tensorflow.contrib.slim) for defining, training, and evaluating complex models. Here we show how to run the TF-slim example code on the UAI Platform. This example is based on https://github.com/tensorflow/models/tree/master/research/slim

## Setup
Follow the "Preparing the datasets" section of https://github.com/tensorflow/models/tree/master/research/slim to download and prepare the data in tfrecord format. For the imagenet dataset the directory will look like:

```
$ ls /data/imagenet/tf-record/
train-00000-of-01024.tfrecord
train-00001-of-01024.tfrecord
...
train-01023-of-01024.tfrecord
validation-00000-of-00128.tfrecord
...
validation-00127-of-00128.tfrecord
labels.txt
```

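The sharded layout above follows TensorFlow's `<split>-%05d-of-%05d.tfrecord` naming pattern. As a minimal sketch (assuming the shard counts shown in the listing, and hypothetical helper names), you can compute the expected file names to verify a download is complete:

```python
# Sanity-check helper for the sharded tfrecord layout shown above.
# Assumes the standard "<split>-%05d-of-%05d.tfrecord" naming with
# 1024 train shards and 128 validation shards, as in the listing.
import os

def expected_shards(split, num_shards):
    """Return the expected shard file names for one dataset split."""
    return ['%s-%05d-of-%05d.tfrecord' % (split, i, num_shards)
            for i in range(num_shards)]

def missing_shards(data_dir, split, num_shards):
    """List shard files that are absent from data_dir."""
    return [name for name in expected_shards(split, num_shards)
            if not os.path.exists(os.path.join(data_dir, name))]

print(expected_shards('train', 1024)[0])       # train-00000-of-01024.tfrecord
print(expected_shards('validation', 128)[-1])  # validation-00127-of-00128.tfrecord
```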
## Intro
You can directly use the code from https://github.com/tensorflow/models/tree/master/research/slim with some appropriate modifications. We have provided the modified code for you; you can find it under examples/tensorflow/train/slim/code/

## UAI Example
We make the following modifications to train\_image\_classifier.py:

Add the UAI SDK argument parser: --data\_dir, --output\_dir, --work\_dir, --log\_dir, --num\_gpus. These arguments are auto-generated by the UAI Train Platform and are defined in https://github.com/ucloud/uai-sdk/blob/master/uaitrain/arch/tensorflow/uflag.py.

```python
# L16 in code/train_image_classifier.py
# import uflag from UAI-SDK
from preprocessing import preprocessing_factory
from uaitrain.arch.tensorflow import uflag
from uaitrain.arch.tensorflow import uai_dist
```

```python
# L486 in code/train_image_classifier.py
# Translate UAI args to Slim args:
#   FLAGS.data_dir for the data input path
#   FLAGS.output_dir for checkpoints and event logs
#   FLAGS.num_gpus for the number of GPUs
def main(_):
  FLAGS.dataset_dir = FLAGS.data_dir
  FLAGS.train_dir = FLAGS.output_dir
  FLAGS.num_clones = FLAGS.num_gpus
```

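The flag translation above can be sketched in isolation with plain argparse. This is a hypothetical stand-in mirroring the flag names, not the real uflag.py implementation, and the default values shown are assumptions:

```python
# Minimal sketch of the UAI platform arguments and the translation step
# shown above. Plain-argparse stand-in, not the real UAI SDK uflag.py;
# the default paths here are illustrative assumptions.
import argparse

def build_uai_parser():
    parser = argparse.ArgumentParser()
    # These flags are auto-generated by the UAI Train Platform.
    parser.add_argument('--data_dir', type=str, default='/data/data')
    parser.add_argument('--output_dir', type=str, default='/data/output')
    parser.add_argument('--work_dir', type=str, default='/data')
    parser.add_argument('--log_dir', type=str, default='/data/output')
    parser.add_argument('--num_gpus', type=int, default=1)
    return parser

def translate_to_slim(flags):
    """Map UAI flag names onto the names Slim's trainer expects."""
    return {'dataset_dir': flags.data_dir,
            'train_dir': flags.output_dir,
            'num_clones': flags.num_gpus}

flags = build_uai_parser().parse_args(
    ['--data_dir=/data/data/', '--output_dir=/data/output/', '--num_gpus=2'])
print(translate_to_slim(flags))
```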
### Packing TF-Slim code
After preparing the code (by replacing train_image_classifier.py), you can pack the tf-slim docker image. We provide slim.Dockerfile:

```dockerfile
FROM uhub.service.ucloud.cn/uaishare/gpu_uaitrain_ubuntu-16.04_python-2.7.6_tensorflow_models:v1.8.0

COPY ./slim/ /data/
```

It uses the base image provided by UAI (tensorflow 1.6 with tf-models 1.8.0) and copies the code into /data/.

Once you have the slim code, you can do the following steps to build your own UAI slim image:

```
$ cd ${SLIM_CODE_DIR}
$ ls slim/
BUILD datasets/ deployment/ ... train_image_classifier.py WORKSPACE
$ cp ${UAI-SLIM_EXAMPLE}/slim.Dockerfile .
$ cp ${UAI-SLIM_EXAMPLE}/train_image_classifier.py ./slim/
$ sudo docker build -t slim:test -f slim.Dockerfile .
```

### Run Single Node Training
You can use the docker image slim:test to run slim training locally with the following command:

```
$ sudo nvidia-docker run -v /data/imagenet/tf-record/:/data/data/ -v /data/output/slim/:/data/output -it slim:test /bin/bash -c "python /data/train_image_classifier.py --data_dir=/data/data/ --output_dir=/data/output/ --num_gpus=1 --model_name=vgg_19"
```

You can also run this docker image on the UAI Train Platform. For more details please see https://docs.ucloud.cn/ai/uai-train.

### Run Distributed Training
The Slim code also supports distributed training. It requires a distributed training environment and a dist config. We also provide the dist-training example code in code/train\_image\_classifier.py. It accepts the standard TensorFlow dist config from TF\_CONFIG. (To learn more about distributed training please refer to https://www.tensorflow.org/deploy/distributed.)

A standard TF\_CONFIG (compatible with the tf.estimator.Estimator API) looks like this:

```python
TF_CONFIG = {
    "cluster": {
        "master": ["ip0:2222"],
        "ps": ["ip0:2223", "ip1:2223"],
        "worker": ["ip1:2222"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud"
}
```

We add a function \_get\_variables\_to\_train into code/train\_image\_classifier.py to parse the TF\_CONFIG and start the ps server and worker server.

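The parsing step can be sketched with the standard library alone. The helper name below is hypothetical; the real parser ships in uaitrain/arch/tensorflow/uai_dist.py:

```python
# Sketch of parsing the TF_CONFIG shown above to decide this node's role.
# Hypothetical helper name; the real parsing lives in uai_dist.py.
import json
import os

def parse_tf_config(env=None):
    """Return (cluster_dict, job_name, task_index) from TF_CONFIG."""
    env = env if env is not None else os.environ
    config = json.loads(env.get('TF_CONFIG', '{}'))
    cluster = config.get('cluster', {})
    task = config.get('task', {})
    return cluster, task.get('type'), int(task.get('index', 0))

example = json.dumps({
    "cluster": {"master": ["ip0:2222"],
                "ps": ["ip0:2223", "ip1:2223"],
                "worker": ["ip1:2222"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud"})

cluster, job, idx = parse_tf_config({'TF_CONFIG': example})
# A ps node would block in server.join(); a worker runs the training loop.
print(job, idx, len(cluster['ps']))
```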
We also modify deployment/model\_deploy.py to control the device assignment function for the clone devices of DeploymentConfig:

```python
# Line 578 in code/deployment/model_deploy.py
# add task id for each replica
if self._num_ps_tasks > 0:
    device = '%s/task:%d' % (device, self._replica_id)

return device
```

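The effect of this one-line change can be illustrated in isolation (a standalone sketch of the string formatting, not the full DeploymentConfig class):

```python
# Standalone illustration of the device-assignment tweak above: when ps
# tasks exist, each replica's clone device gets its task id appended, so
# variables and ops are pinned per-replica instead of shared.
def clone_device(base_device, num_ps_tasks, replica_id):
    device = base_device
    if num_ps_tasks > 0:
        device = '%s/task:%d' % (device, replica_id)
    return device

print(clone_device('/job:worker', num_ps_tasks=2, replica_id=1))  # /job:worker/task:1
print(clone_device('/job:worker', num_ps_tasks=0, replica_id=1))  # /job:worker
```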
To pack the slim-dist docker image, you should do the following steps:

```
$ cd ${SLIM_CODE_DIR}
$ ls slim/
BUILD datasets/ deployment/ ... train_image_classifier.py WORKSPACE
$ cp ${UAI-SLIM_EXAMPLE}/slim.Dockerfile .
$ cp ${UAI-SLIM_EXAMPLE}/train_image_classifier.py ./slim/
$ cp ${UAI-SLIM_EXAMPLE}/deployment/model_deploy.py ./slim/deployment
$ sudo docker build -t slim-dist:test -f slim.Dockerfile .
```

Now you can run the distributed training with the same command as local training:

```
python /data/train_image_classifier.py --data_dir=/data/data/ --output_dir=/data/output/ --num_gpus=1 --model_name=vgg_19
```

Note: you should generate a TF\_CONFIG for each node in the cluster.

Note: this example runs the training in async-replica mode.

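Generating a per-node TF\_CONFIG can be sketched as follows, using the two-machine cluster layout shown earlier (a sketch with hypothetical helper names; each emitted string would be exported as the TF\_CONFIG environment variable on its node):

```python
# Sketch: emit a TF_CONFIG string for every task in a small two-machine
# cluster matching the layout used earlier in this example.
import json

def make_tf_configs(cluster):
    """Yield (job, index, tf_config_json) for every task in the cluster."""
    for job, hosts in cluster.items():
        for index in range(len(hosts)):
            yield job, index, json.dumps({
                'cluster': cluster,
                'task': {'type': job, 'index': index},
                'environment': 'cloud'})

cluster = {'master': ['ip0:2222'],
           'ps': ['ip0:2223', 'ip1:2223'],
           'worker': ['ip1:2222']}

configs = list(make_tf_configs(cluster))
print(len(configs))  # one TF_CONFIG per task: 1 master + 2 ps + 1 worker = 4
```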
#### Run Distributed Training On UAI Platform
The UAI Train Platform can dynamically deploy the training cluster and generate the TF\_CONFIG for each training node. You only need to run the training command as:

```
/data/train_image_classifier.py --batch_size=64 --model_name=vgg_19
```

For more details please see https://docs.ucloud.cn/ai/uai-train.