Template scripts to set up Docker images compatible with running multi-node parallel (MNP) jobs on AWS Batch
This sample code is made available under a modified MIT license. See the LICENSE file.
These scripts build a TensorFlow reference Docker image compatible with running tightly coupled multi-node parallel (MNP) batch jobs on AWS Batch. The build platform requires nvidia-docker2 to be installed.
```bash
git clone https://github.com/aws-samples/aws-mnpbatch-template.git
cd aws-mnpbatch-template
docker build -t nvidia/mnp-batch-tensorflow .
```
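Since the build platform needs nvidia-docker2, a quick smoke test can be worth running first. The CUDA image tag below is only an example, and newer Docker releases use `--gpus all` instead of the `nvidia` runtime:

```bash
# The "nvidia" runtime should appear in the Runtimes line of docker info
docker info | grep -i runtimes

# Smoke test: run nvidia-smi inside a CUDA base container
# (example tag; substitute any CUDA image available to you)
docker run --rm --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi
```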
The Dockerfile can mostly be reused for your own application. It installs the following stack (a quick sanity check on the built image follows the list):
- Ubuntu 18.04 `nvidia/cuda` base Docker image
- APT packages for dependencies
- SSH setup
- S3 optimization
- CUDA-aware OpenMPI 4.0.0
- TensorFlow/Horovod install
- ImageNet dataset
- Supervisor Docker container startup
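Once the image is built, the checks below are a quick way to sanity-check the stack. The OpenMPI path comes from the `mpirun` invocation later in this README; the Python import assumes the Horovod TensorFlow bindings were installed, and you may need an `--entrypoint` override if the supervisor startup interferes:

```bash
# Confirm CUDA-aware OpenMPI is present at the path mpi-run.sh expects
docker run --rm nvidia/mnp-batch-tensorflow /opt/openmpi/bin/mpirun --version

# Confirm TensorFlow and Horovod import cleanly inside the image
docker run --rm nvidia/mnp-batch-tensorflow \
  python3 -c "import tensorflow as tf, horovod.tensorflow as hvd; print(tf.__version__)"
```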
Thus, to apply your own customizations and application, you only need to modify the MPI and TensorFlow layers. Custom build scripts are located in the `conf/` directory.
Finally, replace the section in `supervised-scripts/mpi-run.sh` to support the MPI startup of your custom application. The script prepares the MPI machine file; if your node contains GPUs, the `slots=` value will be set to the number of GPUs per node:
```bash
# Use the GPU count as the per-node slot count when GPUs are present;
# otherwise fall back to the number of vCPUs/cores.
if [ -x "$(command -v nvidia-smi)" ] ; then
  NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
  availablecores=$NUM_GPUS
else
  availablecores=$(nproc)
fi
```
If no GPUs are present, the slot count is instead based on the available vCPUs/cores. The deduplicated machine file is passed to `mpirun` as `${HOST_FILE_PATH}-deduped`, and any extra MPI parameters supplied at job runtime are passed in through `$EXTRA_MPI_PARAMS`.
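To illustrate how the slot count and the deduplicated file fit together, here is a minimal sketch; `NODE_IP` is a hypothetical placeholder for however the script learns each node's address, while `HOST_FILE_PATH` and `availablecores` are the variables used by `mpi-run.sh`:

```bash
# Sketch only: append each node with its slot count, then deduplicate.
# NODE_IP is a hypothetical placeholder, not a variable from the actual script.
echo "${NODE_IP} slots=${availablecores}" >> "${HOST_FILE_PATH}"
sort -u "${HOST_FILE_PATH}" > "${HOST_FILE_PATH}-deduped"
```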
```bash
wait_for_nodes () {
  ...
  # Stage the input data from S3 into the scratch directory
  aws s3 cp $S3_INPUT $SCRATCH_DIR
  #tar -xvf $SCRATCH_DIR/*.tar.gz -C $SCRATCH_DIR
  cd $SCRATCH_DIR

  export INTERFACE=eth0
  export MODEL_HOME=/root/deep-learning-models/models/resnet/tensorflow

  # Launch the Horovod/TensorFlow training across all nodes in the machine file
  /opt/openmpi/bin/mpirun --allow-run-as-root -np $MPI_GPUS --machinefile ${HOST_FILE_PATH}-deduped -mca plm_rsh_no_tree_spawn 1 \
    -bind-to socket -map-by slot \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
    -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
    -x NCCL_SOCKET_IFNAME=$INTERFACE -mca btl_tcp_if_include $INTERFACE \
    $EXTRA_MPI_PARAMS -x TF_CPP_MIN_LOG_LEVEL=0 \
    python3 -W ignore $MODEL_HOME/train_imagenet_resnet_hvd.py \
      --data_dir $JOB_DIR --num_epochs 90 -b $BATCH_SIZE \
      --lr_decay_mode poly --warmup_epochs 10 --clear_log

  sleep 2
  #tar -czvf $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz $SCRATCH_DIR/*
  #aws s3 cp $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz $S3_OUTPUT
}
```
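To adapt this for your own application, the `mpirun` command above is the part to swap out. Below is a minimal sketch, assuming your binary is baked into the image; `/usr/local/bin/my_app` and its `--input` flag are placeholders:

```bash
# Hypothetical replacement for the training command in mpi-run.sh;
# my_app and its flags stand in for your own MPI application.
/opt/openmpi/bin/mpirun --allow-run-as-root -np $MPI_GPUS \
  --machinefile ${HOST_FILE_PATH}-deduped \
  -x LD_LIBRARY_PATH -x PATH \
  $EXTRA_MPI_PARAMS \
  /usr/local/bin/my_app --input $SCRATCH_DIR
```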
Once built, you can push this Docker image to your Amazon Elastic Container Registry (ECR) repository; see the Amazon ECR documentation for the full push instructions.
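As a rough sketch, a typical ECR push sequence looks like this; the account ID, region, and repository name are placeholders, and `get-login-password` assumes AWS CLI v2:

```bash
# Placeholders: substitute your own account ID, region, and repository name
AWS_ACCOUNT_ID=123456789012
AWS_REGION=us-east-1
ECR_REPO=mnp-batch-tensorflow
ECR_URI=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

# Create the repository (one time) and authenticate Docker to ECR (AWS CLI v2)
aws ecr create-repository --repository-name $ECR_REPO --region $AWS_REGION
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin $ECR_URI

# Tag the locally built image and push it
docker tag nvidia/mnp-batch-tensorflow:latest ${ECR_URI}/${ECR_REPO}:latest
docker push ${ECR_URI}/${ECR_REPO}:latest
```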