Adding GPU automatic mixed precision training #200
Automatic Mixed Precision (AMP) training on GPU for TensorFlow was recently introduced:
https://medium.com/tensorflow/automatic-mixed-precision-in-tensorflow-for-faster-ai-training-on-nvidia-gpus-6033234b2540
Automatic mixed precision training uses both FP32 and FP16 precisions where appropriate. FP16 operations can leverage the Tensor Cores on NVIDIA GPUs (Volta, Turing, or newer architectures) for much improved throughput.
This PR adds GPU automatic mixed precision training to the xlnet classification fine-tuning task, enabled by setting the flag --gpu_auto_mixed_precision=True. Here's a Colab notebook demonstrating automatic mixed precision on the xlnet classifier fine-tuning task: https://colab.research.google.com/drive/1dPX1L8MAHvLxBAcwgRLHmh9TdIB3bQjx
On a V100 GPU, this results in about 50% higher throughput.
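For reference, here is a minimal sketch of how a flag like this is typically wired into TensorFlow 1.x training code. The function name and flag plumbing below are illustrative assumptions rather than the exact code in this PR; the graph-rewrite call itself is the standard TensorFlow entry point for automatic mixed precision.

```python
import tensorflow as tf

def get_train_op(loss, learning_rate, gpu_auto_mixed_precision=False):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    if gpu_auto_mixed_precision:
        # Wrap the optimizer so TensorFlow rewrites the graph to run
        # eligible ops in FP16, with automatic (dynamic) loss scaling.
        optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(
            optimizer)
    return optimizer.minimize(
        loss, global_step=tf.train.get_or_create_global_step())
```

The same rewrite can also be enabled without code changes by setting the environment variable TF_ENABLE_AUTO_MIXED_PRECISION=1 on a supported TensorFlow build (1.14+ or NVIDIA's NGC containers).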
How mixed precision works
Mixed precision is the use of both float16 and float32 data types when training a model.
Performing arithmetic operations in float16 takes advantage of specialized processing units such as the Tensor Cores on NVIDIA GPUs. Due to the smaller representable range of float16, however, performing all of training in the float16 data type can result in gradient underflow, leading to convergence or model quality issues.
Performing only select arithmetic operations in float16, while keeping numerically sensitive ones in float32, still captures the performance gains of compatible hardware accelerators, decreasing training time and reducing memory usage, typically without sacrificing model performance.
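As a quick standalone illustration of the underflow concern (not code from this PR), a gradient value that float32 represents comfortably simply rounds to zero in float16:

```python
import numpy as np

# A small gradient value that float32 can represent.
grad_fp32 = np.float32(1e-8)
print(grad_fp32)   # 1e-08

# Cast to float16: the smallest positive float16 value is roughly 6e-08,
# so this gradient underflows to zero.
grad_fp16 = np.float16(grad_fp32)
print(grad_fp16)   # 0.0
```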
To learn more about mixed precision and how it works:
- Overview of Automatic Mixed Precision for Deep Learning
- NVIDIA Mixed Precision Training Documentation
- NVIDIA Deep Learning Performance Guide