
Adding GPU automatic mixed precision training #200

Open
wants to merge 2 commits into master
Conversation

@vinhngx commented Aug 1, 2019

Automatic Mixed Precision (AMP) training on GPU for TensorFlow was recently introduced:

https://medium.com/tensorflow/automatic-mixed-precision-in-tensorflow-for-faster-ai-training-on-nvidia-gpus-6033234b2540

Automatic mixed precision training makes use of both FP32 and FP16 precision where appropriate. FP16 operations can leverage the Tensor Cores on NVIDIA GPUs (Volta, Turing, or newer architectures) for much improved throughput.

This PR adds GPU automatic mixed precision training to the XLNet classification fine-tuning task by setting the flag --gpu_auto_mixed_precision=True.

Here's a Colab notebook demonstrating the use of automatic mixed precision on the XLNet classifier fine-tuning task:
https://colab.research.google.com/drive/1dPX1L8MAHvLxBAcwgRLHmh9TdIB3bQjx

On a V100 GPU, this results in roughly a 50% increase in throughput.
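For reference, here is a minimal sketch of how AMP is typically enabled in TF 1.14+, by wrapping the training optimizer with TensorFlow's mixed precision graph rewrite. The function name `build_train_op`, the `use_amp` boolean, and the plain Adam optimizer are illustrative assumptions, not the PR's actual flag plumbing:

```python
import tensorflow as tf  # TF 1.14+ with a CUDA build


def build_train_op(loss, learning_rate, use_amp):
    """Illustrative sketch: optionally wrap the optimizer with TensorFlow's
    automatic mixed precision graph rewrite."""
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    if use_amp:
        # Rewrites the graph so that eligible ops run in float16 and adds
        # dynamic loss scaling to keep small gradient values representable.
        optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(
            optimizer, loss_scale='dynamic')
    return optimizer.minimize(
        loss, global_step=tf.train.get_or_create_global_step())
```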

How mixed precision works

Mixed precision is the use of both float16 and float32 data types when training a model.

Performing arithmetic operations in float16 takes advantage of the performance gains of specialized processing units such as the Tensor Cores on NVIDIA GPUs. However, due to the smaller representable range of float16, performing the entire training in float16 can result in gradient underflow, leading to convergence or model quality issues.

By contrast, performing only select arithmetic operations in float16 yields the performance gains of compatible hardware accelerators, decreasing training time and reducing memory usage, typically without sacrificing model accuracy.
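To make the underflow point concrete, here is a standalone NumPy illustration (not part of the PR) of a small gradient value vanishing in float16 and being rescued by loss scaling:

```python
import numpy as np

grad = np.float32(1e-8)            # a small but meaningful gradient value
print(np.float16(grad))            # 0.0 -- underflows; smallest float16 subnormal is ~6e-8

scale = np.float32(2.0 ** 15)      # loss scaling: scaling the loss scales the gradients too
scaled = np.float16(grad * scale)  # ~3.28e-4, comfortably representable in float16
print(np.float32(scaled) / scale)  # ~1e-8 recovered after unscaling in float32
```

Dynamic loss scaling, as used by TensorFlow's AMP graph rewrite, adjusts this scale factor automatically during training.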

To learn more about mixed precision and how it works:

Overview of Automatic Mixed Precision for Deep Learning
NVIDIA Mixed Precision Training Documentation
NVIDIA Deep Learning Performance Guide

@vinhngx (Author) commented Aug 27, 2019

Would you mind reviewing this PR @zihangdai?

@LifeIsStrange commented

@vinhngx I find it deeply sad that the number one state-of-the-art transformer model is in a state of abandonware.

How many downstream papers did not benefit from FP16 (and other community improvements) because of this?

Do you know of any XLNet reimplementation that offers mixed precision support?
Maybe the Hugging Face one? Or the TensorFlow one?

Anyway, my point about abandonware is more general and more impactful than that.
I see that you work at Nvidia. It is in Nvidia's interest that enterprises use SOTA systems in production (just as they use CUDA). But a lot of enterprises will not use machine learning where they could, because many tasks require decent accuracy (meaning those companies' use cases require putting the state of the art into production).

Nowadays the state of the art is easily discoverable on NLP-progress or on paperswithcode.com,
BUT so many state-of-the-art codebases are in an abandoned, unmaintained state!

If Nvidia could, in a friendly manner, inject ~0.1% of its AI workforce into forking and actually maintaining SOTA engineering solutions to ML problems, a lot of enterprises could easily and confidently begin to use AI (and therefore Nvidia GPUs) for tasks where it is already underused, or for new tasks.

The number of stars on https://github.com/huggingface/transformers just reflects how important a need this is. But the library only addresses general pre-training.

As an example, I tried to use state-of-the-art coreference resolution for my semantic parser:
https://github.com/sebastianruder/NLP-progress/blob/master/english/coreference_resolution.md
I tried ALL OF THEM, and they are all abandonware: their dependencies are broken and their API / usage is cryptic. And they probably have a number of low-hanging-fruit improvements that would be really nice to have.
I couldn't make any of them work! The state of NLP/NLU is a joke and proof of how much it is not industry/production ready.

Nvidia must understand that it is in its best interest to maintain the most important state-of-the-art libs. By doing so, its reputation would shine, and so would AI usage in enterprises, and with it GPU sales.

In an ideal world, Nvidia would build a framework like spaCy / CoreNLP / Stanza, but with the big difference of being always (mostly) up to date with the state of the art, reusing the same API and benefiting from SOTA upgrades without breaking changes for downstream users (as the Hugging Face Transformers generic API achieves).

But there is no need for this ideal world: if Nvidia could just maintain the gist of the SOTA (in N libs rather than a unified framework), that would still be the breakthrough of the decade in NLP enterprise-friendliness.
It would be a game changer for humanity.

What do you think about this, @vinhngx?
I would like you to share this idea and this need with other Nvidia AI engineers :)
