how to train a bert model with distributed training ? #39

Closed
MarvinLong opened this issue May 8, 2019 · 5 comments

Comments

@MarvinLong

No description provided.

@MarvinLong MarvinLong reopened this May 8, 2019
@MarvinLong
Author

I saw that a BERT model was trained on 64 GPUs in 3.3 days in this article:
https://medium.com/future-vision/bert-meets-gpus-403d3fbed848?fbclid=IwAR0bFskUVVKDRyYF-9cQGgRXeq7dTvteGHi10HaTG5zI7_eE8oW-BfrxYQw

Was that done with this repo, or with the PyTorch version in https://github.com/NVIDIA/Megatron-LM?
Can you help me train a BERT model with distributed training? Thanks.

@MarvinLong MarvinLong changed the title from "how" to "how to train a bert model with distributed training ?" May 8, 2019
@swethmandava
Contributor

We have currently only published scripts for single-node training. Stay tuned for distributed multi-node training scripts; we will publish them soon.

@LifeIsStrange

LifeIsStrange commented Aug 20, 2019

@swethmandava
Why doesn't Megatron allow us to open issues?
For example, it would be nice if it supported
https://github.com/zihangdai/xlnet
which is the new state of the art (it consistently beats BERT), as you can see on paperswithcode.com.
XLNet also does not support multi-GPU training yet: zihangdai/xlnet#218

(It would be nice to support ERNIE 2.0 too, but that is less of a priority.)

@LifeIsStrange

BTW, NVIDIA is already contributing to XLNet, e.g. this NVIDIA employee:
zihangdai/xlnet#200
So let's be consistent.

@swethmandava
Contributor

Multi-node training is now supported as of #208.

@LifeIsStrange, you can also open issues on Megatron-LM now. Thanks for contributing!

roywei pushed a commit to roywei/DeepLearningExamples that referenced this issue Apr 8, 2021