How to train a BERT model with distributed training? #39

MarvinLong · 2019-05-08T09:04:38Z

No description provided.

MarvinLong · 2019-05-08T09:11:28Z

I read in this article that a BERT model was trained on 64 GPUs in 3.3 days:
https://medium.com/future-vision/bert-meets-gpus-403d3fbed848?fbclid=IwAR0bFskUVVKDRyYF-9cQGgRXeq7dTvteGHi10HaTG5zI7_eE8oW-BfrxYQw

Was that done with this repo, or with the PyTorch version at https://github.com/NVIDIA/Megatron-LM?
Could you help me train a BERT model with distributed training? Thanks.

swethmandava · 2019-05-15T16:18:43Z

So far we have published scripts only for single-node training. Stay tuned for distributed multi-node training scripts; we will publish them soon.

LifeIsStrange · 2019-08-20T15:58:22Z

@swethmandava
Why doesn't Megatron-LM allow us to open issues?
For example, it would be nice if it supported XLNet:
https://github.com/zihangdai/xlnet
XLNet is the new state of the art (it consistently beats BERT), as you can see on paperswithcode.com.
XLNet also does not yet support multi-GPU training: zihangdai/xlnet#218

(It would be nice to support ERNIE 2.0 too, but that is less of a priority.)

LifeIsStrange · 2019-08-20T15:59:12Z

BTW, NVIDIA is already contributing to XLNet, e.g. this NVIDIA employee:
zihangdai/xlnet#200
So let's be consistent.

swethmandava · 2019-09-19T20:04:45Z

Multi-node training is now supported as of #208.

@LifeIsStrange you can also open issues on Megatron-LM now. Thanks for contributing!

MarvinLong closed this as completed May 8, 2019

MarvinLong reopened this May 8, 2019

MarvinLong changed the title ~~how~~ how to train a bert model with distributed training ? May 8, 2019

MarvinLong closed this as completed May 22, 2019

roywei pushed a commit to roywei/DeepLearningExamples that referenced this issue Apr 8, 2021

added tf 2.4 script changes to MaskRCNN (NVIDIA#39)

1d13295
