Investigate Batch size scaling in DDP setups #60

Open · Delaunay opened this issue Feb 16, 2023 · 0 comments

Delaunay (Collaborator) commented Feb 16, 2023

The CI now has 2 GPUs, so we have full coverage of the planning methods.
Here is the comparison between 1 and 2 GPUs.

vit_l_32: performance decreases (100.73 → 51.80, roughly halved); is the model too small? (relies on NVLink)
resnet152: performance stays roughly the same (87.15 → 96.54, ~1.1x instead of the ~2x expected from linear scaling); the batch size is probably not scaled to use both GPUs. See the sketch after the table.

bench                Plan          metric    1 GPU   2 GPUs
-----------------------------------------------------------
hf_t5                         train_rate     2.28     2.22
bert                          train_rate    18.64    17.85
learning_to_paint             train_rate   753.86   802.15
efficientnet_b4               train_rate    27.40    27.23
convnext_large                train_rate     2.11     2.10
ppo                  DDP      train_rate   918.81   834.84
resnet50                      train_rate    38.66    36.31
hf_reformer                   train_rate     3.86     3.71
soft_actor_critic             train_rate 12553.41 12258.97
super_slomo                   train_rate     1.35     1.29
dlrm                          train_rate 37923.96 38091.68
efficientnet_b0               train_rate    89.85    85.08
regnet_y_128gf                train_rate     1.23     1.26
td3                           train_rate 14678.08 13971.88
squeezenet1_1                 train_rate   173.54   163.83
vit_l_32             DDP      train_rate   100.73    51.80
resnet152            DDP      train_rate    87.15    96.54
stargan                       train_rate    26.83    24.76
efficientnet_b7               train_rate    11.41    11.18
speech_transformer            train_rate    35.99    35.87
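For reference, a minimal sketch of what batch-size scaling under DDP usually looks like. This assumes a standard torch.distributed setup; the helper names (`make_loader`, `global_train_rate`) are hypothetical, not milabench's actual code. If either the per-rank batch is not set this way, or the reported train_rate only counts one rank's samples, a flat number like resnet152's is what you would expect:

```python
# Hypothetical sketch of batch-size scaling in a standard PyTorch DDP setup.
# make_loader and global_train_rate are illustrative, not milabench's API.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler


def make_loader(dataset, per_device_batch_size):
    # In DDP, batch_size is per rank: the effective (global) batch is
    # per_device_batch_size * world_size, so adding a second GPU should
    # roughly double the samples processed per step.
    sampler = DistributedSampler(dataset)
    return DataLoader(dataset, batch_size=per_device_batch_size, sampler=sampler)


def global_train_rate(local_samples, elapsed_seconds):
    # If the metric only counts one rank's samples, 2 GPUs look no faster
    # than 1. Summing across ranks reports the true aggregate throughput.
    total = torch.tensor([float(local_samples)], dtype=torch.float64)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    return total.item() / elapsed_seconds
```

Either cause (per-rank batch not scaled with world size, or a per-rank rate reported instead of the aggregate) would produce a flat train_rate even when both GPUs are busy.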