Using the latest main to train a YoloV9e object detector:
[rank0]: train_one_epoch(train_loader, model, args, model_dtype)
[rank0]: File "/mnt/dingus_drive/catid/train_detector/train.py", line 90, in train_one_epoch
[rank0]: model.step()
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step
[rank0]: self._take_model_step(lr_kwargs)
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2119, in _take_model_step
[rank0]: self.optimizer.step()
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/deepspeed/runtime/bf16_optimizer.py", line 303, in step
[rank0]: self.optimizer.step()
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank0]: return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank0]: out = func(*args, **kwargs)
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/distributed_shampoo/distributed_shampoo.py", line 1165, in step
[rank0]: ].merge_and_block_gradients()
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/distributed_shampoo/utils/shampoo_distributor.py", line 300, in merge_and_block_gradients
[rank0]: local_masked_blocked_grads = self._merge_and_block_gradients()
[rank0]: File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/distributed_shampoo/utils/shampoo_distributor.py", line 211, in _merge_and_block_gradients
[rank0]: grad.view(merged_dims), self._param_group[MAX_PRECONDITIONER_DIM]
[rank0]: RuntimeError: shape '[1728]' is invalid for input of size 7268980
Looks like there's some issue with this code when used from DeepSpeed?
Hi @catid-saronic, thanks for your interest in our code! We have not tested using our Shampoo code with DeepSpeed. For scaling up models, we have preliminary support for FSDP; however, this does require some model information.
If you're interested in getting things working with DeepSpeed, we would be happy to help. Let me know if you have any other questions.
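For anyone debugging this, the `RuntimeError` in the traceback comes down to an element-count mismatch: `grad.view(merged_dims)` asks for 1728 elements, but the gradient it received has 7268980. The plain-Python sketch below only illustrates that arithmetic; it does not call PyTorch, Shampoo, or DeepSpeed. The helper name `view_is_valid` and the example parameter shape `(64, 3, 3, 3)` are assumptions made for illustration (the shape is simply one whose dimensions multiply to 1728), and it ignores `-1` dimensions and stride constraints. One possible explanation, not confirmed in this thread, is that the optimizer is seeing a flattened gradient buffer rather than a per-parameter gradient.

```python
import math


def view_is_valid(numel: int, shape: tuple[int, ...]) -> bool:
    """Return True if a tensor with `numel` elements has the right size for `shape`.

    Hypothetical helper: mirrors only the element-count requirement of a
    reshape/view, not stride or contiguity rules.
    """
    return math.prod(shape) == numel


# Numbers taken from the RuntimeError in the traceback.
merged_dims = (1728,)
grad_numel = 7268980

# Assumed parameter shape, chosen only because its dims multiply to 1728.
assumed_param_shape = (64, 3, 3, 3)

# A gradient shaped like the assumed parameter would fit the merged dims.
print(view_is_valid(math.prod(assumed_param_shape), merged_dims))  # True

# The gradient actually seen in the traceback does not.
print(view_is_valid(grad_numel, merged_dims))  # False
```

Comparing `param.numel()` with `param.grad.numel()` for each parameter just before the optimizer step would show whether the gradients have been reshaped or merged upstream.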