Investigate Deepspeed/HuggingFace slowness in finetuner #171
Comments
@harubaru Where are we with this? Did the performance reporting you did yield any insight?
The investigation I have done has mainly revolved around using different ZeRO stages and trying out different hyperparameters. Different optimizers could not be used because the base Torch image for the trainer lacks a proper NCCL dependency, but beyond that, here are some of the things that can most definitely improve training speed:
These are also the adjustable factors (i.e., variables that can be changed through the workflow) that affect training speed:
* The runs used to time GAS use two GPUs instead of one. They also use a different version of the finetuner that is currently being tested in #128, so those tests have to be rerun, but the recommendations above for improving training speed should hold regardless. For future work, we should definitely look into whether we can use a different optimizer, as CPU AdamW has a ridiculously high performance overhead. We could also try incorporating flash-attention and using fused kernels for the optimizers, which would decrease memory usage further; the former requires a lot of monkey patching, while the latter would need more investigation, since DeepSpeed does support fused Adam out of the box (see the config sketch below).
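A minimal sketch of what switching from the CPU-offloaded optimizer to DeepSpeed's built-in fused Adam could look like, written as DeepSpeed config dicts. The hyperparameters (micro-batch size, GAS, learning rate) are placeholders for illustration, not the finetuner's actual values:

```python
# Sketch only: contrasts the CPU-offloaded optimizer path with keeping the
# optimizer states on GPU, where DeepSpeed builds its fused Adam/AdamW kernel.

ds_config_cpu_offload = {
    "train_micro_batch_size_per_gpu": 1,          # placeholder
    "gradient_accumulation_steps": 8,             # GAS: fewer optimizer steps per sample
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},   # forces DeepSpeedCPUAdam -> high overhead
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-6},                   # placeholder
    },
    "fp16": {"enabled": True},
}

ds_config_fused_adam = {
    "train_micro_batch_size_per_gpu": 1,          # placeholder
    "gradient_accumulation_steps": 8,
    "zero_optimization": {
        "stage": 2,                               # no optimizer offload: states stay on GPU
    },
    # Without offload, DeepSpeed uses its fused Adam/AdamW kernel on GPU,
    # avoiding the CPU AdamW overhead at the cost of extra GPU memory.
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-6},                   # placeholder
    },
    "fp16": {"enabled": True},
}
```

With the HuggingFace Trainer, either dict can be passed through `TrainingArguments(deepspeed=...)`. The trade-off is that keeping the optimizer states on GPU costs memory, which is presumably why the offload was enabled in the first place.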
Marking this as done, as the investigation is complete. Should we write an issue for using a different optimizer?
DeepSpeed and HuggingFace appear to be slowing training down significantly. We should investigate why -- it may be the optimizer states.
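If the optimizer states are the suspect, one low-effort way to confirm it would be to turn on DeepSpeed's built-in timers, which print a per-step breakdown of forward, backward, and optimizer-step time. This is only a sketch with placeholder values, not the finetuner's real configuration:

```python
# Sketch: enable DeepSpeed's timing output to see whether the optimizer step
# (e.g. CPU AdamW / offloaded optimizer states) dominates each training step.

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},

    # Prints per-step timing for forward, backward, and optimizer step.
    "wall_clock_breakdown": True,

    # Optional: per-module FLOPs/latency breakdown around a chosen step.
    "flops_profiler": {
        "enabled": True,
        "profile_step": 10,
        "module_depth": -1,
        "top_modules": 3,
        "detailed": True,
    },
}
```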