Perf: use fused Adam optimizer #4463
Conversation
📝 **Walkthrough**: The pull request modifies the optimizer initialization in the `Trainer` class (`deepmd/pt/train/training.py`).
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
deepmd/pt/train/training.py (1)
`582-582`: **Great performance optimization!**

The addition of `fused=True` to the Adam optimizer is a valuable performance improvement, showing significant speedups in both optimizer update time (2.75x) and total training time (~3%). However, since fused Adam is only available for CUDA tensors, we should add a CUDA availability check. Apply this diff:
```diff
-            self.wrapper.parameters(), lr=self.lr_exp.start_lr, fused=True
+            self.wrapper.parameters(),
+            lr=self.lr_exp.start_lr,
+            fused=torch.cuda.is_available(),  # Enable fused Adam only when CUDA is available
```
Running on the CPU is still possible with this parameter set.
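For reference, a minimal self-contained sketch of the guarded initialization under discussion; the model and learning rate here are illustrative stand-ins, not the actual deepmd-kit `Trainer` internals:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the wrapped model and start learning rate
model = nn.Linear(8, 1)
start_lr = 1e-3

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=start_lr,
    # Enable the fused kernel only when CUDA is available, as the review
    # suggests; otherwise fall back to PyTorch's default implementation
    fused=torch.cuda.is_available(),
)
```

On a CPU-only machine this resolves to `fused=False`, so training still runs either way.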
**Codecov Report**: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

```
@@            Coverage Diff             @@
##            devel    #4463      +/-   ##
==========================================
- Coverage   83.75%   83.75%   -0.01%
==========================================
  Files         667      667
  Lines       61513    61515       +2
  Branches     3486     3486
==========================================
- Hits        51523    51521       -2
- Misses       8865     8867       +2
- Partials     1125     1127       +2
```

☔ View full report in Codecov by Sentry.
It might be misleading to benchmark the time on the CPU side, since the GPU executes asynchronously and is the bottleneck.
@njzjz What you've mentioned is the general case for the NN itself. However, for the optimizer update in deepmd-kit, the bottleneck is on the CPU side. You can refer to the "GPU SM efficiency" row of the profiling plot.
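As an aside, a minimal sketch of how such a per-step breakdown can be collected with PyTorch's built-in profiler, which accounts for asynchronous GPU execution; the model here is a toy stand-in, and a CUDA device is assumed:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
inputs = torch.randn(64, 1024, device="cuda")

# Record one forward/backward/optimizer step on both CPU and CUDA
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(inputs).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# CUDA totals reflect actual kernel time, independent of async launch cost
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```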
This PR sets the Adam optimizer to use the `fused=True` parameter. For the profiling results shown below, this modification brings a 2.75x improvement in optimizer update time (22 ms vs. 8 ms) and a ~3% total speedup (922 ms vs. 892 ms). The benchmark case is training a DPA-2 Q3 release model. Please note that the absolute time may differ between steps.

<details><summary>Before</summary>
<p>

![image](https://github.com/user-attachments/assets/d6b05a1d-6e6c-478d-921f-c497718bc551)

</p>
</details>

<details><summary>After</summary>
<p>

![image](https://github.com/user-attachments/assets/b216b919-094c-441f-96a7-146e1e3db483)

</p>
</details>

[Ref](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html):

> The foreach and fused implementations are typically faster than the for-loop, single-tensor implementation, with **fused being theoretically fastest** with both vertical and horizontal fusion. As such, if the user has not specified either flag (i.e., when foreach = fused = None), we will attempt defaulting to the foreach implementation when the tensors are all on CUDA. Why not fused? Since the fused implementation is relatively new, we want to give it sufficient bake-in time.

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved optimizer performance during training by modifying the initialization of the Adam optimizer.
- **Documentation**
  - Updated method signature for clarity in the `Trainer` class.

(cherry picked from commit 104fc36)
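For readers who want to reproduce this kind of comparison on a toy workload, here is a minimal wall-clock sketch with explicit synchronization; the parameter shapes and step count are arbitrary, and a CUDA device is assumed:

```python
import time
import torch

def adam_step_time(fused: bool, steps: int = 100) -> float:
    # Toy parameter tensors standing in for a real model's weights
    params = [torch.randn(1024, 1024, device="cuda", requires_grad=True)
              for _ in range(8)]
    opt = torch.optim.Adam(params, lr=1e-3, fused=fused)
    for p in params:
        p.grad = torch.randn_like(p)
    torch.cuda.synchronize()  # drain pending work before starting the clock
    start = time.perf_counter()
    for _ in range(steps):
        opt.step()
    torch.cuda.synchronize()  # wait for queued kernels before stopping it
    return (time.perf_counter() - start) / steps

print(f"default Adam: {adam_step_time(False) * 1e3:.3f} ms/step")
print(f"fused Adam:   {adam_step_time(True) * 1e3:.3f} ms/step")
```

Without the `torch.cuda.synchronize()` calls, the measured time would only cover kernel launches, which is the CPU-vs-GPU timing pitfall discussed above.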