# pd: fix learning rate setting when resume (#4480)
"When resuming training, there is no need to add `self.start_step` to
the step count because Paddle uses `lr_sche.last_epoch` as the input for
`step`, which already records the `start_step` steps."
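
A minimal sketch of the point being quoted (not the repository code; `warm_up_linear` below is a simplified stand-in for the helper in `training.py`, and the step counts are illustrative): Paddle's `LambdaDecay` feeds its own `last_epoch` counter to `lr_lambda`, so after a checkpoint restore that counter already equals the global step.

```python
# Simplified illustration, not the DeePMD-kit code. `warm_up_linear` here is
# a stripped-down stand-in for the helper defined in training.py.
import paddle


def warm_up_linear(step, warmup_steps):
    # Linear warmup factor in [0, 1]; the real helper also applies decay afterwards.
    return min(step / warmup_steps, 1.0)


warmup_steps = 100
scheduler = paddle.optimizer.lr.LambdaDecay(
    learning_rate=1e-3,
    lr_lambda=lambda step: warm_up_linear(step, warmup_steps),
)

for _ in range(40):            # pretend 40 optimizer steps ran before a checkpoint
    scheduler.step()
ckpt = scheduler.state_dict()  # includes 'last_epoch'

# A fresh scheduler restored from the checkpoint resumes the counter, so the
# `step` argument handed to lr_lambda is already the global step count;
# adding self.start_step on top would count those 40 steps twice.
resumed = paddle.optimizer.lr.LambdaDecay(
    learning_rate=1e-3,
    lr_lambda=lambda step: warm_up_linear(step, warmup_steps),
)
resumed.set_state_dict(ckpt)
print(resumed.last_epoch)      # same step counter as before the restart
```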

The learning rate values are correct after the fix, as shown below:


![22AD6874B74E437E9B133D75ABCC02FE](https://github.com/user-attachments/assets/1ad0ce71-6e1c-4de5-87dc-0daca1f6f038)



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
  - Enhanced training process with improved optimizer configuration and learning rate adjustments.
  - Refined logging of training and validation results for clarity.
  - Improved model saving logic to preserve the latest state during interruptions.
  - Enhanced tensorboard logging for detailed tracking of training metrics.

- **Bug Fixes**
  - Corrected lambda function for learning rate scheduler to reference warmup steps accurately.

- **Chores**
  - Streamlined data loading and handling for efficient training across different tasks.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
HydrogenSulfate authored Dec 20, 2024
1 parent c0914e1 commit c24498b
Showing 1 changed file with 2 additions and 3 deletions: `deepmd/pd/train/training.py`
```diff
@@ -588,15 +588,14 @@ def warm_up_linear(step, warmup_steps):
         if self.opt_type == "Adam":
             self.scheduler = paddle.optimizer.lr.LambdaDecay(
                 learning_rate=self.lr_exp.start_lr,
-                lr_lambda=lambda step: warm_up_linear(
-                    step + self.start_step, self.warmup_steps
-                ),
+                lr_lambda=lambda step: warm_up_linear(step, self.warmup_steps),
             )
             self.optimizer = paddle.optimizer.Adam(
                 learning_rate=self.scheduler, parameters=self.wrapper.parameters()
             )
             if optimizer_state_dict is not None and self.restart_training:
                 self.optimizer.set_state_dict(optimizer_state_dict)
+                self.scheduler.last_epoch -= 1
         else:
             raise ValueError(f"Not supported optimizer type '{self.opt_type}'")
 
```
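
For additional context on the two changed call sites, here is a hedged, self-contained sketch of the resume path (illustrative names such as `make_opt` and `ckpt_opt_state` are not from the repository), assuming Paddle 2.x behaviour where an optimizer built on an `LRScheduler` carries that scheduler's state in its own state dict:

```python
# Illustrative resume flow (not the repository code). Assumes Paddle 2.x,
# where Adam.state_dict()/set_state_dict() also carry the attached scheduler.
import paddle

net = paddle.nn.Linear(4, 4)
warmup_steps = 100


def make_opt():
    # Build a warmup scheduler and an Adam optimizer driven by it.
    sched = paddle.optimizer.lr.LambdaDecay(
        learning_rate=1e-3,
        lr_lambda=lambda step: min(step / warmup_steps, 1.0),
    )
    opt = paddle.optimizer.Adam(learning_rate=sched, parameters=net.parameters())
    return sched, opt


sched, opt = make_opt()
for _ in range(40):                  # pretend 40 steps ran before the checkpoint
    sched.step()
ckpt_opt_state = opt.state_dict()    # hypothetical checkpoint payload

# --- resume ---
sched2, opt2 = make_opt()
opt2.set_state_dict(ckpt_opt_state)  # also restores sched2.last_epoch
sched2.last_epoch -= 1               # the commit's adjustment of the restored counter
```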

