🐛 about learning-rate #59
Comments
Hi, that's why we love open source!
Thank you, it is my pleasure. My email is [email protected]; if you have any questions or I can help with anything, feel free to contact me.
Hello, I studied your paper and found it very inspiring for my work. I still have a few questions I would like to ask you.
Regarding the training learning rate: as I understand your code, the loss function is the average of the losses of all experts. With four experts, each expert's loss is effectively divided by four, so the gradients computed during backpropagation are also scaled by 1/4. Does this mean the learning rate should initially be set four times larger than for a single model?
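To make the scaling concrete, here is a minimal sketch (assuming PyTorch; the four-expert setup, layer shapes, and data are hypothetical and not taken from this repository) showing that averaging the experts' losses scales each expert's gradient by 1/M relative to training that expert alone:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_experts = 4  # hypothetical ensemble size from the question
experts = nn.ModuleList([nn.Linear(10, 1) for _ in range(num_experts)])
x = torch.randn(8, 10)
y = torch.randn(8, 1)
criterion = nn.MSELoss()

# Averaged ensemble loss: L = (1/M) * sum_i L_i
avg_loss = torch.stack([criterion(e(x), y) for e in experts]).mean()
avg_loss.backward()
grad_avg = experts[0].weight.grad.clone()

# Gradient of expert 0 trained alone on its own loss L_0
experts.zero_grad()
criterion(experts[0](x), y).backward()
grad_single = experts[0].weight.grad

# Under the averaged loss, each expert's gradient is 1/M of its
# stand-alone gradient, which motivates the "M times larger
# learning rate" question above.
print(torch.allclose(num_experts * grad_avg, grad_single))  # True
```

Whether this warrants multiplying the learning rate by M depends on the optimizer: plain SGD updates scale linearly with the gradient, while adaptive optimizers such as Adam largely normalize away a constant gradient scale.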