Error: MPMD detected but reload is not supported yet for neuron distributed environment with EAGER DEBUG MODE #21
Comments
Thank you for reporting the issue.
Thanks for the detailed explanation! I'd also like to know: will this error interrupt the generation of the HLO graphs?
Only one graph can be executed at a time. Since the run errored out at this graph, you can only generate the graphs up to this point. If you want to generate all the graphs without worrying about execution, you can run with neuron_parallel_compile. The utility should help extract all the HLOs, compile them, and save them in the cache.
We have managed to reproduce the issue. It happens only in eager mode. There seems to be a bug that causes two collectives with different replica groups to be part of the same graph, so that this graph now has to communicate with two graphs at the same time. If you look at one of the graphs, it produces the following collectives in the same graph:
As you can see, it is trying to send and receive a tensor at the same time. Ideally, these should be in two separate graphs. We will look into this issue and update this ticket when we have a fix.
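To make the "two separate graphs" point concrete, here is a minimal Python sketch that cuts a sequence of collectives into a new graph whenever the replica group changes. It is only an illustration of the idea and not how the Neuron or XLA stack actually partitions graphs. The op names, the rank numbers, and the helper cut_at_group_change are all hypothetical.

```python
def cut_at_group_change(collectives):
    """Split an ordered list of (op_name, replica_groups) pairs into graphs.

    A new graph is started whenever the replica groups differ from those of
    the previous op, so no graph mixes two different replica groups.
    Illustrative only: this is not the Neuron/XLA partitioning logic.
    """
    graphs = []
    previous_key = None
    for op_name, replica_groups in collectives:
        # Normalize nested lists into tuples so they can be compared.
        key = tuple(tuple(group) for group in replica_groups)
        if key != previous_key:
            graphs.append([])
            previous_key = key
        graphs[-1].append(op_name)
    return graphs


# Hypothetical collectives seen on one worker of a middle pipeline stage:
# it receives from the previous stage and sends to the next stage.
mixed_graph = [
    ("recv_from_prev_stage", [[0, 8]]),
    ("send_to_next_stage", [[8, 16]]),
]

separated = cut_at_group_change(mixed_graph)
print(separated)
```

With the hypothetical input above, the receive and the send end up in two different graphs, which is the separation described in the comment.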
@wfckl789, just checking in: were you able to make forward progress?
In this case, as far as I observed, the cc compiler raised this compilation fault before any forward progress was made. So I think the run did not make forward progress, because I never saw the loss value from the first epoch.
Is there any particular reason for trying eager mode in the multi-worker case? Note: Eager debug mode is only for debugging and is not the most performant mode of execution.
Hi, I found that the error "MPMD detected but reload is not supported yet" occurs when I enable Eager Debug Mode for a model trained in a neuron distributed environment where dp=1, tp=8, pp=4. Could you help check this issue? Thanks so much!
I have attached the related scripts here (scripts.zip). After downloading them, you can simply run ./run_simple_model_tp_pp.sh
Environment information:
EC2 Instance: trn1.32xlarge
OS: Ubuntu 20.04
Neuron Pytorch: Latest 2.18
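
For reference, the configuration dp=1, tp=8, pp=4 implies 1 x 8 x 4 = 32 workers, which matches the 32 NeuronCores of a trn1.32xlarge. The sketch below works out the worker count and one possible set of replica groups. The rank layout (tensor-parallel ranks contiguous, pipeline stages strided by dp*tp) is an assumption made for illustration and may differ from what the library actually uses; parallel_layout is a hypothetical helper.

```python
def parallel_layout(dp, tp, pp):
    """Return (world_size, tp_groups, pp_groups) for an ASSUMED rank layout.

    Assumption: rank = stage * (dp * tp) + dp_index * tp + tp_index.
    The real layout used by the Neuron libraries may differ.
    """
    world_size = dp * tp * pp
    # Tensor-parallel groups: consecutive blocks of tp ranks.
    tp_groups = [list(range(start, start + tp))
                 for start in range(0, world_size, tp)]
    # Pipeline-parallel groups: the same offset taken from every stage.
    stage_size = dp * tp
    pp_groups = [[offset + stage * stage_size for stage in range(pp)]
                 for offset in range(stage_size)]
    return world_size, tp_groups, pp_groups


world_size, tp_groups, pp_groups = parallel_layout(dp=1, tp=8, pp=4)
print(world_size)     # 32
print(tp_groups[0])   # [0, 1, 2, 3, 4, 5, 6, 7]
print(pp_groups[0])   # [0, 8, 16, 24]
```

Under this assumed layout, a worker in a middle stage (for example rank 8) talks to rank 0 and rank 16 over pipeline links, which are two different replica groups.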