
Error: MPMD detected but reload is not supported yet for neuron distributed environment with EAGER DEBUG MODE #21

Open
wenboqian opened this issue May 4, 2024 · 7 comments
Labels: bug (Something isn't working)

Comments

@wenboqian

Hi, I found that the error MPMD detected but reload is not supported yet occurs when I enable Eager Debug Mode for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you help check this issue? Thanks so much!

(screenshot of the error output)

I've attached the related scripts here; you can simply run ./run_simple_model_tp_pp.sh after downloading them.

scripts.zip

Environment information:

EC2 Instance: trn1.32xlarge

OS: Ubuntu 20.04

Neuron PyTorch: latest 2.18 release

@aws-rhsoln
Contributor

Thank you for reporting the issue. The "MPMD detected" error means that one worker is trying to load a new graph while another worker is waiting to perform inference, so the graphs that contain collectives are not able to communicate with each other. The underlying assumption is that each worker performs the same set of operations, i.e., all workers run in SPMD mode; that assumption is broken here, which is why you see this error.
We are looking into the scripts now to identify which worker is producing a new graph (and why), and whether we can modify the scripts so that graphs with collectives do not change from one iteration to the next. Will report back once we have an update.
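As a rough illustration of what breaks the SPMD assumption (a hypothetical sketch, not taken from your attached scripts): if one worker traces different work around a collective than its peers, the collectives can no longer pair up across workers.

```python
# Hypothetical sketch (not from the attached scripts): all workers must trace
# the same sequence of graphs for collectives to match up across workers.
import torch
import torch_xla.core.xla_model as xm

def train_step(x: torch.Tensor) -> torch.Tensor:
    device = xm.xla_device()
    y = x.to(device) * 2

    # Rank-dependent branching like this makes worker 0 trace extra work that
    # the other workers never see, so the workers' graphs diverge (MPMD
    # instead of SPMD) and the all-reduce below cannot line up.
    if xm.get_ordinal() == 0:
        y = y + 1

    y = xm.all_reduce(xm.REDUCE_SUM, y)
    xm.mark_step()  # cut the trace into an executable graph
    return y
```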

@wenboqian wenboqian changed the title Error: MPMD detected but reload is not supported yet for neuron distributed environment Error: MPMD detected but reload is not supported yet for neuron distributed environment with EAGER DEBUG MODE May 5, 2024
@wenboqian
Author

> Thank you for reporting the issue. The "MPMD detected" error means that one worker is trying to load a new graph while another worker is waiting to perform inference, so the graphs that contain collectives are not able to communicate with each other. The underlying assumption is that each worker performs the same set of operations, i.e., all workers run in SPMD mode; that assumption is broken here, which is why you see this error. We are looking into the scripts now to identify which worker is producing a new graph (and why), and whether we can modify the scripts so that graphs with collectives do not change from one iteration to the next. Will report back once we have an update.

Thanks for your detailed explanation! I'd also like to know: will this error interrupt the process of generating HLO graphs?
My assumption is that I can't get all of the HLO graphs because the graph-generation process will be shut down by this error. Is that right?

@aws-rhsoln
Contributor

Only one graph can be executed at a time. In this case, since the run errored out at this graph, you can only generate graphs up to this point. If you want to generate all the graphs without worrying about execution, you can run with neuron_parallel_compile. The utility should help extract all the HLOs, compile them, and save them in the cache.
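For example (assuming the attached run_simple_model_tp_pp.sh is your launch entry point and simply launches the training command), prefixing the launch should be enough: `neuron_parallel_compile ./run_simple_model_tp_pp.sh`.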

@aws-rhsoln
Contributor

We have managed to reproduce the issue. The issue happens only with eager mode. There seems to be a bug that causes two collectives with different replica groups to be part of the same graph, so that graph now has to communicate with two other graphs at the same time. If you look at one of the graphs, it contains the following collectives:

%all-reduce = (bf16[1,4]{1,0}, bf16[]) all-reduce(bf16[1,4]{1,0} %p1, bf16[] %convert), replica_groups={{0,8},{16,24},{1,9},{17,25},{2,10},{18,26},{3,11},{19,27},{4,12},{20,28},{5,13},{21,29},{6,14},{22,30},{7,15},{23,31}}, constrain_layout=true, to_apply=%AddComputation.7, metadata={op_type="xla__cross_replica_sum" op_name="NxDModel[model]/NxDPPModel[module]/xla__cross_replica_sum" source_file="/home/ubuntu/aws_neuron_venv/lib/python3.8/site-packages/torch_xla/core/xla_model.py" source_line=590}
...
%all-reduce.1 = (bf16[1,4]{1,0}, bf16[]) all-reduce(bf16[1,4]{1,0} %p2, bf16[] %convert.2), replica_groups={{8,16},{24,0},{9,17},{25,1},{10,18},{26,2},{11,19},{27,3},{12,20},{28,4},{13,21},{29,5},{14,22},{30,6},{15,23},{31,7}}, constrain_layout=true, to_apply=%AddComputation.20, metadata={op_type="xla__cross_replica_sum" op_name="NxDModel[model]/NxDPPModel[module]/xla__cross_replica_sum" source_file="/home/ubuntu/aws_neuron_venv/lib/python3.8/site-packages/torch_xla/core/xla_model.py" source_line=590}

As you can see, it is trying to send and receive a tensor at the same time; ideally these should be in two separate graphs. We will look into this issue and update this ticket when we have a fix.
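To make the pattern concrete, here is a hypothetical sketch (the replica groups are illustrative, not the exact ones from your run) of two collectives with different replica groups: if both get traced into one graph, that single graph has to talk to two differently grouped sets of peers at once, which is what the HLO above shows.

```python
# Hypothetical sketch of the problematic pattern (illustrative replica groups,
# not the exact ones from the dump above).
import torch
import torch_xla.core.xla_model as xm

SEND_GROUPS = [[0, 8], [16, 24], [1, 9], [17, 25]]   # e.g. pairs for "send to next stage"
RECV_GROUPS = [[8, 16], [24, 0], [9, 17], [25, 1]]   # e.g. pairs for "receive from previous stage"

def exchange(send_buf: torch.Tensor, recv_buf: torch.Tensor):
    # Two all-reduces with different replica groups. If nothing separates
    # them, both collectives land in one graph, and that graph must send and
    # receive with two different peer groupings at the same time.
    sent = xm.all_reduce(xm.REDUCE_SUM, send_buf, groups=SEND_GROUPS)
    xm.mark_step()  # splitting here keeps each collective in its own graph
    received = xm.all_reduce(xm.REDUCE_SUM, recv_buf, groups=RECV_GROUPS)
    xm.mark_step()
    return sent, received
```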
Note: Eager debug mode is meant mainly for single-worker workloads; distributed workloads with eager mode are not supported yet.
If the intention is mainly to debug the script, you can make use of this guide: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programm[…]training/pytorch-neuron-debug.html?highlight=print

@jeffhataws
Contributor

@wfckl789 Just wanted to check in to see if you were able to make forward progress?

@wenboqian
Author

wenboqian commented May 16, 2024

> @wfckl789 Just wanted to check in to see if you were able to make forward progress?

In this case, as I observed, the cc compiler raised this compilation fault before the forward pass was executed. So I don't think it made forward progress, because I didn't see the loss value from the first epoch.

@aws-rhsoln
Contributor

Any particular reason for trying eager mode in a multi-worker case? Note: Eager debug mode is only for debugging and is not the most performant mode of execution.

@aws-taylor added the bug (Something isn't working) label on Nov 11, 2024