
Error: MPMD detected but reload is not supported yet for neuron distributed environment with EAGER DEBUG MODE #21

Open
wenboqian opened this issue May 4, 2024 · 7 comments
Labels: bug (Something isn't working)

Comments

@wenboqian

Hi, I found that the error MPMD detected but reload is not supported yet occurs when I enable Eager Debug Mode for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you help check this issue? Thanks so much!

(screenshot of the error output)

I've attached the related scripts here; you can simply run ./run_simple_model_tp_pp.sh after downloading them.

scripts.zip

Environment information:

EC2 Instance: trn1.32xlarge

OS: Ubuntu 20.04

Neuron PyTorch: latest 2.18 release

@aws-rhsoln
Contributor

Thank you for reporting the issue. The "MPMD detected" error means that one worker is trying to load a new graph while another worker is waiting to perform inference, so the graphs that contain collectives are not able to communicate with each other. The underlying assumption is that each worker performs the same set of operations, i.e., all workers run in SPMD mode; that assumption is broken here, which is why you see this error.
We are looking into the scripts now to identify which worker is producing a new graph (and why), and whether we can modify the scripts so that graphs with collectives do not change from one iteration to the next. Will report back once we have an update.
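As a rough illustration of what breaks the SPMD assumption (a hypothetical sketch, not taken from your attached scripts): if one worker traces different work around a collective than its peers, the collectives can no longer pair up across workers.

```python
# Hypothetical sketch (not from the attached scripts): all workers must trace
# the same sequence of graphs for collectives to match up across workers.
import torch
import torch_xla.core.xla_model as xm

def train_step(x: torch.Tensor) -> torch.Tensor:
    device = xm.xla_device()
    y = x.to(device) * 2

    # Rank-dependent branching like this makes worker 0 trace extra work that
    # the other workers never see, so the workers' graphs diverge (MPMD
    # instead of SPMD) and the all-reduce below cannot line up.
    if xm.get_ordinal() == 0:
        y = y + 1

    y = xm.all_reduce(xm.REDUCE_SUM, y)
    xm.mark_step()  # cut the trace into an executable graph
    return y
```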

@wenboqian wenboqian changed the title Error: MPMD detected but reload is not supported yet for neuron distributed environment Error: MPMD detected but reload is not supported yet for neuron distributed environment with EAGER DEBUG MODE May 5, 2024
@wenboqian
Author

> Thank you for reporting the issue. The "MPMD detected" error means that one worker is trying to load a new graph while another worker is waiting to perform inference, so the graphs that contain collectives are not able to communicate with each other. The underlying assumption is that each worker performs the same set of operations, i.e., all workers run in SPMD mode; that assumption is broken here, which is why you see this error. We are looking into the scripts now to identify which worker is producing a new graph (and why), and whether we can modify the scripts so that graphs with collectives do not change from one iteration to the next. Will report back once we have an update.

Thanks for your detailed explanation! I'd also like to know: will this error interrupt the process of generating HLO graphs?
My assumption is that I can't get all of the HLO graphs because the graph-generation process will be shut down by this error. Is that right?

@aws-rhsoln
Contributor

Only one graph can be executed at a time. In this case, since the run errored out at this graph, you can only generate graphs up to this point. If you want to generate all the graphs without worrying about execution, you can run with neuron_parallel_compile. The utility should help extract all the HLOs, compile them, and save them in the cache.
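For example (assuming the attached run_simple_model_tp_pp.sh is your launch entry point and simply launches the training command), prefixing the launch should be enough: `neuron_parallel_compile ./run_simple_model_tp_pp.sh`.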

@aws-rhsoln
Contributor

We have managed to reproduce the issue. The issue happens only with eager mode. There seems to be a bug that causes two collectives with different replica groups to be part of the same graph, so that graph now has to communicate with two other graphs at the same time. If you look at one of the graphs, it contains the following collectives:

%all-reduce = (bf16[1,4]{1,0}, bf16[]) all-reduce(bf16[1,4]{1,0} %p1, bf16[] %convert), replica_groups={{0,8},{16,24},{1,9},{17,25},{2,10},{18,26},{3,11},{19,27},{4,12},{20,28},{5,13},{21,29},{6,14},{22,30},{7,15},{23,31}}, constrain_layout=true, to_apply=%AddComputation.7, metadata={op_type="xla__cross_replica_sum" op_name="NxDModel[model]/NxDPPModel[module]/xla__cross_replica_sum" source_file="/home/ubuntu/aws_neuron_venv/lib/python3.8/site-packages/torch_xla/core/xla_model.py" source_line=590}
...
%all-reduce.1 = (bf16[1,4]{1,0}, bf16[]) all-reduce(bf16[1,4]{1,0} %p2, bf16[] %convert.2), replica_groups={{8,16},{24,0},{9,17},{25,1},{10,18},{26,2},{11,19},{27,3},{12,20},{28,4},{13,21},{29,5},{14,22},{30,6},{15,23},{31,7}}, constrain_layout=true, to_apply=%AddComputation.20, metadata={op_type="xla__cross_replica_sum" op_name="NxDModel[model]/NxDPPModel[module]/xla__cross_replica_sum" source_file="/home/ubuntu/aws_neuron_venv/lib/python3.8/site-packages/torch_xla/core/xla_model.py" source_line=590}

As you can see, it is trying to send and receive a tensor at the same time; ideally these should be in two separate graphs. We will look into this issue and update this ticket when we have a fix.
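To make the pattern concrete, here is a hypothetical sketch (the replica groups are illustrative, not the exact ones from your run) of two collectives with different replica groups: if both get traced into one graph, that single graph has to talk to two differently grouped sets of peers at once, which is what the HLO above shows.

```python
# Hypothetical sketch of the problematic pattern (illustrative replica groups,
# not the exact ones from the dump above).
import torch
import torch_xla.core.xla_model as xm

SEND_GROUPS = [[0, 8], [16, 24], [1, 9], [17, 25]]   # e.g. pairs for "send to next stage"
RECV_GROUPS = [[8, 16], [24, 0], [9, 17], [25, 1]]   # e.g. pairs for "receive from previous stage"

def exchange(send_buf: torch.Tensor, recv_buf: torch.Tensor):
    # Two all-reduces with different replica groups. If nothing separates
    # them, both collectives land in one graph, and that graph must send and
    # receive with two different peer groupings at the same time.
    sent = xm.all_reduce(xm.REDUCE_SUM, send_buf, groups=SEND_GROUPS)
    xm.mark_step()  # splitting here keeps each collective in its own graph
    received = xm.all_reduce(xm.REDUCE_SUM, recv_buf, groups=RECV_GROUPS)
    xm.mark_step()
    return sent, received
```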
Note: Eager debug mode is meant mainly for single-worker workloads; distributed workloads with eager mode are not supported yet.
If the intention is mainly to debug the script, you can make use of this guide: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programm[…]training/pytorch-neuron-debug.html?highlight=print

@jeffhataws
Contributor

@wfckl789 Just wanted to check in to see if you were able to make forward progress?

@wenboqian
Author

wenboqian commented May 16, 2024

> @wfckl789 Just wanted to check in to see if you were able to make forward progress?

In this case, as I observed, the cc compiler raised this compilation fault before the forward pass was executed. So I don't think it made forward progress, because I didn't see the loss value from the first epoch.

@aws-rhsoln
Contributor

Any particular reason for trying eager mode in a multi-worker case? Note: Eager debug mode is only for debugging and is not the most performant mode of execution.

@aws-taylor added the bug (Something isn't working) label on Nov 11, 2024