MSCCL all-to-all performance did not improve compared with NCCL #48
Hi @Musisoul, I think you used the original nccl-tests, which doesn't have the MSCCL alltoall support; please try the forked version of nccl-tests instead. Regarding the performance drop: clearly for cross-node communication, some networking interface will be used. If you can provide your log with more debugging output, we can look into it further. Lastly, please note that in-place alltoall is impossible to perform correctly. As you can see in your logs, the #wrong column is N/A for in-place alltoall even for NCCL runs. The algorithms that you generated with msccl-tools are only valid for out-of-place as well. I hope this helps!
Thanks for your reply! I'll change the version of nccl-tests and try again.
Thanks for the kind reply. We tested according to the msccl README. Do the changes in alltoall.cu mean that msccl generated a new API?
I found msccl2DAllToAll in msccl; it seems there are two kinds of all-to-all in msccl?
Three questions:
Hi, I have tried the forked version of nccl-tests. The result is different from before, but the in-place result is extremely high compared with out-of-place, which seems abnormal. Here is the 2-node (16 GPU) all-to-all test result:
The complete logs are attached below. What is the correct usage of the all-to-all tests? Does it differ from the allreduce tests, e.g. generating XMLs in msccl-tools and the startup settings (NCCL_ALGO=MSCCL,RING,TREE) in nccl-tests?
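For reference, here is roughly the recipe we followed. This is a sketch assembled from the commands in this issue; it assumes msccl-tools and the forked nccl-tests are checked out in the current directory, and the XML is the one we used for the 8-node/64-GPU runs:

```bash
# 1. Generate the MSCCL algorithm XML with msccl-tools.
python msccl-tools/examples/mscclang/alltoall_a100_two_step.py --protocol=LL 8 8 > two_step_64.xml

# 2. Point MSCCL at the XML and allow the MSCCL algorithm alongside the built-ins.
export MSCCL_XML_FILES=$PWD/two_step_64.xml
export NCCL_ALGO=MSCCL,RING,TREE

# 3. Run the alltoall benchmark from the forked nccl-tests.
nccl-tests/build/alltoall_perf -b 1MB -e 1024MB -f 2 -g 1 -n 100 -w 100
```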
This is an interesting topology you have! 8xA100s with only a single IB! I think your IB has 200Gbps BW, which explains your numbers. I think one could design a much better algorithm for your specific topology, but the one you are currently using seems to be delivering pretty good performance. If you really want the maximum performance, we can help you design the algorithm for your topology.
That's correct. NCCL made the decision not to make alltoall a collective with an API, but RCCL (AMD's version of NCCL) and any MPI implementation have alltoall as an API. Because of this, we apply a patch to PyTorch to change alltoall's implementation into a call to our API. Recent PyTorch versions have this API for RCCL.
These numbers make sense to me given your topology. Regarding in-place: note that the error column is non-zero, which means that the implementation is incorrect. The reason is that in the forked repo we completely disabled in-place alltoall, as no one expects in-place alltoall to work correctly. That's why the in-place numbers are crazy high: no communication is performed and the call returns immediately.
Great questions! AllToAll: msccl2DAllToAll is triggered if the
I hope this helps.
Sure, we really want the maximum performance on 42 A100 nodes with only one 200Gbps IB per node. It would be great if you could help design the algorithm. BTW, I've read the GC3 paper and found it inspiring; it's very useful and effective in this case.
What input sizes do you need the alltoall for? Your algBW at max will be bounded by the single IB per node. Regarding the proper algorithm for your use case: you might want to use something like the 3-step algorithm for alltoall, except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0), (2) GPU 0 then sends the cross-node data to local GPU 0 on the other nodes, and (3) lastly, GPU 0 scatters the data to everyone. Thus the name 3-step algorithm (see the sketch below). Hope this helps.
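To make the relay logic concrete, here is a small sketch of the schedule I have in mind. This is plain Python pseudocode with a made-up `send` callback, not the msccl-tools DSL:

```python
# Hypothetical pseudocode for the single-relay 3-step alltoall on N nodes x G GPUs.
# send(src, dst, what) is a made-up callback standing in for a scheduled copy;
# this only illustrates the relay logic, it is not an actual MSCCL schedule.
def three_step_single_relay(N, G, send):
    for src_node in range(N):
        relay = src_node * G  # global rank of local GPU 0 on src_node
        # Step 1: every other local GPU hands its cross-node chunks to the relay GPU.
        for local in range(1, G):
            send(src_node * G + local, relay, "chunks destined for other nodes")
        # Step 2: the relay GPU ships the aggregated data to local GPU 0 of each other node.
        for dst_node in range(N):
            if dst_node != src_node:
                send(relay, dst_node * G, f"all of node {src_node}'s data for node {dst_node}")
    # Step 3: each relay GPU scatters what it received to its local peers.
    for node in range(N):
        relay = node * G
        for local in range(1, G):
            send(relay, node * G + local, "chunks received from other nodes")
```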
256MiB. About the relay logic, do we need to find the proper local GPU? I don't know how to find the GPU which is nearest to the IB; the BytePS paper says this will be faster?
As Jack47 says, the size we need is 256MiB. BTW, I have tried nccl-tests (the forked version) and compared the time and busBW between the NCCL and MSCCL algos, both out-of-place. It seems that the MSCCL algos did not improve things. I used 8 nodes to test and the size was 256MiB. What do you mean by the quoted statement?
Thank you!
See if this helps: https://github.com/microsoft/inspector-topo. Another good option is inspecting the PCIe topology directly (e.g. with nvidia-smi topo -m).
I am not surprised. Both the 3-step and 2-step algorithms over-subscribe the single IB on each node; neither was designed for your topology. I think a modified 3-step algorithm where all cross-node traffic goes through a single GPU would work best. NCCL gives you 2.15, but theoretically (based purely on bandwidth) you should be able to get around 3.4, and the reason for this gap is the suboptimal algorithm NCCL is using for your topology. Can you please share your 64-GPU results for a large range from 1KB to 4GB (or however large it can run)?
Here is the result:
There is the answer: it's connected to GPUs 0 and 1 over a PCIe switch.
Cool! Can we simply modify CrossNodeGpus to just use local rank 0 to implement the algo you mentioned? By the way, what's the difference between the alltoall two-step and three-step? Two-step: (1) gather cross-node traffic to a local GPU, and (2) that local GPU then sends the cross-node data to its local GPU peers on other nodes. So in two-step we don't group chunks into a larger one?
Exactly! BTW, I am not sure if it will work better for you; I am just judging from my intuition.
In the two-step algorithm, all of the cross-node traffic destined for, say, local GPU 6 on node 0 goes via local GPU 6 on the other nodes. So local GPU i on node j aggregates the traffic from the other local GPUs on node j that needs to go to local GPU i on other nodes. In the three-step algorithm, the aggregation is more aggressive, and it was only beneficial in our experiments at 64-node scale and larger: node i and node j communicate via only one local GPU on each side (the CrossNodeGpus logic). In your topology, because you have only a single IB, the three-step algorithm with its more aggressive aggregation would make more sense. We will never know until we try! Also, try the NPKit profiler for further analysis of your algorithm!
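In other words, the difference is just in which local GPU relays a given piece of cross-node traffic. A tiny sketch of that mapping (hypothetical helper functions, not actual MSCCL code):

```python
# Hypothetical sketch of which local GPU relays a chunk's cross-node hop.
# A rank is written as a (node, local_gpu) pair; this is only an illustration.

def relay_two_step(src, dst):
    # Two-step: traffic headed for local GPU i on the destination node is funneled
    # through local GPU i on the source node, i.e. one IB connection per
    # (node pair, local rank) combination.
    return (src[0], dst[1])

def relay_three_step(src, dst, cross_node_gpu=0):
    # Three-step: all traffic between a pair of nodes goes through one designated
    # local GPU on each side (the CrossNodeGpus logic); with a single IB per node,
    # local rank 0 is a natural choice.
    return (src[0], cross_node_gpu)

# Example: a chunk going from (node 0, GPU 3) to (node 2, GPU 6)
print(relay_two_step((0, 3), (2, 6)))    # (0, 6): GPU 6 on node 0 relays it
print(relay_three_step((0, 3), (2, 6)))  # (0, 0): GPU 0 on node 0 relays it
```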
Many thanks for your help; we will use the NPKit profiler to get more information. BTW, after finding a way to improve all-to-all performance, we want to improve all-reduce latency at the same scale, so I think NPKit will help further.
Hi @saeedmaleki, we have tried to modify the 3-step algo in the way described above, but the nccl-tests result does not seem to be better than NCCL. We also profiled the algo with NPKit, and the NPKit trace result is hard to understand. Could you please have a look at this result? We tested on 2 nodes (16 GPUs). Thank you!
Here is the result of nccl-tests (using the original NCCL) on 64 GPUs for a large range from 1KB to 4GB:
This looks like a good algorithm. I don't expect it to behave much differently for two nodes, but I believe you should see better results at larger scale. Also, the trace file seems to be only for the default NCCL algorithm. What size did you run to get this trace file?
These numbers are a bit unexpected. Are you sure that your IB is 200Gbps? With 200Gbps, theoretically, you should be seeing 3.57GBps algBW. If we consider 90% efficiency of the IB, we should be seeing around 3.21GBps, and that is 2.5x of missing performance. Can you please run AllGather on two nodes over a long range of sizes and share the numbers? That way we can be sure about the BW of the IB.
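For reference, this is the back-of-the-envelope bound I am using; a sketch, taking 25GBps as the byte-rate of a 200Gbps NIC and your 8-node x 8-GPU setup:

```python
# Rough alltoall algBW bound with one NIC of bandwidth B GBps per node:
# each GPU must push (N-1)/N of its buffer S off-node, and all G GPUs on a node
# share that NIC, so time >= G * S * (N-1) / (N * B) and
# algBW = S / time <= B * N / (G * (N-1)).
def alltoall_algbw_bound(nic_GBps, nodes, gpus_per_node, efficiency=1.0):
    return nic_GBps * nodes / (gpus_per_node * (nodes - 1)) * efficiency

print(alltoall_algbw_bound(25, 8, 8))       # ~3.57 GBps theoretical ceiling for 8x8 GPUs
print(alltoall_algbw_bound(25, 8, 8, 0.9))  # ~3.21 GBps assuming 90% IB efficiency
```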
There is also an inconsistency: 1.30GBps algBW at 4GB input size on 64 GPUs. Earlier you got >2GBps for the same setting, as I can see from the logs you shared. Right?
Well, we did use the modified 3-step MSCCL algo, because "Connected 1 MSCCL algorithm" was in the log. The size was only 256MiB.
Then everything should have worked just fine. Can you please share your 3-step algorithm so that I can check it from my side? There might be some bug in MSCCL which we need to fix.
Here is the all-gather result of 2 nodes on a long range of sizes:
OK great! This is >12.5GBps, which means that your IB's BW is close to 25GBps (i.e. 200Gbps). Therefore, your AllToAll can be hugely improved by using the right algorithm. Please share your 3-step algorithm so that I can take a look :).
Okay. Thanks!
There could be some fluctuation, because we tested on a shared cluster and the machines were not fixed. BTW, the IBs are not all 200Gbps; each node has 3 IBs. Here is the result of ibstatus on one node:
You need to look at the NCCL log.
It should use the fastest one (0), because we can find this info in the previous logs:
Wow, great news. Could you give some hints on it? We want to use small chunks over IB to make it faster, but it seems MSCCL currently uses fixed chunks in all-to-all?
@saeedmaleki long time no see!
Hi @Jack47, I took a look at your algorithm and it seems that the local-node communication has too many steps, which can be optimized. To evaluate this, I would suggest two things: Do these steps make sense?
Hi @saeedmaleki, many thanks for your advice. Here is the two-node NPKit result: https://github.com/Musisoul/NPKIT-results/blob/main/trace.zip. We will try one GPU per node on 8 nodes.
It seems that this is still using the default NCCL algorithm. Did you try the algorithm you developed for this run?
Maybe we should try again, but the previous NPKit result was generated with the modified 3-step algo. We will also try one GPU per node on 8 nodes. These tests may take some time.
Sorry for not replying for a long time. We have tried the all-to-all test with one GPU per node on 8 nodes. Here is the result:
The log is attached below. We have tried NPKit on 2 nodes using our modified three-step algo. Here is the result: We found that when using tools/npkit_trace_generator.py to generate the JSON from npkit_dump_dir, there were some bugs in the Python script.
We circumvented these bugs and generated the JSON. Previously you said "it seems that this is still using the default NCCL algorithm"; will these bugs affect the results? Thank you!
Previously we provided you with the modified three-step algo; what is its performance on your machines? Does the code need to be improved, or does it have bugs? We are looking forward to your reply. Thank you!
Update: that file may be too large for the trace viewer to open, so we retested the modified three-step algo with fewer iterations and got this result:
This is your key result pinpointing the problem. You are getting 9.86 GBps busBW, which is a good 2x off; it should have been ~22-25 GBps, which is your IB's BW. You might have a bad node in the system. I suggest narrowing the experiment down to 2 and 4 GPUs per run to find the problematic node. As far as I remember your AllGather result had great numbers, so it seems like something is off. Without fixing this issue, the maximum busBW you may get on 64 GPUs is 9.86/8 ≈ 1.23 GBps, which is way below what it needs to be. Please let me know about your investigation and we can find the problematic node.
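Something along these lines could help narrow it down. This is only a sketch: it assumes an MPI launcher, placeholder hostnames, and one GPU per node per run; adapt it to however you launch nccl-tests on your cluster:

```bash
# Hypothetical sketch: run the alltoall test for every pair of nodes (one GPU each)
# and compare busBW; pairs containing the bad node should stand out.
NODES=(node01 node02 node03 node04 node05 node06 node07 node08)   # placeholder hostnames
for ((i = 0; i < ${#NODES[@]}; i++)); do
  for ((j = i + 1; j < ${#NODES[@]}; j++)); do
    echo "=== ${NODES[i]} <-> ${NODES[j]} ==="
    mpirun -np 2 -H "${NODES[i]}:1,${NODES[j]}:1" \
      nccl-tests/build/alltoall_perf -b 256MB -e 256MB -g 1 -n 100 -w 100
  done
done
```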
I couldn't open the zip file; could you please reupload it?
I can unzip the npkit_event_trace_20230108.zip, so should I upload the JSON directly?
#48 (comment): it seems nccl-tests doesn't refer to msccl correctly.
Oops, sorry I dropped the ball on this one. Yes please.
I suggest trying to give it an absolute path instead of a relative one.
Hi, I see that since NCCL 2.12, the default NCCL primitives have used PXN to optimize cross-node traffic. PXN also aggregates cross-node messages on the same local GPU and then sends them over the shared connections. So, is there any difference between the MSCCL two-step algorithm and the PXN optimization in NCCL?
From our experience, there are still differences at 64 GPUs and up, but PXN does a pretty good job as well. After 1024 GPUs, though, you will need to switch to the 3-step algorithm. I think PXN is disabled beyond a certain number of GPUs.
If I understand correctly, in the 3-step algorithm only one GPU's NIC will be used to exchange cross-node traffic. If one node has 8 GPUs and 8 NICs, the available bandwidth will be decreased to 1/8. Could you share more of your experience on why the 3-step algorithm can improve performance beyond 1024 GPUs?
Not exactly. Imagine you have 9 nodes in total. On node-0, NIC-0 will be talking to node-1, NIC-1 will be talking to node-2, ..., and NIC-7 will be talking to node-8. So you can think of the first step as a local gather operation, the second step as cross-node communication, and the last step as a local scatter operation. This reduces the number of cross-node connections by another 8x over the 2-step algorithm. In general, for a perfect load balance, we need 8K+1 nodes for the 3-step algorithm. However, at scale (1024 GPUs, for example), the load imbalance is not too bad.
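A tiny sketch of that assignment, just to make the load-balance argument concrete (the offset mapping below is my own illustration, not necessarily the exact CrossNodeGpus logic):

```python
# Illustrative NIC assignment for the 3-step alltoall: NIC g on node i relays the
# traffic between node i and node (i + g + 1) mod N. With N = 8K + 1 nodes and
# 8 NICs per node, every NIC handles exactly K remote peers, so the load balances.
def nic_for_peer(node, peer, num_nodes, nics_per_node=8):
    offset = (peer - node) % num_nodes   # 1 .. num_nodes - 1
    return (offset - 1) % nics_per_node

# Example with 9 nodes: on node 0, NIC 0 talks to node 1, NIC 1 to node 2, ..., NIC 7 to node 8.
for peer in range(1, 9):
    print(f"node 0 -> node {peer}: NIC {nic_for_peer(0, peer, 9)}")
```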
Hi, I have tried the nccl-tests alltoall_perf benchmark on 1/2/8 nodes with 8xA100 GPUs and found that the performance of MSCCL (in-place) did not improve compared with NCCL (out-of-place). My MSCCL_XML_FILES were generated by `python msccl-tools/examples/mscclang/alltoall_a100_two_step.py --protocol=LL 8 8 > two_step_64.xml`. I also tried alltoall_a100_three_step.py and alltoall_allpairs.py; they all behaved similarly. The test command is `nccl-tests/build/alltoall_perf -b 1MB -e 1024MB -f 2 -g 1 -n 100 -w 100`, and I used 8/16/64 GPUs to run it, corresponding to 1/2/8 nodes. The alltoall-test result of 8 nodes is like this:
I also find that the Avg bus bandwidth drops sharply on multi-node runs (2/8 nodes) compared with one node. I have attached the logs of 8/16/64 GPUs below. Thank you!
gpu8-two_step.log
gpu16-two_step.log
gpu64-two_step.log