MSCCL all-to-all performance did not improve compared with NCCL #48

Open
Musisoul opened this issue Nov 17, 2022 · 54 comments
Comments

@Musisoul

Hi, I have tried the nccl-tests alltoall_perf test on 1/2/8 nodes with 8xA100 GPUs each and found that the performance of MSCCL (in-place) did not improve compared with NCCL (out-of-place). My MSCCL_XML_FILES were generated by python msccl-tools/examples/mscclang/alltoall_a100_two_step.py --protocol=LL 8 8 > two_step_64.xml. I also tried alltoall_a100_three_step.py and alltoall_allpairs.py; they all behaved similarly.
The test command is nccl-tests/build/alltoall_perf -b 1MB -e 1024MB -f 2 -g 1 -n 100 -w 100, and I used 8/16/64 GPUs to run it, corresponding to 1/2/8 nodes.
The alltoall test result on 8 nodes is:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576          4096     float    none      -1   9012.2    0.12    0.11      0    561.5    1.87    1.84    N/A
     2097152          8192     float    none      -1   1067.7    1.96    1.93      0   1046.3    2.00    1.97    N/A
     4194304         16384     float    none      -1   2010.8    2.09    2.05      0   2023.0    2.07    2.04    N/A
     8388608         32768     float    none      -1   5698.5    1.47    1.45      0   4261.4    1.97    1.94    N/A
    16777216         65536     float    none      -1   8339.5    2.01    1.98      0   8211.3    2.04    2.01    N/A
    33554432        131072     float    none      -1    16235    2.07    2.03      0    16281    2.06    2.03    N/A
    67108864        262144     float    none      -1    32252    2.08    2.05      0    51440    1.30    1.28    N/A
   134217728        524288     float    none      -1    63877    2.10    2.07      0    83221    1.61    1.59    N/A
   268435456       1048576     float    none      -1   147334    1.82    1.79      0   142747    1.88    1.85    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.77934 

I also found that the average bus bandwidth drops sharply on multiple nodes (2/8) compared with a single node. I have attached the logs for 8/16/64 GPUs below. Thank you!
gpu8-two_step.log
gpu16-two_step.log
gpu64-two_step.log

@saeedmaleki
Contributor

Hi @Musisoul, I think you used the original nccl-tests, which doesn't have ncclAllToAll, so all of your logs correspond to NCCL's alltoall algorithm. Please change the way nccl-tests' alltoall.cu calls alltoall to this way. Alternatively, you could just use this forked version of nccl-tests to test alltoall performance.

Regarding the performance drop: for cross-node communication, some networking interface has to be used. If you can provide your log with the NCCL_DEBUG=INFO environment variable set, I would be able to help you better.

Lastly, please note that in-place alltoall is impossible to perform correctly. As you can see in your logs, the #wrong column is N/A for in-place alltoall even for NCCL runs. The algorithms that you generated with msccl-tools are only valid for out-of-place as well.

I hope this helps!

@Musisoul
Author

Thanks for your reply! I'll change the version of nccl-tests and try again.
As for the performance drop, I have attached the logs using NCCL_DEBUG=INFO below.
gpu8-two_step_complete.log
gpu64-two_step_complete.log

@Jack47

Jack47 commented Nov 18, 2022

Thanks for the kind reply. We tested according to the MSCCL README. Do the changes in alltoall.cu mean that MSCCL adds a new API, ncclAllToAll, which is currently not compatible with the NCCL API? NCCL currently doesn't have this API.

@Jack47

Jack47 commented Nov 18, 2022

I found msccl2DAllToAll in MSCCL. It seems there are two kinds of all-to-all in MSCCL?

  1. the natively implemented msccl2DAllToAll
  2. algorithms generated using msccl-tools

Three questions:

  1. What's the correct way to make MSCCL use the native msccl2DAllToAll?
  2. Is NCCL_ALGO used to specify the priority order of the available algorithms?
  3. When generating allreduce algorithms in msccl-tools, we have to specify the number of nodes. Does this mean that whenever I change the training scale, I must first generate the corresponding MSCCL XMLs?

@Musisoul changed the title from "MSCCL performance did not improve compared with NCCL" to "MSCCL all-to-all performance did not improve compared with NCCL" on Nov 18, 2022
@Musisoul
Author

Hi, I have tried the forked version of nccl-tests. The result differs from the previous one, but the in-place bandwidth is extremely high compared with out-of-place, which seems abnormal. Here is the 2-node (16 GPUs) all-to-all test result:

#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576         16384   float            571.5    1.83    1.72  0e+00     0.50  2087.88  1957.39  1e+00
     2097152         32768   float           1043.0    2.01    1.88  0e+00     0.21  9985.96  9361.84  1e+00
     4194304         65536   float           1972.5    2.13    1.99  0e+00     0.22  18941.04  17757.23  1e+00
     8388608        131072   float           3909.3    2.15    2.01  0e+00     0.51  16433.75  15406.64  1e+00
    16777216        262144   float           7596.3    2.21    2.07  0e+00     0.45  37659.30  35305.59  1e+00
    33554432        524288   float            14892    2.25    2.11  0e+00     0.17  196339.57  184068.34  1e+00
    67108864       1048576   float            29605    2.27    2.13  0e+00     0.21  320099.52  300093.30  1e+00
   134217728       2097152   float            33285    4.03    3.78  0e+00     0.49  271437.56  254472.71  1e+00
   268435456       4194304   float            76900    3.49    3.27  0e+00     0.14  1938582.05  1817420.67  1e+00
   536870912       8388608   float           181940    2.95    2.77  0e+00     0.40  1341070.90  1257253.97  1e+00
  1073741824      16777216   float           285011    3.77    3.53  0e+00     0.47  2292116.18  2148858.92  1e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 274636 

The complete logs are attached below.
n8-two_step-complete.log
n16-two_step-complete.log

What's the correct usage of the all-to-all tests? Does it differ from the allreduce tests, for example in how the XMLs are generated in msccl-tools or in the launch settings (e.g. NCCL_ALGO=MSCCL,RING,TREE) in nccl-tests?
Thank you!

@saeedmaleki
Contributor

This is an interesting topology you have! 8xA100s with only a single IB! I think your IB has 200Gbps BW which explains your numbers. I think one can design a much better algorithm for your specific topology but the current one you are currently using seems to be delivering a pretty good performance. If you really want the maximum performance, we can help you design the algorithm for your topology.

@saeedmaleki
Contributor

Thanks for the kindly reply. We test according to msccl readme. the changes in alltoall.cu means msccl generated new API ncclAllToAll, which is currently not compatible with nccl API? nccl currently doesn't have this API.

That's correct. NCCL made the decision not to make alltoall a collective with an API, but RCCL (AMD's version of NCCL) and every MPI implementation have alltoall as an API. Because of this, we apply a patch to PyTorch that replaces its alltoall implementation with a call to our API. Recent PyTorch versions have this API for RCCL.

@saeedmaleki
Contributor

These numbers make sense to me given your topology. Regarding in-place: note that the error column is non-zero, which means that the result is incorrect. The reason is that in the forked repo we completely disabled in-place alltoall, as no one expects in-place alltoall to work correctly. That's why the in-place numbers are crazy high: the in-place runs do not perform any communication and return immediately.
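
The bandwidth columns can be re-derived from the size and time columns, which makes the in-place anomaly easy to see. This is a minimal sketch, assuming nccl-tests reports algbw as size divided by time and scales the all-to-all busbw by (n-1)/n; the function names are illustrative and not part of nccl-tests.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s: per-rank buffer size divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def busbw_gbps(size_bytes, time_us, nranks):
    """All-to-all bus bandwidth in GB/s (assumed scaling: algbw * (n - 1) / n)."""
    return algbw_gbps(size_bytes, time_us) * (nranks - 1) / nranks


# First row of the 16-GPU table: (size in bytes, time in us, error column)
rows = {
    "out-of-place": (1048576, 571.5, 0.0),
    "in-place": (1048576, 0.50, 1.0),
}

for label, (size, time_us, error) in rows.items():
    bw = busbw_gbps(size, time_us, nranks=16)
    status = "valid" if error == 0.0 else "invalid (non-zero error)"
    print(f"{label}: busbw = {bw:.2f} GB/s, {status}")
```

The out-of-place row reproduces the reported 1.72 GB/s, while the in-place row works out to roughly 2 TB/s, far beyond what a single 200Gbps link can carry, so it can only come from a call that returned without moving data.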

@saeedmaleki
Contributor

I found msccl2DAllToAll in msccl , seems there are two kinds of all2all in msccl?

  1. native implemented msccl2DAllToAll
  2. algorithms generated using msccl-tools

Three questions:

  1. what's the correct way to let msccl use native msccl2DAllToAll?
  2. is NCCL_ALGO used to specify the priority order of available algos?
  3. when generating allreduce algos in msccl-tools, we should specify num nodes, this means whenever I changed the training scale, I should first generate coordinate msccl xmls?

Great questions!

AllToAll: msccl2DAllToAll is triggered if the name field inside the XML is just 2D. However, it only performs better than the one you are generating with msccl-tools when there are >1K GPUs. This was a hacky way for us to compare the performance of the two approaches, and it will be removed in future releases. NCCL_ALGO doesn't need to change.

msccl.init from here is an automatic way to manage these XMLs. But given that you have a unique topology, you won't necessarily get the maximum bang for the buck with the same algorithms. The ones that we have studied the performance for were on 8xA100 nodes with 8xIBs such as NDv4 SKU offered on Azure.

I hope this helps.

@Jack47

Jack47 commented Nov 18, 2022

If you really want the maximum performance, we can help you design the algorithm for your topology

Sure, we really want the maximum performance on 42 A100 nodes with only one 200Gbps IB per node. It would be great if you could help design the algorithm.

BTW, I've read the GC3 paper and found it inspiring; it's very useful and effective in this case.

@saeedmaleki
Contributor

What input sizes do you need the alltoall for? Your algBW at max should be 25GBps/(N-1). For example, for 8 nodes, the maximum you can get at very large sizes is 3.57GBps algBW.
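
One way to arrive at the 3.57 GB/s figure is to count the bytes each node must push through its single NIC. The sketch below is my own derivation, not something stated in the thread. It assumes that the size reported by nccl-tests is each rank's full send buffer, split evenly across all ranks, that all GPUs of a node share one 200Gbps (25 GB/s) NIC, and that intra-node time and protocol overhead are ignored. The function name is hypothetical.

```python
def alltoall_algbw_bound(nic_gbytes_per_s, num_nodes, gpus_per_node):
    """Upper bound on all-to-all algbw (GB/s) when one NIC per node is the bottleneck.

    Each rank sends S bytes in total, of which a fraction
    (num_nodes - 1) / num_nodes leaves the node. The NIC therefore carries
    gpus_per_node * S * (num_nodes - 1) / num_nodes bytes per all-to-all.
    """
    if num_nodes < 2:
        raise ValueError("the NIC bound only applies with at least two nodes")
    cross_node_fraction = (num_nodes - 1) / num_nodes
    return nic_gbytes_per_s / (gpus_per_node * cross_node_fraction)


nic = 200 / 8  # 200 Gbps expressed in GB/s
bound = alltoall_algbw_bound(nic, num_nodes=8, gpus_per_node=8)
print(f"{bound:.2f} GB/s")  # 25 / 7 for 8 nodes of 8 GPUs each
```

Under these assumptions, 8 nodes of 8 GPUs each give 25/7, which matches the 3.57 GB/s quoted above.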

Regarding the proper algorithm for your use case: you might wanna use something like the 3-step algorithm for alltoall except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0) and (2) then GPU 0 sends cross-node data to local GPU 0 on other nodes. (3) Lastly, GPU 0 scatters the data to everyone. Thus, the name 3-step algorithm.
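
The three steps can be checked with a small pure-Python simulation. This is only a data-movement sketch under my own assumptions: it does not use msccl-tools, the relay is implicitly local GPU 0 on every node, and intra-node chunks are delivered directly. The function name is hypothetical.

```python
def three_step_alltoall(send, num_nodes, gpus_per_node):
    """Simulate an all-to-all that relays cross-node data through one GPU per node.

    send[src][dst] is the chunk rank `src` holds for rank `dst`.
    Returns (recv, cross_node_messages) with recv[dst][src] == send[src][dst].
    """
    nranks = num_nodes * gpus_per_node
    recv = [[None] * nranks for _ in range(nranks)]

    def node_of(rank):
        return rank // gpus_per_node

    # Step 1: gather. Intra-node chunks go straight to their destination;
    # cross-node chunks are staged on the sending node's relay GPU,
    # bucketed by destination node.
    staged = [[[] for _ in range(num_nodes)] for _ in range(num_nodes)]
    for src in range(nranks):
        for dst in range(nranks):
            if node_of(src) == node_of(dst):
                recv[dst][src] = send[src][dst]
            else:
                staged[node_of(src)][node_of(dst)].append((src, dst, send[src][dst]))

    # Step 2: each relay sends one aggregated message to every other relay.
    arrived = [[] for _ in range(num_nodes)]
    cross_node_messages = 0
    for a in range(num_nodes):
        for b in range(num_nodes):
            if a != b:
                arrived[b].extend(staged[a][b])
                cross_node_messages += 1

    # Step 3: scatter. Each relay forwards the received chunks to its local GPUs.
    for b in range(num_nodes):
        for src, dst, chunk in arrived[b]:
            recv[dst][src] = chunk

    return recv, cross_node_messages


num_nodes, gpus_per_node = 2, 4
nranks = num_nodes * gpus_per_node
send = [[(src, dst) for dst in range(nranks)] for src in range(nranks)]
recv, messages = three_step_alltoall(send, num_nodes, gpus_per_node)
assert all(recv[d][s] == (s, d) for s in range(nranks) for d in range(nranks))
print(messages)  # one aggregated message per ordered pair of nodes
```

The point of the relay is visible in the message count: only one aggregated message crosses the network per ordered node pair, regardless of how many GPUs each node has.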

Hope this helps.

@Jack47

Jack47 commented Nov 21, 2022

What input sizes do you need the alltoall for?

256MiB.

About the relay logic, do we need to find the proper local GPU? I don't know how to find the GPU that is nearest to the IB NIC; the BytePS paper says this will be faster?

@Musisoul
Author

As Jack47 says, the size we need is 256MiB. BTW, I have tried nccl-tests (forked version) and compared the time and busbw of the NCCL and MSCCL algorithms, both out-of-place. The MSCCL algorithms do not seem to improve on NCCL. I used 8 nodes for the test and the size was 256MiB. What did you mean by:

the current one you are currently using seems to be delivering a pretty good performance.

Thank you!

                                  time(us)   busbw(GB/s)
NCCL (out of place)                 123058          2.15
MSCCL two_step (out of place)       165042          1.60
MSCCL three_step (out of place)     150965          1.78

@saeedmaleki
Contributor

What input sizes do you need the alltoall for?

256MiB.

About the relay logic, do we need to find proper local GPU? I don't know how to find the GPU which is nearest to IB, bypeps paper says this will be faster?

See if this helps https://github.com/microsoft/inspector-topo
We have a similar type of node on Azure and we use inspector-topo to find the IB placement. However, it might not matter much given that you have a newer machine.

Another good option is nvidia-smi topo -m which might give you more information. If you can share that log in here, that would be great.

@saeedmaleki
Contributor

I am not surprised. Both the 3-step and 2-step algorithms over-subscribe the single IB on each node. Those two were not designed for your topology. I think a modified 3-step algorithm where all cross-node traffic goes through a single GPU would work best.

NCCL gives you 2.15, but theoretically (based purely on bandwidth) you should be able to get around 3.4. The gap is due to the suboptimal algorithm NCCL is using for your topology. Can you please share your 64xGPU results for a large range from 1KB to 4GB (or however large it can run)?
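
To put the measured numbers next to the bandwidth bound, the following sketch expresses each 256MiB result as a fraction of the bound. It assumes the 3.57 GB/s algbw bound (25/7) mentioned earlier in the thread and the (n-1)/n busbw scaling for 64 ranks; this gives a busbw bound slightly above the "around 3.4" estimate given here, so read the percentages as rough indications only.

```python
NUM_RANKS = 64
ALGBW_BOUND = 25 / 7  # GB/s, the 3.57 GB/s figure for 8 nodes
BUSBW_BOUND = ALGBW_BOUND * (NUM_RANKS - 1) / NUM_RANKS  # assumed busbw scaling

# busbw (GB/s) at 256MiB on 8 nodes, taken from the results posted in the thread
measured_busbw = {
    "NCCL": 2.15,
    "MSCCL two_step": 1.60,
    "MSCCL three_step": 1.78,
}


def fraction_of_bound(busbw):
    """Share of the bandwidth-only bound that a measured busbw reaches."""
    return busbw / BUSBW_BOUND


for name, bw in measured_busbw.items():
    print(f"{name}: {bw:.2f} GB/s, {fraction_of_bound(bw):.0%} of {BUSBW_BOUND:.2f} GB/s")
```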

@Musisoul
Author

Here is the result of nvidia-smi topo -m on our machine:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     PXB     0-27,56-83      0
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     PXB     0-27,56-83      0
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    NODE    NODE    NODE    0-27,56-83      0
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    NODE    NODE    NODE    0-27,56-83      0
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     28-55,84-111    1
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     28-55,84-111    1
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     28-55,84-111    1
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     28-55,84-111    1
mlx5_0  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      PIX     PIX
mlx5_1  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     PIX      X      PIX
mlx5_2  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     PIX     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

@saeedmaleki
Contributor

There is the answer, it's connected to GPU 0 and 1 over a PCIe switch.
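
The same conclusion can be read out of the matrix programmatically. This is a minimal sketch: the closeness ordering in LINK_RANK is my reading of the legend printed by nvidia-smi topo -m, the function name is hypothetical, and the sample text is the GPU part of the matrix posted above.

```python
# Lower rank = closer connection, following the legend of `nvidia-smi topo -m`.
LINK_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}

TOPO = """\
     GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 CPU Affinity NUMA Affinity
GPU0 X    NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB  PXB  PXB  0-27,56-83   0
GPU1 NV12 X    NV12 NV12 NV12 NV12 NV12 NV12 PXB  PXB  PXB  0-27,56-83   0
GPU2 NV12 NV12 X    NV12 NV12 NV12 NV12 NV12 NODE NODE NODE 0-27,56-83   0
GPU3 NV12 NV12 NV12 X    NV12 NV12 NV12 NV12 NODE NODE NODE 0-27,56-83   0
GPU4 NV12 NV12 NV12 NV12 X    NV12 NV12 NV12 SYS  SYS  SYS  28-55,84-111 1
GPU5 NV12 NV12 NV12 NV12 NV12 X    NV12 NV12 SYS  SYS  SYS  28-55,84-111 1
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X    NV12 SYS  SYS  SYS  28-55,84-111 1
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X    SYS  SYS  SYS  28-55,84-111 1
"""


def closest_gpus(topo_text, nic):
    """Return the GPUs with the closest link type to the given NIC column."""
    lines = [line.split() for line in topo_text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    col = header.index(nic) + 1  # +1 because data rows start with a row label
    links = {row[0]: row[col] for row in rows if row[0].startswith("GPU")}
    best = min(LINK_RANK[link] for link in links.values())
    return sorted(gpu for gpu, link in links.items() if LINK_RANK[link] == best)


print(closest_gpus(TOPO, "mlx5_0"))  # ['GPU0', 'GPU1']
```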

@Jack47

Jack47 commented Nov 22, 2022

There is the answer, it's connected to GPU 0 and 1 over a PCIe switch.

Regarding the proper algorithm for your use case: you might wanna use something like the 3-step algorithm for alltoall except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0) and (2) then GPU 0 sends cross-node data to local GPU 0 on other nodes. (3) Lastly, GPU 0 scatters the data to everyone. Thus, the name 3-step algorithm.

Cool! So can we simply modify CrossNodeGpus to use only local rank 0 to implement the algorithm you mentioned?

By the way, what's the difference between the two-step and three-step alltoall? As I understand it, two-step is: (1) gather cross-node traffic to a local GPU, and (2) that local GPU sends the cross-node data to its local GPU peers on the other nodes. So in two-step we don't group chunks into a larger one?

@saeedmaleki
Contributor

Cool! We can simply modify CrossNodeGpus to just use local rank 0 to implement the algo you mentioned?

Exactly! BTW, I am not sure it will work better for you; I am just going by intuition.

By the way, what's the difference between alltoall two step and three step? two step : (1) gather cross-node traffic to a local GPU and (2) then local GPU sends cross-node data to local GPU peers on other nodes. so in two step we don't group chunks into a larger one?

In the two-step algorithm, all cross-node traffic to, for example, local GPU 6 on node 0 goes through local GPU 6 on the other nodes. That is, local GPU i on node j aggregates the traffic from the other local GPUs on node j that needs to go to local GPU i on the other nodes.

In the three-step algorithm, the aggregation is more aggressive, and in our experiments it was only beneficial at 64-node scale and larger. In this algorithm, node i and node j communicate via only one local GPU on each (the CrossNodeGpus logic). In your topology, because you have only a single IB, the three-step algorithm with a more aggressive aggregation would make more sense. We will never know until we try!
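
The difference in aggregation can be made concrete by counting cross-node messages. The sketch below reflects my reading of the descriptions above, with one chunk per (source GPU, destination GPU) pair; the actual mscclang programs may split or pipeline the transfers differently, and the function name is hypothetical.

```python
def cross_node_messages(num_nodes, gpus_per_node, algorithm):
    """Return (number of cross-node messages, chunks per message)."""
    n, g = num_nodes, gpus_per_node
    if algorithm == "direct":
        # Every GPU sends one chunk to every GPU on every other node.
        return n * g * (n - 1) * g, 1
    if algorithm == "two_step":
        # Local GPU i sends one aggregate of g chunks to local GPU i on each other node.
        return n * g * (n - 1), g
    if algorithm == "three_step":
        # One aggregate of g * g chunks per ordered pair of nodes.
        return n * (n - 1), g * g
    raise ValueError(f"unknown algorithm: {algorithm}")


for name in ("direct", "two_step", "three_step"):
    count, chunks = cross_node_messages(8, 8, name)
    print(f"{name}: {count} messages of {chunks} chunk(s), {count * chunks} chunks in total")
```

All three variants move the same total number of chunks across the network; what changes is how many separate transfers share each node's IB link at once.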

Also, try NPKit profiler for further analysis of your algorithm!

@Jack47

Jack47 commented Nov 23, 2022

Many thanks for your help. We will use the NPKit profiler to get more information. BTW, after finding a way to improve all-to-all performance, we want to improve all-reduce latency at the same scale, so I think NPKit will help there too.

@Musisoul
Author

In your topology, because you have only a single IB, the three-step algorithm with a more aggressive aggregation would make more sense. We will never know until we try! Also, try NPKit profiler for further analysis of your algorithm!

Hi saeedmaleki, we have tried to modify the 3-step algo in this way:

[image: 1]

But the nccl-tests result is no better than NCCL's. We also profiled the algorithm with NPKit, but the resulting trace is hard to understand. Could you please have a look at it? We ran the test on 2 nodes (16 GPUs). Thank you!
trace.zip

@Musisoul
Author

NCCL gives you 2.15 but theoretically (based purely on bandwidth) you should be able to get around 3.4 and the reason for this gap is because of the suboptimal algorithm NCCL is using for your topology. Can you please share your 64xGPU results for a large range from 1KB to 4GB (or however large it can run)?

Here is the result of nccl-tests (using the original NCCL) on 64xGPU for a large range from 1KB to 4GB:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1024             4     float    none      -1   3343.1    0.00    0.00      0    560.6    0.00    0.00    N/A
        2048             8     float    none      -1    165.6    0.01    0.01      0    161.9    0.01    0.01    N/A
        4096            16     float    none      -1    160.8    0.03    0.03      0    162.2    0.03    0.02    N/A
        8192            32     float    none      -1    160.9    0.05    0.05      0    161.3    0.05    0.05    N/A
       16384            64     float    none      -1    160.9    0.10    0.10      0    160.3    0.10    0.10    N/A
       32768           128     float    none      -1    161.1    0.20    0.20      0    161.2    0.20    0.20    N/A
       65536           256     float    none      -1    173.3    0.38    0.37      0    181.3    0.36    0.36    N/A
      131072           512     float    none      -1    211.3    0.62    0.61      0    209.6    0.63    0.62    N/A
      262144          1024     float    none      -1    243.4    1.08    1.06      0    242.2    1.08    1.07    N/A
      524288          2048     float    none      -1    416.9    1.26    1.24      0    416.9    1.26    1.24    N/A
     1048576          4096     float    none      -1    821.2    1.28    1.26      0    815.6    1.29    1.27    N/A
     2097152          8192     float    none      -1   1601.1    1.31    1.29      0   1668.1    1.26    1.24    N/A
     4194304         16384     float    none      -1   3208.9    1.31    1.29      0   3200.8    1.31    1.29    N/A
     8388608         32768     float    none      -1   6545.6    1.28    1.26      0   6629.6    1.27    1.25    N/A
    16777216         65536     float    none      -1    12890    1.30    1.28      0    12827    1.31    1.29    N/A
    33554432        131072     float    none      -1    25914    1.29    1.27      0    25847    1.30    1.28    N/A
    67108864        262144     float    none      -1    52074    1.29    1.27      0    52211    1.29    1.27    N/A
   134217728        524288     float    none      -1   103897    1.29    1.27      0   103746    1.29    1.27    N/A
   268435456       1048576     float    none      -1   240116    1.12    1.10      0   206585    1.30    1.28    N/A
   536870912       2097152     float    none      -1   413198    1.30    1.28      0   475190    1.13    1.11    N/A
  1073741824       4194304     float    none      -1   823209    1.30    1.28      0   823260    1.30    1.28    N/A
  2147483648       8388608     float    none      -1  1647891    1.30    1.28      0  1646527    1.30    1.28    N/A
  4294967296      16777216     float    none      -1  3295211    1.30    1.28      0  3299375    1.30    1.28    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.872544

@saeedmaleki
Contributor

In your topology, because you have only a single IB, the three-step algorithm with a more aggressive aggregation would make more sense. We will never know until we try! Also, try NPKit profiler for further analysis of your algorithm!

Hi saeedmaleki, we have tried to modify the 3-step algo in this way:

[image: modified 3-step algorithm]

However, the nccl-tests result is no better than NCCL's. We also profiled the algorithm with NPKit, but the resulting trace is hard to understand. Could you please take a look at it? We ran the test on 2 nodes (16 GPUs). Thank you! trace.zip

This looks like a good algorithm. I don't expect it to do much differently for two nodes. But I believe you should see better results for larger scale.

Also, the trace file seems to be only for NCCL default algorithm. What size did you run to get this trace file?

@saeedmaleki
Contributor

NCCL gives you 2.15 but theoretically (based purely on bandwidth) you should be able to get around 3.4 and the reason for this gap is because of the suboptimal algorithm NCCL is using for your topology. Can you please share your 64xGPU results for a large range from 1KB to 4GB (or however large it can run)?

Here is the nccl-tests result (using the original NCCL) on 64 GPUs for sizes from 1KB to 4GB:

These numbers are a bit unexpected. Are you sure that your IB is 200Gbps? With 200Gbps, theoretically, you should be seeing 3.57GBps algBW. If we consider 90% efficiency of IB, we should be seeing around 3.21GBps and that is a 2.5x missing performance.

Can you please run AllGather on two nodes on a long range of sizes and share the numbers? That way we can be sure about the BW of the IB.
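
The 3.57GBps and 3.21GBps figures can be reproduced with a simple bandwidth model. The sketch below is a reconstruction, not something stated in the thread: it assumes one 200Gbps IB link per node is the only bottleneck, that the 8 GPUs on a node share it equally, and that each GPU sends the fraction of its buffer destined for other nodes (56/64 on 64 GPUs) over that link. The function name and parameters are illustrative.

```python
# Back-of-the-envelope bound on all-to-all algBW when the per-node IB link
# is the only bottleneck. The model itself is an assumption (see lead-in).

def alltoall_algbw_bound(ib_gbps, nodes, gpus_per_node, efficiency=1.0):
    """Upper bound on per-GPU algBW in GB/s under the shared-NIC model."""
    total_gpus = nodes * gpus_per_node
    ib_gbytes = ib_gbps / 8.0 * efficiency        # Gbit/s -> GB/s, derated
    per_gpu_share = ib_gbytes / gpus_per_node     # GPUs on a node share the NIC
    # Fraction of each GPU's buffer that must leave the node.
    remote_fraction = (total_gpus - gpus_per_node) / total_gpus
    return per_gpu_share / remote_fraction

ideal = alltoall_algbw_bound(200, nodes=8, gpus_per_node=8)
realistic = alltoall_algbw_bound(200, nodes=8, gpus_per_node=8, efficiency=0.9)
measured = 1.30  # large-message algBW from the 64-GPU table in this thread

print(f"ideal     : {ideal:.2f} GB/s")
print(f"90% of IB : {realistic:.2f} GB/s")
print(f"gap       : {realistic / measured:.1f}x")
```

Under these assumptions the model gives 3.57 GB/s at full line rate and 3.21 GB/s at 90% efficiency, which is about 2.5x the measured 1.30 GB/s.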

@saeedmaleki
Contributor

There is also an inconsistency with 1.30GBps algBW at 4GB input size on 64xGPUs. Earlier you got >2GBps for the same setting as I can see from the logs you shared. Right?

@Musisoul
Author

This looks like a good algorithm. I don't expect it to do much differently for two nodes. But I believe you should see better results for larger scale.
Also, the trace file seems to be only for NCCL default algorithm. What size did you run to get this trace file?

Well, we did use modified 3-step MSCCL algo because "Connected 1 MSCCL algorithm" was in the log. The size is only 256MiB.

@saeedmaleki
Contributor

This looks like a good algorithm. I don't expect it to do much differently for two nodes. But I believe you should see better results for larger scale.
Also, the trace file seems to be only for NCCL default algorithm. What size did you run to get this trace file?

Well, we did use modified 3-step MSCCL algo because "Connected 1 MSCCL algorithm" was in the log. The size is only 256MiB.

Then everything should have worked just fine. Can you please share your 3-step algorithm so that I can check it from my side? There might be some bug in MSCCL which we need to fix.

@Musisoul
Author

These numbers are a bit unexpected. Are you sure that your IB is 200Gbps? With 200Gbps, theoretically, you should be seeing 3.57GBps algBW. If we consider 90% efficiency of IB, we should be seeing around 3.21GBps and that is a 2.5x missing performance.
Can you please run AllGather on two nodes on a long range of sizes and share the numbers? That way we can be sure about the BW of the IB.

Here is the all-gather result of 2 nodes on a long range of sizes:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            16     float    none      -1    60.47    0.02    0.02      0    36.20    0.03    0.03      0
        2048            32     float    none      -1    36.32    0.06    0.05      0    36.16    0.06    0.05      0
        4096            64     float    none      -1    36.12    0.11    0.11      0    36.24    0.11    0.11      0
        8192           128     float    none      -1    36.64    0.22    0.21      0    36.48    0.22    0.21      0
       16384           256     float    none      -1    37.94    0.43    0.40      0    37.50    0.44    0.41      0
       32768           512     float    none      -1    40.81    0.80    0.75      0    40.58    0.81    0.76      0
       65536          1024     float    none      -1    48.38    1.35    1.27      0    48.37    1.35    1.27      0
      131072          2048     float    none      -1    55.21    2.37    2.23      0    50.25    2.61    2.45      0
      262144          4096     float    none      -1    137.2    1.91    1.79      0    137.3    1.91    1.79      0
      524288          8192     float    none      -1    148.1    3.54    3.32      0    146.8    3.57    3.35      0
     1048576         16384     float    none      -1    171.5    6.11    5.73      0    168.0    6.24    5.85      0
     2097152         32768     float    none      -1    204.1   10.27    9.63      0    203.8   10.29    9.65      0
     4194304         65536     float    none      -1    300.7   13.95   13.07      0    296.9   14.12   13.24      0
     8388608        131072     float    none      -1    489.4   17.14   16.07      0    488.3   17.18   16.10      0
    16777216        262144     float    none      -1    971.2   17.27   16.20      0    965.4   17.38   16.29      0
    33554432        524288     float    none      -1   2006.8   16.72   15.68      0   1962.3   17.10   16.03      0
    67108864       1048576     float    none      -1   3956.8   16.96   15.90      0   3924.3   17.10   16.03      0
   134217728       2097152     float    none      -1   7882.9   17.03   15.96      0   7843.1   17.11   16.04      0
   268435456       4194304     float    none      -1    15505   17.31   16.23      0    15366   17.47   16.38      0
   536870912       8388608     float    none      -1    31401   17.10   16.03      0    31144   17.24   16.16      0
  1073741824      16777216     float    none      -1    61484   17.46   16.37      0    59921   17.92   16.80      0
  2147483648      33554432     float    none      -1   120583   17.81   16.70      0   119265   18.01   16.88      0
  4294967296      67108864     float    none      -1   240501   17.86   16.74      0   235906   18.21   17.07      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.76973

all_gather_n16.log

@saeedmaleki
Contributor

These numbers are a bit unexpected. Are you sure that your IB is 200Gbps? With 200Gbps, theoretically, you should be seeing 3.57GBps algBW. If we consider 90% efficiency of IB, we should be seeing around 3.21GBps and that is a 2.5x missing performance.
Can you please run AllGather on two nodes on a long range of sizes and share the numbers? That way we can be sure about the BW of the IB.

Here is the all-gather result of 2 nodes on a long range of sizes:

OK great! This is >12.5GBps which means that your IB's bw is close to 25GBps (or 200Gbps). Therefore, your AllToAll can be hugely improved by utilizing the right algorithm. Please share your 3-step algorithm so that I can take a look :).

@Musisoul
Author

OK great! This is >12.5GBps which means that your IB's bw is close to 25GBps (or 200Gbps). Therefore, your AllToAll can be hugely improved by utilizing the right algorithm. Please share your 3-step algorithm so that I can take a look :).

Okay. Thanks!
alltoall_a100_three_step.py.zip

@Musisoul
Author

There is also an inconsistency with 1.30GBps algBW at 4GB input size on 64xGPUs. Earlier you got >2GBps for the same setting as I can see from the logs you shared. Right?

There may be some fluctuation, because we tested on a cluster and the machines were not fixed between runs. BTW, the IBs are not all 200Gbps; each node has 3 IBs. Here is the output of 'ibstatus' on one node:

[image: ibstatus output]

@saeedmaleki
Contributor

There is also an inconsistency with 1.30GBps algBW at 4GB input size on 64xGPUs. Earlier you got >2GBps for the same setting as I can see from the logs you shared. Right?

There may be some fluctuation, because we tested on a cluster and the machines were not fixed between runs. BTW, the IBs are not all 200Gbps; each node has 3 IBs. Here is the output of 'ibstatus' on one node:

[image: ibstatus output]

You need to look at the NCCL log (NCCL_DEBUG=INFO) to see which one it uses. My guess is that it will only use the fastest one, since I think (I could be wrong here) your PCIe bandwidth maxes out at 200Gbps.

@Musisoul
Author

You need to look at the NCCL log (NCCL_DEBUG=INFO) to see which one it uses. My guess is that it will only use the fastest one, since I think (I could be wrong here) your PCIe bandwidth maxes out at 200Gbps.

It should be using the fastest one (0), as the previous logs show:

SH-IDC1-10-142-5-97:73905:73905 [2] NCCL INFO Using network IB
SH-IDC1-10-142-5-97:73910:73910 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE ; OOB ib0:10.142.205.97<0>
SH-IDC1-10-142-6-127:49330:49330 [6] NCCL INFO Using network IB
SH-IDC1-10-142-6-127:49324:49324 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE ; OOB ib0:10.142.206.127<0>
...
SH-IDC1-10-142-5-203:76752:77230 [7] NCCL INFO Channel 01 : 23[dc000] -> 24[41000] [send] via NET/IB/0/Shared
SH-IDC1-10-142-5-203:76745:77233 [0] NCCL INFO Channel 01 : 15[dc000] -> 16[51000] [receive] via NET/IB/0/Shared
SH-IDC1-10-142-5-203:76746:77234 [1] NCCL INFO Channel 01 : 17[56000] -> 18[6b000] via P2P/IPC/read
SH-IDC1-10-142-5-203:76751:77235 [6] NCCL INFO Channel 00 : 22[d9000] -> 24[41000] [send] via NET/IB/0/Shared
SH-IDC1-10-142-5-203:76746:77234 [1] NCCL INFO Channel 00 : 15[dc000] -> 17[56000] [receive] via NET/IB/0/Shared
...
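
For a full log, the same check can be automated. The following is a minimal sketch that counts which IB device index appears on the `[send]`/`[receive]` channel lines; the sample text is copied from the excerpt above, and the helper name `count_ib_devices` is illustrative. Reading the number after `NET/IB/` as the bracketed index from the `NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE` line is an interpretation of the log format, not something confirmed in this thread.

```python
import re
from collections import Counter

# Channel lines copied from the log excerpt above.
SAMPLE_LOG = """\
SH-IDC1-10-142-5-203:76752:77230 [7] NCCL INFO Channel 01 : 23[dc000] -> 24[41000] [send] via NET/IB/0/Shared
SH-IDC1-10-142-5-203:76745:77233 [0] NCCL INFO Channel 01 : 15[dc000] -> 16[51000] [receive] via NET/IB/0/Shared
SH-IDC1-10-142-5-203:76746:77234 [1] NCCL INFO Channel 01 : 17[56000] -> 18[6b000] via P2P/IPC/read
SH-IDC1-10-142-5-203:76751:77235 [6] NCCL INFO Channel 00 : 22[d9000] -> 24[41000] [send] via NET/IB/0/Shared
SH-IDC1-10-142-5-203:76746:77234 [1] NCCL INFO Channel 00 : 15[dc000] -> 17[56000] [receive] via NET/IB/0/Shared
"""

# Matches inter-node channel lines and captures the device index.
NET_IB = re.compile(r"\[(?:send|receive)\] via NET/IB/(\d+)")

def count_ib_devices(log_text):
    """Return a Counter mapping IB device index -> number of channel lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NET_IB.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

usage = count_ib_devices(SAMPLE_LOG)
print(usage)
```

On this excerpt every inter-node channel line reports index 0; the `P2P/IPC/read` line is intra-node and is ignored.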

@Jack47

Jack47 commented Dec 2, 2022

Therefore, your AllToAll can be hugely improved by utilizing the right algorithm.

Wow, great news. Could you give us some hints on it? We want to use smaller chunks over IB to make it faster, but it seems MSCCL currently uses fixed chunks in all-to-all?

@Jack47

Jack47 commented Dec 11, 2022

@saeedmaleki long time no see!
Could you give us some hints on this? We really need to improve all-to-all latency; it currently accounts for 78% of the time in one MoE layer.

@saeedmaleki
Contributor

Hi @Jack47, I took a look at your algorithm and it seems that the local node communication has too many steps which can be optimized. To evaluate this, I would suggest two things:
(1) try to get NPKit to generate the json files and generate a trace so that we can study where your performance is going. I am not entirely sure why you couldn't get it to work, if there is a bug, please let us know. You could first look at 2-node alltoall with MSCCL and see if that makes a difference. Note that NPKit generates really large log files, so please be mindful and do not run nccl-tests with too many iterations.
(2) trying all-to-all with one GPU per node on 8 nodes. You could use default NCCL's algorithm to study the performance. That gives us a good idea if the cross-node communication has any bottlenecks.

Do these steps make sense?

@Jack47

Jack47 commented Dec 21, 2022

Hi @saeedmaleki, many thanks for your advice. Here is the two-node NPKit result: https://github.com/Musisoul/NPKIT-results/blob/main/trace.zip. We will try one GPU per node on 8 nodes.

@saeedmaleki
Contributor

It seems that this is still using default NCCL algorithm. Did you try the algorithm you developed for this run?

@Musisoul
Author

It seems that this is still using default NCCL algorithm. Did you try the algorithm you developed for this run?

Well, we did use modified 3-step MSCCL algo because "Connected 1 MSCCL algorithm" was in the log. The size is only 256MiB.

Maybe we should try again, but the previous NPKit result was generated with the modified 3-step algorithm. We will also try one GPU per node on 8 nodes. These tests may take some time.

@Musisoul
Author

Musisoul commented Jan 7, 2023

Hi @Jack47, I took a look at your algorithm and it seems that the local node communication has too many steps which can be optimized. To evaluate this, I would suggest two things: (1) try to get NPKit to generate the json files and generate a trace so that we can study where your performance is going. I am not entirely sure why you couldn't get it to work, if there is a bug, please let us know. You could first look at 2-node alltoall with MSCCL and see if that makes a difference. Note that NPKit generates really large log files, so please be mindful and do not run nccl-tests with too many iterations. (2) trying all-to-all with one GPU per node on 8 nodes. You could use default NCCL's algorithm to study the performance. That gives us a good idea if the cross-node communication has any bottlenecks.

Do these steps make sense?

Sorry for not replying for a long time. We have tried all-to-all test with one GPU per node on 8 nodes. Here is the result:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s) 
     1048576         32768     float    none      -1   2522.4    0.42    0.36      0    149.6    7.01    6.13    N/A
     2097152         65536     float    none      -1    249.5    8.40    7.35      0    248.3    8.45    7.39    N/A
     4194304        131072     float    none      -1    453.7    9.24    8.09      0    427.6    9.81    8.58    N/A
     8388608        262144     float    none      -1    790.4   10.61    9.29      0   1273.5    6.59    5.76    N/A
    16777216        524288     float    none      -1   1511.7   11.10    9.71      0   1597.4   10.50    9.19    N/A
    33554432       1048576     float    none      -1   3115.8   10.77    9.42      0   3281.4   10.23    8.95    N/A
    67108864       2097152     float    none      -1   6121.0   10.96    9.59      0   6137.1   10.93    9.57    N/A
   134217728       4194304     float    none      -1    12126   11.07    9.69      0    12267   10.94    9.57    N/A
   268435456       8388608     float    none      -1    24781   10.83    9.48      0    23934   11.22    9.81    N/A
   536870912      16777216     float    none      -1    48064   11.17    9.77      0    47966   11.19    9.79    N/A
  1073741824      33554432     float    none      -1    95345   11.26    9.85      0    95303   11.27    9.86    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.5103

The log is attached below:
test-8node8gpu.log

We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip

We found that when using tools/npkit_trace_generator.py to generate json from npkit_dump_dir, there were some bugs in the python script:

Traceback (most recent call last):
  File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 232, in <module>
    convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
  File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 211, in convert_npkit_dump_to_trace
    gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
  File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 117, in parse_gpu_event_file
    gpu_events[-1]['args']['bw (GB/s)'] = gpu_events[-1]['args']['size'] / delta_time / 1e3
ZeroDivisionError: float division by zero

Traceback (most recent call last):
  File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 233, in <module>
    convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
  File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 212, in convert_npkit_dump_to_trace
    gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
  File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 116, in parse_gpu_event_file
    delta_time = gpu_events[-1]['ts'] - gpu_events[-2]['ts']
IndexError: list index out of range

Traceback (most recent call last):
  File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 235, in <module>
    convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
  File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 214, in convert_npkit_dump_to_trace
    gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
  File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 95, in parse_gpu_event_file
    'ts': curr_cpu_base_time + parsed_gpu_event['timestamp'] / gpu_clock_scale - curr_gpu_base_time,
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'

We worked around these bugs and generated the JSON. Previously you said "it seems that this is still using default NCCL algorithm"; could these bugs have affected the results? Thank you!
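
For reference, the three tracebacks point at three unguarded cases: a zero time delta, fewer than two events in the list, and a base timestamp that is still `None`. The sketch below shows one way such guards could look. It is not the actual workaround used here and not the structure of `npkit_trace_generator.py`; the helper names `add_bandwidth` and `event_timestamp` are hypothetical, and only the two expressions taken from the tracebacks come from the script. Whether silently skipping such events distorts the resulting trace is an open question.

```python
def add_bandwidth(gpu_events):
    """Annotate the last event with 'bw (GB/s)' when it can be computed.

    Returns True if the bandwidth was written, False if the event was skipped.
    """
    if len(gpu_events) < 2:
        return False  # would raise IndexError on gpu_events[-2]
    delta_time = gpu_events[-1]['ts'] - gpu_events[-2]['ts']
    if delta_time <= 0:
        return False  # would raise ZeroDivisionError
    # Same expression as in the traceback.
    gpu_events[-1]['args']['bw (GB/s)'] = gpu_events[-1]['args']['size'] / delta_time / 1e3
    return True

def event_timestamp(cpu_base_time, gpu_base_time, raw_timestamp, gpu_clock_scale):
    """Convert a raw GPU timestamp, or return None if no base time is known yet."""
    if cpu_base_time is None or gpu_base_time is None:
        return None  # would raise TypeError: NoneType + float
    return cpu_base_time + raw_timestamp / gpu_clock_scale - gpu_base_time

# Tiny usage example with made-up values.
events = [{'ts': 0.0, 'args': {}}, {'ts': 2.0, 'args': {'size': 4000}}]
ok = add_bandwidth(events)
print(ok, events[-1]['args'])
```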

@Musisoul
Author

Musisoul commented Jan 7, 2023

OK great! This is >12.5GBps which means that your IB's bw is close to 25GBps (or 200Gbps). Therefore, your AllToAll can be hugely improved by utilizing the right algorithm. Please share your 3-step algorithm so that I can take a look :).

Okay. Thanks! alltoall_a100_three_step.py.zip

We previously shared the modified three-step algorithm with you. How does it perform on your machines? Does the code need improvement, or does it have bugs? We look forward to your reply. Thank you!

@Musisoul
Author

Musisoul commented Jan 9, 2023

We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip

Update: That file may be too large for the trace viewer to open, so we reran the modified three-step algorithm with fewer iterations and got this result:
npkit_event_trace_20230108.zip

@saeedmaleki
Contributor

Hi @Jack47, I took a look at your algorithm and it seems that the local node communication has too many steps which can be optimized. To evaluate this, I would suggest two things: (1) try to get NPKit to generate the json files and generate a trace so that we can study where your performance is going. I am not entirely sure why you couldn't get it to work, if there is a bug, please let us know. You could first look at 2-node alltoall with MSCCL and see if that makes a difference. Note that NPKit generates really large log files, so please be mindful and do not run nccl-tests with too many iterations. (2) trying all-to-all with one GPU per node on 8 nodes. You could use default NCCL's algorithm to study the performance. That gives us a good idea if the cross-node communication has any bottlenecks.
Do these steps make sense?

Sorry for not replying for a long time. We have tried all-to-all test with one GPU per node on 8 nodes. Here is the result:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s) 
     1048576         32768     float    none      -1   2522.4    0.42    0.36      0    149.6    7.01    6.13    N/A
     2097152         65536     float    none      -1    249.5    8.40    7.35      0    248.3    8.45    7.39    N/A
     4194304        131072     float    none      -1    453.7    9.24    8.09      0    427.6    9.81    8.58    N/A
     8388608        262144     float    none      -1    790.4   10.61    9.29      0   1273.5    6.59    5.76    N/A
    16777216        524288     float    none      -1   1511.7   11.10    9.71      0   1597.4   10.50    9.19    N/A
    33554432       1048576     float    none      -1   3115.8   10.77    9.42      0   3281.4   10.23    8.95    N/A
    67108864       2097152     float    none      -1   6121.0   10.96    9.59      0   6137.1   10.93    9.57    N/A
   134217728       4194304     float    none      -1    12126   11.07    9.69      0    12267   10.94    9.57    N/A
   268435456       8388608     float    none      -1    24781   10.83    9.48      0    23934   11.22    9.81    N/A
   536870912      16777216     float    none      -1    48064   11.17    9.77      0    47966   11.19    9.79    N/A
  1073741824      33554432     float    none      -1    95345   11.26    9.85      0    95303   11.27    9.86    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.5103

This is your key results pinpointing the problem. You are getting 9.86 GBps busBW which is a good 2x off. it should have been ~ 22-25 GBps which is your IB's BW. You might have a bad node in the system. I suggest narrowing down the experiment to 2 and 4 GPUs per experiment to find the problematic node. As far as I remember your AllGather result had great numbers, so it seems like something is off.

Without fixing this issue, the maximum busBW you may get on 64 GPUs is 9.85/8 ≈ 1.23 GBps, which is way below what it needs to be.
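The bound above is plain division, and the same arithmetic can be applied to the healthy 22-25 GBps range for comparison. The short Python sketch below does that. The divisor of 8 is taken from the comment; reading it as 8 GPUs sharing one node's measured bandwidth equally is an assumption, and the function name is made up for illustration.

```python
def per_gpu_busbw(shared_busbw_gbps, sharers):
    """Evenly split a shared bandwidth among `sharers` (assumes equal shares)."""
    return shared_busbw_gbps / sharers


SHARERS = 8  # divisor used in the comment; assumed to mean 8 GPUs per node

measured = per_gpu_busbw(9.85, SHARERS)      # bound from the measured busBW
healthy_low = per_gpu_busbw(22.0, SHARERS)   # bound if IB delivered 22 GBps
healthy_high = per_gpu_busbw(25.0, SHARERS)  # bound if IB delivered 25 GBps

print(f"measured bound: {measured:.2f} GBps")
print(f"healthy range:  {healthy_low:.2f} to {healthy_high:.3f} GBps")
```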

Please let me know of your investigation and we can find the problematic node.

@saeedmaleki
Contributor

We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip

Update: This file may be too large for the trace viewer to open, so we retested the modified three-step algo with fewer iterations and got this result: npkit_event_trace_20230108.zip

I couldn't open the zip file. Could you please reupload it?

@Musisoul
Author

We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip

Update: This file may be too large for the trace viewer to open, so we retested the modified three-step algo with fewer iterations and got this result: npkit_event_trace_20230108.zip

I couldn't open the zip file. Could you please reupload it?

I can unzip the npkit_event_trace_20230108.zip, so should I upload the json directly?

@liecn

liecn commented Feb 11, 2023

#48 (comment)
When I use the forked repo and run
"make MPI=1 NCCL_HOME=../msccl/build/ -j",
why do I get this error?
'alltoall.cu(72): error: identifier "ncclAllToAll" is undefined'

It seems nccl-tests doesn't pick up msccl correctly.

@saeedmaleki
Contributor

We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip

Update: This file may be too large for the trace viewer to open, so we retested the modified three-step algo with fewer iterations and got this result: npkit_event_trace_20230108.zip

I couldn't open the zip file. Could you please reupload it?

I can unzip the npkit_event_trace_20230108.zip, so should I upload the json directly?

Oops sorry I dropped the ball on this one. Yes please

@saeedmaleki
Contributor

#48 (comment) When I use the forked repo and run "make MPI=1 NCCL_HOME=../msccl/build/ -j'", why do I get the error as 'alltoall.cu(72): error: identifier "ncclAllToAll" is undefined'.

Seems nccl-test doesn't refer to msccl correctly.

I suggest giving it an absolute path instead of a relative one.

@baymaxhuang

There is the answer, it's connected to GPU 0 and 1 over a PCIe switch.

Regarding the proper algorithm for your use case: you might wanna use something like the 3-step algorithm for alltoall except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0) and (2) then GPU 0 sends cross-node data to local GPU 0 on other nodes. (3) Lastly, GPU 0 scatters the data to everyone. Thus, the name 3-step algorithm.

Cool! We can simply modify CrossNodeGpus to just use local rank 0 to implement the algo you mentioned?

Exactly! BTW, I am not sure if it will work better for you, but just judging from my intuitions.

By the way, what's the difference between the alltoall two-step and three-step algorithms? Two-step: (1) gather cross-node traffic to a local GPU and (2) then the local GPU sends cross-node data to its local GPU peers on other nodes. So in two-step we don't group chunks into a larger one?

So, in the two-step algorithm, all of the cross-node traffic, for example, to local GPU 6 on node 0 is via local GPU 6 on other nodes. So, local GPU i on node j aggregates the traffic from other local GPUs on node j that needs to go to other local GPU i on other nodes.

Hi, I see that since NCCL 2.12 the default NCCL primitives use PXN to optimize cross-node traffic. PXN also aggregates cross-node messages to the same local GPU and then sends them over shared connections. So, is there any difference between the MSCCL two-step algorithm and the PXN optimization in NCCL?

In the three-step algorithm, the aggregation is more aggressive, and it was only beneficial in our experiments at 64-node scale and larger. In this algorithm, node i and node j communicate via only one local GPU on each (the CrossNodeGpus logic). In your topology, because you have only a single IB, the three-step algorithm with its more aggressive aggregation would make more sense. We will never know until we try!

Also, try NPKit profiler for further analysis of your algorithm!
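To make the two-step and three-step routing described above concrete, here is a minimal Python sketch that lists the hops a chunk takes under each scheme and counts the distinct cross-node GPU pairs that result. It is an illustration under stated assumptions, not the msccl-tools implementation: all function names are made up, and the default relay choice (always local GPU 0) follows the single-IB variant suggested in this thread rather than the actual CrossNodeGpus logic.

```python
GPUS_PER_NODE = 8  # matches the 8-GPU nodes discussed in this thread


def rank_of(node, local_gpu):
    """Global rank of `local_gpu` on `node`."""
    return node * GPUS_PER_NODE + local_gpu


def compact(path):
    """Drop hops where a chunk would be 'sent' to the GPU it is already on."""
    out = [path[0]]
    for rank in path[1:]:
        if rank != out[-1]:
            out.append(rank)
    return out


def direct_path(src, dst):
    """Plain all-to-all: every GPU sends straight to every other GPU."""
    return compact([src, dst])


def two_step_path(src, dst):
    """Gather on the local GPU that shares the destination's local index,
    then send across nodes to the destination itself."""
    src_node = src // GPUS_PER_NODE
    dst_node, dst_local = divmod(dst, GPUS_PER_NODE)
    if src_node == dst_node:
        return compact([src, dst])
    relay = rank_of(src_node, dst_local)
    return compact([src, relay, dst])


def three_step_path(src, dst, relay_gpu=lambda node, remote_node: 0):
    """Gather on one relay GPU, cross nodes relay-to-relay, then scatter.
    `relay_gpu` is a hypothetical stand-in for the relay selection; the
    default always picks local GPU 0 (the single-IB variant)."""
    src_node = src // GPUS_PER_NODE
    dst_node = dst // GPUS_PER_NODE
    if src_node == dst_node:
        return compact([src, dst])
    send_relay = rank_of(src_node, relay_gpu(src_node, dst_node))
    recv_relay = rank_of(dst_node, relay_gpu(dst_node, src_node))
    return compact([src, send_relay, recv_relay, dst])


def cross_node_links(path_fn, num_nodes):
    """Count distinct directed (sender, receiver) GPU pairs that cross nodes."""
    links = set()
    num_ranks = num_nodes * GPUS_PER_NODE
    for src in range(num_ranks):
        for dst in range(num_ranks):
            path = path_fn(src, dst)
            for a, b in zip(path, path[1:]):
                if a // GPUS_PER_NODE != b // GPUS_PER_NODE:
                    links.add((a, b))
    return len(links)


print(two_step_path(3, 13))    # [3, 5, 13]
print(three_step_path(3, 13))  # [3, 0, 8, 13]
for name, fn in [("direct", direct_path),
                 ("two-step", two_step_path),
                 ("three-step", three_step_path)]:
    print(name, cross_node_links(fn, num_nodes=2))
# direct 128, two-step 16, three-step 2
```

The sketch only counts connections. It says nothing about achieved bandwidth, which also depends on chunk sizes, NIC placement, and the protocol in use.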

@saeedmaleki
Contributor

From our experience, there are still differences at 64 GPUs and up, but PXN does a pretty good job as well. But after 1024 GPUs, you will need to switch to the 3-step algorithm. I think PXN is disabled after a certain number of GPUs.

@baymaxhuang

baymaxhuang commented Jan 3, 2024

From our experience, there are still differences at 64 GPUs and up but PXN does a pretty good job as well. But after 1024 GPUs, you will need to switch to the 3-step algorithm. I think PXN is disabled after certain number of GPUs.

If I understand correctly, in the 3-step algorithm only one GPU's NIC is used to exchange cross-node traffic. If one node has 8 GPUs and 8 NICs, the available bandwidth would drop to 1/8. Could you share more about why the 3-step algorithm improves performance beyond 1024 GPUs?

@saeedmaleki
Contributor

From our experience, there are still differences at 64 GPUs and up but PXN does a pretty good job as well. But after 1024 GPUs, you will need to switch to the 3-step algorithm. I think PXN is disabled after certain number of GPUs.

If I understand correctly, in 3-step algorithm, only one GPU's NIC will be used to exchange cross-node traffic. If one node has 8GPU and 8 NIC,the available bandwidth will be decreased to 1/8. Could you show more experience why 3-step algorithm could improve performance after 1024 GPUs?

Not exactly. Imagine you have 9 nodes in total. On node-0, NIC-0 will be talking to node-1, NIC-1 will be talking to node-2, ..., and NIC-7 will be talking to node-8. So, you can think of the first step as a local gather, the second step as cross-node communication, and the last step as a local scatter. This reduces the number of cross-node connections by another 8x over the 2-step algorithm.

In general, for a perfect load balance, we need 8K+1 nodes for the 3-step algorithm. However, at scale (1024 GPUs, for example), the load imbalance is not too bad.
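The 8K+1 condition can be checked with a few lines of Python. The sketch below counts how many remote nodes each local GPU (and its NIC) on node 0 would serve. It assumes a round-robin assignment that reproduces the 9-node example above (NIC-0 to node-1, ..., NIC-7 to node-8). The function names are hypothetical, and the real CrossNodeGpus mapping in msccl-tools may differ.

```python
GPUS_PER_NODE = 8


def relay_gpu(node, remote_node, num_nodes):
    """Assumed round-robin relay choice: the k-th peer of `node`
    (counting upward from node + 1, wrapping around) uses local GPU k % 8."""
    peer_index = (remote_node - node - 1) % num_nodes
    return peer_index % GPUS_PER_NODE


def peers_per_relay(num_nodes, node=0):
    """How many remote nodes each local GPU on `node` talks to."""
    loads = [0] * GPUS_PER_NODE
    for remote_node in range(num_nodes):
        if remote_node != node:
            loads[relay_gpu(node, remote_node, num_nodes)] += 1
    return loads


for num_nodes in (9, 12, 128):
    loads = peers_per_relay(num_nodes)
    print(num_nodes, loads, "max/min =", max(loads) / min(loads))
# 9   -> every GPU serves 1 peer (perfectly balanced, 9 = 8*1 + 1)
# 12  -> [2, 2, 2, 1, 1, 1, 1, 1]
# 128 -> seven GPUs serve 16 peers and one serves 15
```

With N nodes there are N-1 peers to spread over 8 GPUs, so the split is even exactly when N-1 is a multiple of 8. At 128 nodes (1024 GPUs) the busiest relay serves 16 peers against 15 for the lightest, while at 12 nodes the ratio is 2 to 1.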

@baymaxhuang

From our experience, there are still differences at 64 GPUs and up but PXN does a pretty good job as well. But after 1024 GPUs, you will need to switch to the 3-step algorithm. I think PXN is disabled after certain number of GPUs.

Is there any experiment showing that the 3-step algorithm performs better than PXN at a certain number of GPUs or message size? I ran an experiment with 96 GPUs (12 nodes * 8 GPUs), and the result shows that the 3-step algorithm performs much worse than the native alltoall implemented with send/recv.

Native alltoall with PXN:

image

3-step algorithm with PXN disabled:
image

3-step algorithm MSCCL_XML_FILES:

three_step_96.txt
