
[CUDA] Multi-GPU and distributed training for new CUDA version. #5076

Open
Tracked by #5153
shiyu1994 opened this issue Mar 15, 2022 · 1 comment

@shiyu1994
Collaborator

Summary

Add multi-GPU and distributed training support for the new CUDA version, as mentioned in #4630 (comment).

@flybywind

flybywind commented Sep 20, 2024

Hi @shiyu1994, I find that the latest 4.5.0 has no support for multi-GPU training. If you try, you get the error: "Currently cuda version only supports training on a single GPU".
Then I found the nccl-dev branch you're working on; when can you release it?
Anyway, I tried to run it in my environment:

NVIDIA-Tesla P100, Driver Version: 470.82.01 CUDA Version: 11.4
NCCL: nccl_2.11.4-1+cuda11.4_x86_64
OS: Debian bullseye

When I set num_gpus = 1, it works well. But when I set it to 2, both GPUs allocate memory, yet only one runs at 100% utilization, and training never seems to finish. Any clue?
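
For reference, this is roughly how I invoke training (a minimal sketch, not my real script; the synthetic data and most parameters are placeholders, and num_gpus is the parameter name as I use it with my nccl-dev build):

import numpy as np
import lightgbm as lgb

# Small synthetic regression set, just enough to exercise the CUDA code path.
X = np.random.rand(100, 32).astype(np.float32)
y = np.random.rand(100).astype(np.float32)

params = {
    "objective": "regression",
    "device_type": "cuda",  # new CUDA implementation
    "num_gpus": 2,          # 1 works fine; 2 reproduces the hang described above
    "num_leaves": 31,
    "verbose": 1,
}

booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)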
From the output of nccl-tests, the GPUs appear to be connected properly:

$ build/all_reduce_perf -b 2m -e 100m -f 10 -g2
# nThread 1 nGpus 2 minBytes 2097152 maxBytes 104857600 step: 10(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    294 on dlc1ovtssn2jsl15-master-0 device  0 [0x00] Tesla P100-PCIE-16GB
#  Rank  1 Group  0 Pid    294 on dlc1ovtssn2jsl15-master-0 device  1 [0x00] Tesla P100-PCIE-16GB
NCCL version 2.11.4+cuda11.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     2097152        524288     float     sum      -1    539.7    3.89    3.89      0    537.7    3.90    3.90      0
    20971520       5242880     float     sum      -1   5473.1    3.83    3.83      0   5470.2    3.83    3.83      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.86278 

And here are the NCCL logs:

[0] NCCL INFO Bootstrap : Using eth0:10.224.144.56<0>
[0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[0] NCCL INFO Failed to open libibverbs.so[.1]
[0] NCCL INFO NET/Socket : Using [0]eth0:10.224.144.56<0> [1]eth1:10.252.7.41<0>
[0] NCCL INFO Using network Socket
[0] NCCL INFO NCCL version 2.11.4+cuda11.4
[0] NCCL INFO Channel 00/02 : 0 1
[0] NCCL INFO Channel 01/02 : 0 1
[1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
[0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
[0] NCCL INFO Channel 00 : 0[80] -> 1[90] via direct shared memory
[1] NCCL INFO Channel 00 : 1[90] -> 0[80] via direct shared memory
[0] NCCL INFO Channel 01 : 0[80] -> 1[90] via direct shared memory
[1] NCCL INFO Channel 01 : 1[90] -> 0[80] via direct shared memory
[0] NCCL INFO Connected all rings
[1] NCCL INFO Connected all rings
[0] NCCL INFO Connected all trees
[1] NCCL INFO Connected all trees
[1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
[0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
[0] NCCL INFO comm 0x7f2d44054230 rank 0 nranks 2 cudaDev 0 busId 80 - Init COMPLETE
[1] NCCL INFO comm 0x7f2d24056d60 rank 1 nranks 2 cudaDev 1 busId 90 - Init COMPLETE
[0] NCCL INFO Launch mode Parallel
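
(For the record, the logs above were collected by enabling NCCL's verbose logging from Python before training starts; NCCL_DEBUG is a standard NCCL environment variable, the subsystem filter is optional:)

import os

# Must be set before the first NCCL communicator is created, i.e. before training begins.
os.environ["NCCL_DEBUG"] = "INFO"
# os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # optionally restrict output to specific subsystems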

And the full stack trace looks like this:

#0  0x00007ffff7fd0abc in clock_gettime ()
#1  0x00007ffff7981121 in __GI___clock_gettime (clock_id=4, tp=0x7fffffffcd10) at ../sysdeps/unix/sysv/linux/clock_gettime.c:38
#2  0x00007fff0aa0b0af in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fff0a9310a3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fff0a8d31cf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fff0a8d4818 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fff0a9c096a in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fff855517c9 in __cudart1044 () from /usr/local/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.so
#8  0x00007fff855862e5 in cudaDeviceSynchronize () from /usr/local/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.so
#9  0x00007fff8532a92b in LightGBM::SynchronizeCUDADevice (file=0x7fff855ea2d8 "/mnt/workspace/lgbm-cuda-install/LightGBM/src/io/cuda/cuda_tree.cu", line=454)
    at /mnt/workspace/lgbm-cuda-install/LightGBM/src/cuda/cuda_utils.cpp:13
#10 0x00007fff853469d3 in LightGBM::CUDATree::LaunchAddPredictionToScoreKernel (this=0x7ffef4fdb8b0, data=0x55555c62caa0, used_data_indices=0x0, num_data=100, score=0x7ffeb3f64200)
    at /mnt/workspace/lgbm-cuda-install/LightGBM/src/io/cuda/cuda_tree.cu:454
#11 0x00007fff85344c90 in LightGBM::CUDATree::AddPredictionToScore (this=0x7ffef4fdb8b0, data=0x55555c62caa0, num_data=100, score=0x7ffeb3f64200)
    at /mnt/workspace/lgbm-cuda-install/LightGBM/src/io/cuda/cuda_tree.cpp:256
#12 0x00007fff85316f21 in LightGBM::CUDAScoreUpdater::AddScore (this=0x555569b3ea50, tree=0x7ffef4fdb8b0, cur_tree_id=0)
    at /mnt/workspace/lgbm-cuda-install/LightGBM/src/boosting/cuda/cuda_score_updater.cpp:58
#13 0x00007fff8531ca43 in LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}::operator()(LightGBM::NCCLGBDTComponent*) const (
    this=0x55555b24c200, thread_data=0x5555582d6cc0) at /mnt/workspace/lgbm-cuda-install/LightGBM/src/boosting/cuda/nccl_gbdt.cpp:129
#14 0x00007fff85322216 in std::__invoke_impl<void, LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}&, LightGBM::NCCLGBDTComponent*>(std::__invoke_other, LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}&, LightGBM::NCCLGBDTComponent*&&) (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60
#15 0x00007fff85320dc6 in std::__invoke_r<void, LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}&, LightGBM::NCCLGBDTComponent*>(LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}&, LightGBM::NCCLGBDTComponent*&&) (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#16 0x00007fff8531f6b7 in std::_Function_handler<void (LightGBM::NCCLGBDTComponent*), LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter(float const*, float const*)::{lambda(LightGBM::NCCLGBDTComponent*)#4}>::_M_invoke(std::_Any_data const&, LightGBM::NCCLGBDTComponent*&&) (__functor=..., __args#0=@0x7fffffffd4d0: 0x5555582d6cc0) at /usr/include/c++/10/bits/std_function.h:291
#17 0x00007fff8531f161 in std::function<void (LightGBM::NCCLGBDTComponent*)>::operator()(LightGBM::NCCLGBDTComponent*) const (this=0x7fffffffd670, __args#0=0x5555582d6cc0)
    at /usr/include/c++/10/bits/std_function.h:622
#18 0x00007fff8531e1f5 in LightGBM::NCCLTopology::RunOnMasterDevice<LightGBM::NCCLGBDTComponent, void>(std::vector<std::unique_ptr<LightGBM::NCCLGBDTComponent, std::default_delete<LightGBM::NCCLGBDTComponent> >, std::allocator<std::unique_ptr<LightGBM::NCCLGBDTComponent, std::default_delete<LightGBM::NCCLGBDTComponent> > > > const&, std::function<void (LightGBM::NCCLGBDTComponent*)> const&) (this=0x555558d53cb0, 
    objs=std::vector of length 2, capacity 2 = {...}, func=...) at /mnt/workspace/lgbm-cuda-install/LightGBM/include/LightGBM/cuda/cuda_nccl_topology.hpp:185
#19 0x00007fff8531b30b in LightGBM::NCCLGBDT<LightGBM::GBDT>::TrainOneIter (this=0x55555b24c200, gradients=0x0, hessians=0x0) at /mnt/workspace/lgbm-cuda-install/LightGBM/src/boosting/cuda/nccl_gbdt.cpp:124
#20 0x00007fff84ce7b4e in LightGBM::Booster::TrainOneIter (this=0x55555b6fe030) at /mnt/workspace/lgbm-cuda-install/LightGBM/src/c_api.cpp:407
#21 0x00007fff84cd3f70 in LGBM_BoosterUpdateOneIter (handle=0x55555b6fe030, is_finished=0x7fff82b14130) at /mnt/workspace/lgbm-cuda-install/LightGBM/src/c_api.cpp:2070
