Hi all, I was running the Makefile for the first time, but found that it was failing with this message:
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✗ MPI not found
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DMULTI_GPU train_gpt2_fp32.cu -lcublas -lcublasLt -lnvidia-ml -lnccl -o train_gpt2fp32cu
train_gpt2_fp32.cu(62): warning #550-D: variable "cublas_compute_type" was set but never used
/usr/bin/ld: cannot find -lnccl: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [Makefile:277: train_gpt2fp32cu] Error 255
It turns out the Makefile checks whether NCCL is installed by grepping the output of dpkg -l for the string nccl. This gives a false positive whenever dpkg lists any package whose name contains the substring nccl, such as "libvncclient1" in my case. Here's the actual code causing the issue:
# Check if NCCL is available, include if so, for multi-GPU training
ifeq ($(NO_MULTI_GPU), 1)
$(info → Multi-GPU (NCCL) is manually disabled)
else
ifneq ($(OS), Windows_NT)
# Detect if running on macOS or Linux
ifeq ($(SHELL_UNAME), Darwin)
$(info ✗ Multi-GPU on CUDA on Darwin is not supported, skipping NCCL support)
else ifeq ($(shell dpkg -l | grep -q nccl && echo "exists"), exists)
$(info ✓ NCCL found, OK to train with multiple GPUs)
NVCC_FLAGS += -DMULTI_GPU
NVCC_LDLIBS += -lnccl
else
$(info ✗ NCCL is not found, disabling multi-GPU support)
$(info ---> On Linux you can try install NCCL with `sudo apt install libnccl2 libnccl-dev`)
endif
endif
endif
If I have some free time, I think this would be a fun first issue and I'd be glad to contribute, but if anyone knows the fix off the top of their head, that would be nice as well!
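For reference, here is a minimal sketch of how the check could be tightened. It is not a tested patch against the Makefile. The sample dpkg -l rows are hypothetical (the version strings are made up), and the function names loose_check and strict_check are mine. The idea is to match only installed rows whose package name starts with libnccl, instead of any line containing nccl:

```shell
# Hypothetical rows in the format printed by `dpkg -l` (versions are made up).
sample_without_nccl='ii  libvncclient1:amd64  0.9.13  amd64  VNC client library'
sample_with_nccl='ii  libnccl-dev  2.18.3  amd64  NCCL development files'

# Loose check, equivalent to the current Makefile test:
# succeeds if any line contains the substring "nccl".
loose_check() {
  printf '%s\n' "$1" | grep -q nccl
}

# Stricter check (an assumption about what a fix could look like):
# succeeds only if an installed ("ii") row has a package name
# that starts with "libnccl".
strict_check() {
  printf '%s\n' "$1" |
    awk '$1 == "ii" && $2 ~ /^libnccl/ { found = 1 } END { exit !found }'
}

if loose_check "$sample_without_nccl"; then
  echo "loose check: false positive on libvncclient1"
fi
if ! strict_check "$sample_without_nccl"; then
  echo "strict check: libvncclient1 correctly rejected"
fi
if strict_check "$sample_with_nccl"; then
  echo "strict check: libnccl-dev detected"
fi
```

If something like this were moved into the $(shell ...) call, the awk field references would need to be written as $$1 and $$2 so that make does not expand them. Since the link step failed on -lnccl, it may also be worth checking specifically for libnccl-dev, which I believe is the package that provides the library the linker needs.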
OS: Ubuntu 22.04.5 LTS