SSL issue on nodes #31
Comments
This looks like a high priority issue @ddamoursNRC @NRCGavin |
Could you share your job submission script so I can test? You should have Internet access from the nodes. |
Hi @joeydumont, I'm trying to install Sockeye-2, which has support for Horovod, a distributed deep learning training framework. I'm following the guide "Build a Conda Environment with GPU Support for Horovod", but with some added dependencies for Sockeye-2. The original guide's intent is to make a conda environment with all the major deep learning frameworks plus Jupyter. I'm not sure that my scripts are fully functional yet because I can't get them to access the internet or CUDA ;) but here's what I've got so far. Under
Once the environment is properly created
We can see that NCCL wasn't detected even though it is part of the build file. |
Hi Sam, I tried this today, and I had the same errors as you during the download, but I had the same issue on both the head node and the compute nodes. The error is related to pip (or rather the
The fact that it worked at night (I was just able to run the install job on the compute node) makes me think that some networking appliance was having trouble keeping up. The issue was happening consistently when trying to download mxnet-cuda101, which is about 750MB in size. Even using
Here's the script I used to install
#!/bin/bash
#SBATCH -p JobTesting
#SBATCH -A itops
#SBATCH --time=2:00:00
#SBATCH --gres=gpu:4
#SBATCH [email protected]
#SBATCH --mail-type=ALL
source /project/WMT20/setup_tools
export OMPI_MCA_opal_cuda_support=true
export ENV_PREFIX=$CONDA_PREFIX/Sockeye-2.1.21
export CUDA_HOME=/usr/local/cuda-10.1
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
export PIP_VERBOSE=1
conda env create -vv --prefix $ENV_PREFIX --file environment.yml --force
conda activate $ENV_PREFIX
horovodrun --check-build
I got the same errors as you:
horovodrun --check-build
2020-09-10 22:30:09.681037: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:09.681211: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:09.681238: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:25.038451: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:25.038626: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:25.038648: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:30.844146: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:30.844307: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:30.844338: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:36.443558: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:36.443722: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:36.443744: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:42.477443: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:42.477607: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:42.477626: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:48.617062: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:48.617224: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:48.617243: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Horovod v0.19.5:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[X] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[ ] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
It turns out that the loader issues are a known problem in tf2.1, so I downgraded your
(/project/WMT20/opt/miniconda3/Sockeye-2.1.21) [admin.joey.dumont@cn135 ~]$ horovodrun --check-build
Horovod v0.19.5:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[X] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[ ] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
I'll try to find out a bit later what the exact requirements are for NCCL to be properly detected. In the logs, I see that nccl is installed, but I don't see any relevant errors. Hope this helps. |
Thanks @joeydumont. |
I've had similar issues in the past (installed software not being found), and one cause that turned up often was that the search path did not contain the directory where the software was installed. If you haven't already, you may want to check that Horovod is searching the path where NCCL is installed. Just a guess in the dark. |
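A quick way to act on this suggestion is to look for the NCCL header and shared library under the prefix that the install script exports as HOROVOD_NCCL_HOME. The following is only a minimal sketch: `check_nccl_prefix` is a hypothetical helper name, and it assumes NCCL uses the conventional `include/` and `lib/` layout under that prefix, which may not match every installation.

```shell
#!/bin/bash
# Sketch: report whether an NCCL header and shared library exist under a prefix.
# Assumption: NCCL is laid out as <prefix>/include/nccl.h and <prefix>/lib/libnccl.so*.
# check_nccl_prefix is a hypothetical helper, not part of Horovod or NCCL.
check_nccl_prefix() {
    local prefix="$1"
    local status=0

    if [ -f "$prefix/include/nccl.h" ]; then
        echo "found header: $prefix/include/nccl.h"
    else
        echo "missing header: $prefix/include/nccl.h"
        status=1
    fi

    # Accept libnccl.so as well as versioned names such as libnccl.so.2.
    if ls "$prefix"/lib/libnccl.so* > /dev/null 2>&1; then
        echo "found library under: $prefix/lib"
    else
        echo "missing library: $prefix/lib/libnccl.so*"
        status=1
    fi

    return "$status"
}

# Example: check the prefix exported in the install script, if it is set.
if [ -n "${HOROVOD_NCCL_HOME:-}" ]; then
    check_nccl_prefix "$HOROVOD_NCCL_HOME" || echo "NCCL files not found under HOROVOD_NCCL_HOME"
fi
```

If either file is reported missing, the prefix given to Horovod at build time probably does not point at the NCCL installation, which would be consistent with NCCL showing as unavailable in `horovodrun --check-build`.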
As far as download issues are concerned: this afternoon and yesterday, TLS/SSL transfers have been stalling, with downloads interrupted midway through.
Transfer speed on trixie hn2 and cn101 this afternoon:
Resolving files.wolframcdn.com (files.wolframcdn.com)... 152.195.19.5
0% [ ] 12,632,055 241KB/s in 52s
2020-10-01 12:19:19 (239 KB/s) - Read error at byte 12632055/1634087483 (Connection reset by peer).
Transfer speed on another host around the same time today:
--2020-10-01 16:26:04-- https://files.wolframcdn.com/CUDA/12.1.0.0/CUDAResources-Lin64-12.1.0.paclet
CUDAResources-Lin64-12.1 100%[=================================>] 1.52G 110MB/s in 14s
2020-10-01 16:26:19 (109 MB/s) - ‘CUDAResources-Lin64-12.1.0.paclet’ saved [1634087483/1634087483] |
Regular HTTP transfers do not appear to be affected, as large file downloads from a CentOS mirror site are successful.
[fieldsa@cn101 ~]$ wget http://distro.ibiblio.org/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
100%[=======================================>] 4,781,506,560 7.11MB/s in 12m 33s
2020-10-01 15:02:09 (6.05 MB/s) - ‘CentOS-7-x86_64-DVD-2003.iso’ saved [4781506560/4781506560] |
An HTTPS transfer test was also done against a CentOS mirror. It presently fails as well, so the problem is not specific to certain external servers. A follow-up ticket will be sent to the firewall team.
[fieldsa@cn101 download-test]$ wget https://mirror.its.dal.ca/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
25% [=========> ] 1,210,253,046 5.19MB/s in 3m 50s
2020-10-01 15:16:34 (5.01 MB/s) - Read error at byte 1210253046/4781506560 (Connection reset by peer). |
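For anyone wanting to repeat the HTTP versus HTTPS comparison above on another node, here is a rough sketch. `probe_url` is a hypothetical helper name; it assumes `wget` is available on the node, and the 30-second timeout is an arbitrary choice.

```shell
#!/bin/bash
# Sketch: fetch a URL once, discard the body, and report whether the transfer completed.
# probe_url is a hypothetical helper; the timeout value is an arbitrary assumption.
probe_url() {
    local url="$1"
    # --tries=1 reports a mid-transfer reset instead of retrying it;
    # -O /dev/null discards the payload; -q suppresses progress output.
    if wget -q --tries=1 --timeout=30 -O /dev/null "$url"; then
        echo "OK      $url"
    else
        # In the else branch, $? still holds wget's exit status.
        echo "FAILED  $url (wget exit code $?)"
    fi
}

# Example usage with the URLs from the tests above (large downloads, so left commented out):
# probe_url "http://distro.ibiblio.org/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso"
# probe_url "https://mirror.its.dal.ca/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso"
```

Running the same pair of probes from the head node and from a compute node should show whether only HTTPS transfers are failing and whether the failure depends on the node.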
Is this issue still valid, or can this be closed? |
I'm trying to install Sockeye with Horovod, but to do so I need both internet access and access to CUDA/nvcc. These requirements seem to be mutually exclusive on Trixie: on the head node you have internet access but not nvcc, and on a worker node you don't have internet access but CUDA is installed. Here is the error message I'm seeing.
How do I get a valid SSL on a node or access to CUDA on the head node?