Supporting multiple CUDA versions? (CUDA bumps to 11.8 on the 4.2.2 images) #582
I'm working on a CUDA (keras/tensorflow) setup for the aged NVIDIA Tesla K80 Azure datacenter GPU (2014). These GPUs are rather old and close to being retired by Microsoft, but they are very cheap and useful for the educational VMs used by students at our institution. The supported NVIDIA kernel driver branch is 470:
From the docs, 470 should be supported by all CUDA 11.x releases, but comparing the repositories, the 470 (user-mode) driver is missing.
I tried this, but it doesn't work:
I also tried this, but I found some problems:
A possible solution for CUDA 11 support on the 470 branch might be:
@hute37 Thanks for digging into this. As you can see, we're still figuring out the best strategy for handling CUDA versioning in this stack. I haven't had a chance to investigate here; this is a great start, but we will need to dig a bit deeper still. As you know, there are at least three moving parts in the versioning scheme we need to triangulate:
Obviously rocker can only directly select versions in the third category. My issue up top referenced the second of these, but I think the right solution there is to recommend that the user update the host drivers, rather than attempting to support all drivers. But on to your issue: it would definitely be nice to retain support for older hardware. I'm a bit puzzled why the cuda11.1 setup is not viable here; it may be due more to how rocker/ml:4.2.1-cuda11.1 is built than to CUDA itself? As you've noticed, that version and prior versions of the rocker CUDA stack added the CUDA libs on top of the r-ver base image using custom scripts based on NVIDIA's containers, while the current CUDA 11.8 script instead uses the official NVIDIA CUDA Ubuntu-based images as our base image. (The custom scripts were used because, at the time, NVIDIA only provided ubuntu-18.04 base images.) So rather than proliferating too many tags, it might be better to see about patching the cuda11.1 image correctly for this? For the ...
Maybe the question is easy to formulate, but the answer is not ... Q: "What is the latest tensorflow version that can be used (by CRAN keras) with obsolete (470-driver line) NVIDIA GPUs?" Because of stack dependencies, the answer depends on several sub-questions:
While NVIDIA declares full support for the 470 line until CUDA 12.2, some combinations are not available in the NVIDIA repositories, in particular for obsolete hardware. I'd prefer apt-based installations, but maybe another installation method could fill the gaps? I would like to avoid conda/miniconda stacks because I need to interoperate with projects based on ... ChatGPT wasn't helpful ...
Thanks, this is definitely helpful. NVIDIA obviously isn't making it easy for us by insisting that
while at the same time insisting that
The first choice makes sense to me, in that it allows users to run both older and newer software by staying current on their drivers. The second choice seems unfortunate, and basically says that if you want to use old hardware you'll be stuck on old software too. (Obviously that's financially in the interest of a company selling new hardware, and may contribute to Microsoft's choice here too.) So I think this also supports your formulation of the question: the only way forward on old hardware is to lock in an old version of all the software as well, including old versions of keras, tensorflow, and the CUDA toolkit. Does that sound accurate?

OK, so now for nuts and bolts. Given the above, I think it won't be viable to look for a solution that takes the default tensorflow version of the current CRAN version of keras as the constraint -- it's not clear from the above that tensorflow 2.11 was ever intended for a driver 470 / CUDA 11.1 / Ubuntu 20.04 environment. I don't have a machine running the 470 drivers available, so I can't help much with checking things here, but can you look into some earlier versions of tensorflow? (In particular, I'm not clear on the history of the libnvinfer libs here; they may have been introduced only later?)
It worked! One point is compatibility between the (host) kernel driver from the 470 line and the user-mode driver libraries. With this setup, I could install the CUDA 11.8 libraries. Another very important topic is compute capability:
For instance, the Tesla (Kepler) K80 has compute capability 3.7. Your hardware supports a fixed compute capability level; that fixes the maximum version of the cuDNN library that still supports the GPU, which in turn fixes the maximum version of TensorFlow that can be used on that system. Some useful references:
A working configuration ...
Drivers and Container Toolkit: Apt installed
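(The exact commands used aren't reproduced above; the sketch below shows what an apt-based host setup of this kind typically looks like, assuming Ubuntu 20.04 with the NVIDIA container toolkit repository already configured — the package names are illustrative.)

```bash
# Host side: 470-branch kernel driver plus the NVIDIA Container Toolkit.
# Assumes the nvidia-container-toolkit apt repository has already been added.
sudo apt-get update
sudo apt-get install -y nvidia-driver-470 nvidia-container-toolkit
# Register the NVIDIA runtime with Docker and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```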
Manual installation (in ...): I manually downloaded the installation packages:
Also required (under ...):
Check Libraries:
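(The check commands themselves aren't shown above; here is a hedged sketch of the kind of checks involved — header and library paths may differ depending on how cuDNN was installed.)

```bash
# Confirm the driver and the CUDA version the runtime reports.
nvidia-smi
# List the CUDA/cuDNN/TensorRT user-mode libraries visible to the dynamic linker.
ldconfig -p | grep -E 'libcudart|libcublas|libcudnn|libnvinfer'
# Print the cuDNN version from its header, if installed via the .deb packages.
grep -A 2 'define CUDNN_MAJOR' /usr/include/cudnn_version.h 2>/dev/null || true
```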
A simple (python) test script:
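(The script itself isn't reproduced above; a minimal sketch along the lines of the TensorFlow "Use a GPU" guide, assuming tensorflow is already installed in the image's Python environment:)

```bash
# Check that TensorFlow sees the GPU, report its compute capability,
# and run a small matmul on it.
python3 - <<'EOF'
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)

if gpus:
    details = tf.config.experimental.get_device_details(gpus[0])
    print("Compute capability:", details.get("compute_capability"))
    with tf.device("/GPU:0"):
        a = tf.random.uniform((1000, 1000))
        b = tf.random.uniform((1000, 1000))
        print("GPU matmul norm:", tf.norm(tf.matmul(a, b)).numpy())
EOF
```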
Building a container is a different matter ... maybe a multi-stage container build could be used to grab the CUDA parts from different base images(?)
It worked!

Sources
Image Patch
Test Scripts — test based on the TensorFlow guide "Use a GPU"

Configuration:
Rocker-Project
Hardware
Host OS
Image CUDA Stack
Image Python Stack
@hute37 Hey, well done, that's pretty cool! So it looks like dropping the compat libraries and rolling cuDNN back to 8.6 was key? Nicely written install script, thanks for sharing!
Matching the cuDNN compute capability requirements to your GPU is critical. NVIDIA declares support for compute capability 3.5+ (and the 470 driver) for all CUDA 11 releases, but later components seem to break that support. The NVCC compiler is also a requirement: I have read something about the pre-built kernel byte-code cache being dropped for some GPUs, so the compiler must be present to regenerate the cache on the fly (with a noticeable startup delay?).
One thing is still missing: BLAS/LAPACK support ... I tried to patch the R/Rscript Renviron to enable the LD_PRELOAD trick that links the nvblas/cublas library in front of the standard OpenBLAS (what about MKL?). In terms of support lines, maybe a ...
To provide NVBLAS-enabled R/Rscript wrappers:

cp -a $(which R) $(which R)_
echo '#!/bin/bash' > $(which R)
echo "command -v nvidia-smi >/dev/null && nvidia-smi -L | grep 'GPU[[:space:]]\?[[:digit:]]\+' >/dev/null && export LD_PRELOAD=libnvblas.so" >> $(which R)
echo "$(which R_) \"\${@}\"" >> $(which R)

cp -a $(which Rscript) $(which Rscript)_
echo '#!/bin/bash' > $(which Rscript)
echo "command -v nvidia-smi >/dev/null && nvidia-smi -L | grep 'GPU[[:space:]]\?[[:digit:]]\+' >/dev/null && export LD_PRELOAD=libnvblas.so" >> $(which Rscript)
echo "$(which Rscript_) \"\${@}\"" >> $(which Rscript)

👉 Enabled at runtime, and only if a GPU is actually detected (nvidia-smi -L lists at least one GPU). Run some benchmarks to ensure that NVBLAS actually outperforms OpenBLAS.

References:
I found an issue in the BLAS configuration. In the base ... In the image with nvblas/cublas enabled, the system BLAS configuration is reset to the basic (slow) reference libraries, but ...
To reset the BLAS configuration, I had to include this in my setup:
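(The exact lines aren't shown above; on Debian/Ubuntu this switch normally goes through the alternatives system, so a hedged sketch could look like the following — the library paths are assumptions.)

```bash
# Point the BLAS/LAPACK alternatives back at OpenBLAS (paths are illustrative).
update-alternatives --set libblas.so.3-x86_64-linux-gnu \
  /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
update-alternatives --set liblapack.so.3-x86_64-linux-gnu \
  /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
```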
To check (from a bash shell inside the container):
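(A hedged sketch of the kind of check that applies here:)

```bash
# Show which BLAS/LAPACK libraries the alternatives system currently selects.
update-alternatives --display libblas.so.3-x86_64-linux-gnu
update-alternatives --display liblapack.so.3-x86_64-linux-gnu
# Recent R versions report the BLAS/LAPACK shared objects actually in use.
Rscript -e 'sessionInfo()'
```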
References:
@hute37 IIRC, BLAS was intentionally turned off by default due to #471, which I believe was traced to an open issue with how either ... Although the ...

@eitsupi Do you think we could turn the openblas config back on by default for 22.04?
Sure. (See rocker-versioned2/scripts/install_python.sh, lines 43 to 50 in 8279ff1.)
This test runs without any errors under this configuration:
I didn't test with the nvblas/cublas libraries ...
Close #471 (and related to #582 (comment)). This workaround seems to be sufficient if it is executed only on Ubuntu 20.04, since OpenBLAS on Ubuntu 22.04 does not seem to have the problem of crashing numpy.
Just wondering if we want to revisit support for multiple CUDA tags across a given/latest version of R. We bumped up to 11.8 with the R 4.2.2 / Ubuntu 22.04 release, and I'm observing that it is not compatible with host platforms that might be running older CUDA drivers. (Note that the host machine has to have a driver version greater than or equal to what the libraries on the containers require.)
NVIDIA provides 11.7.0 and 11.7.1 base images on ubuntu-22.04, as well as the 11.8.0 we're using.
Weirdly, I have one machine with NVIDIA driver 470.141.03 (CUDA Version: 11.4) that runs the 11.8 image fine, but a machine with slightly newer drivers (Driver Version: 515.65.01, CUDA Version: 11.7) can only run the 11.7.1 images, not the 11.8 Dockerfiles.
Other experiences are welcome; I'll try and triangulate this one a bit more too.
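A quick way to compare a host against candidate image tags (a sketch only; the rocker/ml tags shown are examples taken from this thread, adjust to whatever you're testing):

```bash
# Host side: report the installed driver and the CUDA version it advertises.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | head -n 4
# Container side: run the same check inside each candidate image.
docker run --rm --gpus all rocker/ml:4.2.1-cuda11.1 nvidia-smi
docker run --rm --gpus all rocker/ml:4.2.2 nvidia-smi
```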