I am running the JAX container on a GH200 cluster. The cluster maintainer would like to keep the CUDA kernel driver at v12.2.
When I run the jax-toolbox nightly container, fused_attention in Transformer Engine raises an unsupported PTX exception.
I am trying to resolve the problem and wonder whether it is possible to enable CUDA Forward Compatibility mode in the container.
Thanks in advance!
In principle the forward compatibility packages are installed in the ghcr.io/nvidia/jax:XXX containers.
If you run nvidia-smi inside/outside the container, what CUDA versions does it show?
If it shows the older 12.2 version in both places, you may not be using the NVIDIA container toolkit (https://docs.nvidia.com/deploy/cuda-compatibility/index.html#frequently-asked-questions), or manual LD_LIBRARY_PATH changes or directories mounted in from the host system may be interfering. You can check which libcuda.so* libraries and symlinks appear inside your container with something like find / -name 'libcuda.so*'. In this configuration (current nightlies use CUDA 12.5 containers, and your 12.2 driver is older), libcuda.so.555.42.02 from the compat package should be the one in use.
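If Python is more convenient than find inside the container, the same check can be sketched as below. This is only an illustrative sketch: the helper names find_libcuda, format_cuda_version and driver_cuda_version are made up for this example and are not part of JAX or the container. The last helper assumes that libcuda.so.1 is resolvable by the dynamic loader and that cuDriverGetVersion can be called without prior initialisation.

```python
import ctypes
import fnmatch
import os


def find_libcuda(root="/"):
    """Return sorted (path, symlink_target_or_None) pairs for libcuda.so* under root."""
    matches = []
    # Unreadable directories are skipped silently instead of raising.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            if fnmatch.fnmatch(name, "libcuda.so*"):
                path = os.path.join(dirpath, name)
                target = os.path.realpath(path) if os.path.islink(path) else None
                matches.append((path, target))
    return sorted(matches)


def format_cuda_version(raw):
    """Decode the driver API integer encoding (1000 * major + 10 * minor)."""
    return f"{raw // 1000}.{(raw % 1000) // 10}"


def driver_cuda_version(libname="libcuda.so.1"):
    """Ask the libcuda that the loader actually picks which CUDA version it supports.

    Assumption: libname is on the loader path of the current process.
    """
    lib = ctypes.CDLL(libname)
    version = ctypes.c_int(0)
    status = lib.cuDriverGetVersion(ctypes.byref(version))
    if status != 0:
        raise RuntimeError(f"cuDriverGetVersion failed with status {status}")
    return format_cuda_version(version.value)
```

Calling find_libcuda("/") inside the container lists every candidate library together with the file each symlink resolves to, and driver_cuda_version() reports what the library actually picked up by the loader claims to support. If the compat library is in use, the reported version would be expected to match the container's CUDA version rather than 12.2.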
If it still doesn't work, please provide more details of the cluster environment.