AMD GPU support #20
base: develop
Conversation
- Bind mount /dev/dri and /dev/kfd in the rootfs
- Add the amdgpu hook and install it by default
- Enable amdgpu when --amdgpu is passed
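For illustration, here is a minimal sketch of what the bind-mount step of such a hook could look like. The function and variable names are hypothetical; the actual implementation is in the PR diff and uses Sarus' own utilities.

```cpp
#include <sys/mount.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdexcept>
#include <string>

// Bind-mount a host device path (file or directory) to the same location
// inside the container rootfs. Error handling is reduced to the essentials.
void bindMountIntoRootfs(const std::string& hostPath, const std::string& rootfs) {
    struct stat st;
    if (stat(hostPath.c_str(), &st) != 0) {
        throw std::runtime_error(hostPath + " does not exist on the host");
    }
    auto target = rootfs + hostPath;  // e.g. <rootfs>/dev/kfd
    if (S_ISDIR(st.st_mode)) {
        mkdir(target.c_str(), 0755);  // directory mount point, e.g. /dev/dri
    } else {
        int fd = open(target.c_str(), O_CREAT | O_WRONLY, 0644);  // file mount point, e.g. /dev/kfd
        if (fd >= 0) close(fd);
    }
    if (mount(hostPath.c_str(), target.c_str(), nullptr, MS_BIND | MS_REC, nullptr) != 0) {
        throw std::runtime_error("failed to bind mount " + hostPath);
    }
}

// Inside the hook, once the container rootfs path has been determined:
//   bindMountIntoRootfs("/dev/kfd", rootfsPath);   // amdgpu compute interface
//   bindMountIntoRootfs("/dev/dri", rootfsPath);   // render/card nodes, one set per GPU
```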
Hi @haampie, thanks for opening this PR!
The baseline implementation looks good!
I have a few questions for you:
- If I understand correctly, the code is for single GPU systems. What would happen on a multi-GPU system?
- The integration with the NVIDIA Container Toolkit does not require any additional CLI option. Is there a feature of the ROCm environment which could be leveraged to obtain a similar experience? If a user requests GPU hardware (and in this case, a specific GPU architecture) through the workload manager, ideally there should not be the need to repeat the request to the container engine.
install(FILES templates/hooks.d/09-slurm-global-sync-hook.json.in DESTINATION ${CMAKE_INSTALL_PREFIX}/etc/hooks.d)
install(FILES templates/hooks.d/11-amdgpu-hook.json.in DESTINATION ${CMAKE_INSTALL_PREFIX}/etc/hooks.d)
I would not add this hook to a default installation, mainly because it is targeted at very specific hardware, and therefore it should be explicitly chosen by the system administrator (like the MPI and NVIDIA hooks).
Another reason would be that at present we have no way to test it as part of the automated tests.
#include "AmdGpuHook.hpp" | ||
|
||
#include <vector> | ||
#include <fstream> |
Are all the headers here needed? For example, I don't think you need fstream, boost/regex, and possibly others.
#define sarus_hooks_amdgpu_AmdGpuHook_hpp

#include <vector>
#include <unordered_map>
As pointed out for the .cpp file above, could you check if all headers are effectively used?
Hi @Madeeks, I haven't tested this for multiple GPUs, but in principle it should work: every GPU gets its own device nodes under /dev/dri. I'll think about autodetection like we have for NVIDIA GPUs, but I didn't immediately know what to check. AMD likes to install …
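For reference, each GPU handled by the amdgpu driver exposes its own renderD<N> (and card<N>) node under /dev/dri, so one way to see the GPUs on the host is to enumerate that directory. This is a sketch using std::filesystem, not necessarily what the hook would do:

```cpp
#include <filesystem>
#include <iostream>
#include <string>

// Count GPU render nodes on the host: /dev/dri/renderD128, renderD129, ...
// Sketch only; a real hook would also need to match nodes to the amdgpu driver.
int main() {
    namespace fs = std::filesystem;
    int renderNodes = 0;
    for (const auto& entry : fs::directory_iterator("/dev/dri")) {
        const auto name = entry.path().filename().string();
        if (name.rfind("renderD", 0) == 0) {   // name starts with "renderD"
            std::cout << "found render node: " << entry.path().string() << "\n";
            ++renderNodes;
        }
    }
    std::cout << renderNodes << " render node(s) found\n";
}
```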
Ok, so the way … Another idea is to check if …
Let me elaborate a bit more on my question about the hook interface and device selection. The CUDA runtime uses the CUDA_VISIBLE_DEVICES environment variable to select devices. I was wondering if there were analogous variables in the ROCm environment. As an additional reference, the GRES plugin of Slurm sets CUDA_VISIBLE_DEVICES for the GPUs allocated to a job.
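If a variable such as ROCR_VISIBLE_DEVICES (or one set by the workload manager) were propagated to the hook, detecting the request could look roughly like the sketch below. Whether such a variable is actually available to the hook is exactly the open question here, and the function name is hypothetical:

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Parse a comma-separated device list such as "0,2" from an environment
// variable; returns an empty vector if the variable is unset. Sketch only.
std::vector<int> parseVisibleDevices(const char* variableName) {
    std::vector<int> devices;
    const char* value = std::getenv(variableName);
    if (value == nullptr) {
        return devices;
    }
    std::istringstream ss{std::string{value}};
    std::string token;
    while (std::getline(ss, token, ',')) {
        devices.push_back(std::stoi(token));
    }
    return devices;
}

// e.g. auto requested = parseVisibleDevices("ROCR_VISIBLE_DEVICES");
// An empty result could mean "no GPUs requested" or "variable not used at all".
```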
Ah, Ault is configured such that by default you get all GPUs.
So, how about we just unconditionally mount /dev/dri and /dev/kfd?

Edit: in fact I find it only confusing to mount just a few specific GPUs, because ROCR_VISIBLE_DEVICES=1,2 would then have to be unset or relabeled to ROCR_VISIBLE_DEVICES=0,1 inside the container.
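To make the relabeling issue concrete: if only the requested GPUs were mounted, something (the hook or the runtime) would have to rewrite the variable so that the indices start from 0 inside the container. A sketch of that mapping, with a hypothetical helper name:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Map host device indices (e.g. {1, 2}) to the container-local value that
// ROCR_VISIBLE_DEVICES would need inside the container (e.g. "0,1"),
// because only the mounted devices are visible there. Sketch only.
std::string relabelForContainer(const std::vector<int>& hostIndices) {
    std::string value;
    for (std::size_t i = 0; i < hostIndices.size(); ++i) {
        if (i > 0) {
            value += ",";
        }
        value += std::to_string(i);  // devices are renumbered from 0 in mount order
    }
    return value;
}

// relabelForContainer({1, 2}) == "0,1"
```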
Adds a hook for AMD GPUs, which currently just mounts /dev/dri and /dev/kfd as advocated by AMD.
The hook can be enabled through the --amdgpu flag.
It simply fails when /dev/dri or /dev/kfd does not exist or cannot be mounted.