RedHat9.2 exec k8s-driver-manager error #37

Closed
lengrongfu opened this issue Aug 16, 2024 · 7 comments

Comments

@lengrongfu

I am using gpu-operator v23.9.0 to install the NVIDIA GPU driver, but after the nvidia-driver-daemonset pod starts, the machine's kernel crashes.

The GPU card is a Tesla P4.

OS info: Red Hat 9.2, kernel version 5.14.0-284.11.1.el9_2.x86_64.

The machine has the nouveau driver installed. When I check the kernel log with dmesg, I see many errors about nouveau:
image
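
For anyone reproducing this, a minimal sketch of how to confirm that nouveau is loaded before digging through the kernel log. The helper name `module_loaded` and its optional file argument are illustrative assumptions, not part of gpu-operator or k8s-driver-manager.

```shell
#!/bin/sh
# Sketch: report whether a kernel module appears in a /proc/modules-style listing.
# module_loaded and its optional second argument are hypothetical names.
module_loaded() {
    module="$1"
    modules_file="${2:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q "^${module} " "$modules_file"
}

# Example usage on the affected node (dmesg may require root):
#   module_loaded nouveau && echo "nouveau is loaded"
#   dmesg | grep -i nouveau
```

The second argument exists only so the check can be exercised against a saved copy of the module list instead of the live node.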

@lengrongfu
Author

@cdesiniotis Have you seen this error?

@cdesiniotis
Collaborator

@lengrongfu I am not familiar with this error. It is recommended to blacklist nouveau, as it can conflict with the NVIDIA driver.

@lengrongfu
Author

lengrongfu commented Aug 22, 2024

I am using gpu-operator to install the driver. Do I need to manually add nouveau to the blacklist before installing gpu-operator?

The k8s-driver-manager pod fails when it executes rmmod nouveau:

rmmod nouveau

@cdesiniotis
Collaborator

I am using gpu-operator to install the driver. Do I need to manually add nouveau to the blacklist before installing gpu-operator?

This is not a required prerequisite, but because you are seeing errors from nouveau, I recommended that you try blacklisting it. As you pointed out, we do take care of unloading the module.

@lengrongfu
Author

OK, thanks. After I blacklisted nouveau, k8s-driver-manager runs successfully.

@lengrongfu
Author

@cdesiniotis Could we discuss adding an option to k8s-driver-manager that blacklists nouveau?

@cdesiniotis
Collaborator

Since blacklisting would require updating the initramfs and rebooting the node, it is not something we would be open to adding to this component. This should be done during infrastructure provisioning.
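
As a rough illustration of what that provisioning step could look like on a RHEL-like node, here is a minimal sketch. The function name `write_nouveau_blacklist`, its directory argument, and the file name `blacklist-nouveau.conf` are assumptions chosen for the example; nothing here comes from gpu-operator itself.

```shell
#!/bin/sh
# Sketch: write a modprobe.d file that blacklists nouveau.
# write_nouveau_blacklist and the file name are hypothetical choices.
write_nouveau_blacklist() {
    conf_dir="${1:-/etc/modprobe.d}"
    conf_file="${conf_dir}/blacklist-nouveau.conf"
    # "blacklist" stops alias-based autoloading of the module; modeset=0
    # disables kernel mode setting if the module is loaded anyway.
    printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$conf_file"
    # Print the path that was written so callers can inspect it.
    echo "$conf_file"
}

# On the node, as root, during provisioning (not executed by this sketch):
#   write_nouveau_blacklist
#   dracut --force      # rebuild the initramfs for the running kernel
#   reboot
```

The initramfs rebuild and reboot are left as comments because, as noted above, they are disruptive node-level operations that belong in infrastructure provisioning rather than in a pod.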
