This repository has been archived by the owner on May 27, 2024. It is now read-only.
Hey,
While going through a live system configuration, I noticed that gpu-operator-node-feature-discovery-worker-conf contains an incorrect device class whitelist:
According to the PCI-SIG specifications, base class 03 is Display controller; within it, subclass 00 is VGA-compatible controller and subclass 02 is 3D controller. Base class 02 is Network controller: an entry with no subclass matches any subclass, subclass 00 is Ethernet controller, and subclass 07 is InfiniBand controller.
So the configuration shipped with the operator translates to:
With such filters, gpu-operator-node-feature-discovery appears to be configured to gather both GPU and network data, although the latter should be handled by https://github.com/Mellanox/network-operator (which has a similar issue: Mellanox/network-operator#957). In my opinion, deviceClassWhitelist should contain entries only from class 03 (Display).
The result of this misconfiguration can be observed in the logs of the gpu-operator-node-feature-discovery-worker pods: they try to gather data about both Ethernet and InfiniBand devices, which should be gathered by network-operator, not gpu-operator. Those devices should be filtered out by deviceClassWhitelist:
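The class-code reasoning above can be sketched as a small Python check. This is only an illustration: the class names are the subset described in this issue, the helper names (`describe`, `non_display_entries`) are hypothetical, and `example_whitelist` is an assumption reconstructed from the description rather than copied from the actual ConfigMap.

```python
# Subset of PCI class codes discussed in this issue (names as in the PCI-SIG class list).
PCI_BASE_CLASSES = {
    "02": "Network controller",
    "03": "Display controller",
}
PCI_SUBCLASSES = {
    "0200": "Ethernet controller",
    "0207": "InfiniBand controller",
    "0300": "VGA-compatible controller",
    "0302": "3D controller",
}


def describe(entry: str) -> str:
    """Translate a whitelist entry (2 or 4 hex digits) into a readable name."""
    entry = entry.lower()
    base = PCI_BASE_CLASSES.get(entry[:2], "unknown class")
    if len(entry) == 2:
        # A bare base class matches every subclass under it.
        return f"{base} (any subclass)"
    sub = PCI_SUBCLASSES.get(entry[:4], "unknown subclass")
    return f"{base} / {sub}"


def non_display_entries(whitelist):
    """Return the entries that do not belong to base class 03 (Display)."""
    return [e for e in whitelist if not e.startswith("03")]


# ASSUMPTION: example list reconstructed from the prose description above,
# not read from gpu-operator-node-feature-discovery-worker-conf.
example_whitelist = ["02", "0200", "0207", "0300", "0302"]

if __name__ == "__main__":
    for entry in example_whitelist:
        print(entry, "->", describe(entry))
    print("entries outside class 03:", non_display_entries(example_whitelist))
```

Run against a real whitelist, any entry reported by `non_display_entries` would be a candidate for removal under the proposal in this issue.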
kubectl logs -n gpu-operator gpu-operator-node-feature-discovery-worker-7ndj5 | head -n 5
E0526 21:58:37.810614 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/eno3/speed: invalid argument" attributeName="speed"
E0526 21:58:37.811725 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ens6f0/speed: invalid argument" attributeName="speed"
E0526 21:58:37.811789 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ens6f1/speed: invalid argument" attributeName="speed"
E0526 21:58:37.812141 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ibp154s0v0/speed: invalid argument" attributeName="speed"
E0526 21:58:37.812180 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ibp154s0v1/speed: invalid argument" attributeName="speed"
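To list which interfaces the worker fails to read, the log lines above can be scanned with a short script. This is a minimal sketch that assumes the error format shown in this issue; the function name `failing_interfaces` is hypothetical and the sample text is two lines taken from the log excerpt.

```python
import re

# Matches the interface name in lines like:
#   err="read /host-sys/class/net/eno3/speed: invalid argument"
IFACE_RE = re.compile(r'err="read /host-sys/class/net/([^/]+)/speed: ')


def failing_interfaces(log_text: str) -> list:
    """Return interface names with a failed 'speed' read, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = IFACE_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Two lines copied from the log excerpt in this issue.
sample = r'''E0526 21:58:37.810614 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/eno3/speed: invalid argument" attributeName="speed"
E0526 21:58:37.812141 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ibp154s0v0/speed: invalid argument" attributeName="speed"'''

if __name__ == "__main__":
    print(failing_interfaces(sample))
```

In practice the output of the `kubectl logs` command above could be passed to `failing_interfaces` instead of the inline sample.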
This configuration can be found here: https://github.com/NVIDIA/gpu-feature-discovery/blob/main/deployments/helm/gpu-feature-discovery/values.yaml#L84
In my opinion, deviceClassWhitelist for gpu-feature-discovery should contain only the 0300 and 0302 entries.
Thank you,
Franciszek