RDMA allocatable resources changed to 0 and couldn't be updated #109

Open
913871734 opened this issue Jun 14, 2024 · 0 comments

913871734 commented Jun 14, 2024

We found that the allocatable RDMA resources reported for the node had changed to 0 devices, so I started to investigate why the value was 0.

  1. I checked the plugin's log and found that it was consistently exposing 1k devices (non-zero), which shows the plugin itself was working normally.

  2. I checked the plugin's source code and found that it re-counts the devices every period. The key logic is that the plugin compares the current device count with the value from the previous cycle, and only reports an update to kubelet when the value changes. If an error occurs while communicating with kubelet, kubelet can end up holding a value of 0; but because an update is only pushed when the actual value changes, and the real number of devices never changed, kubelet's stale value cannot be corrected. The plugin's ListAndWatch implementation is quoted below, followed by a sketch of the compare-and-push behaviour.


// ListAndWatch lists devices and update that list according to the health status
func (rs *resourceServer) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
    log.Printf("ListAndWatch called by kubelet for: %s", rs.resourceName)
    resp := new(pluginapi.ListAndWatchResponse)

    // Send initial list of devices
    if err := rs.sendDevices(resp, s); err != nil {
        return err
    }

    rs.mutex.RLock()
    err := rs.updateCDISpec()
    rs.mutex.RUnlock()
    if err != nil {
        log.Printf("cannot update CDI specs: %v", err)
        return err
    }

    for {
        select {
        case <-s.Context().Done():
            log.Printf("ListAndWatch stream close: %v", s.Context().Err())
            return nil
        case d := <-rs.health:
            // FIXME: there is no way to recover from the Unhealthy state.
            d.Health = pluginapi.Unhealthy
            _ = s.Send(&pluginapi.ListAndWatchResponse{Devices: rs.devs})
        case <-rs.updateResource:
            if err := rs.sendDevices(resp, s); err != nil {
                // The old stream may not be closed properly, return to close it
                // and pass the update event to the new stream for processing
                rs.updateResource <- true
                return err
            }
            err := rs.updateCDISpec()
            if err != nil {
                log.Printf("cannot update CDI specs: %v", err)
                return err
            }
        }
    }
}
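
The compare-and-push behaviour described in point 2 lives outside ListAndWatch, in the plugin's periodic discovery loop. The following is a minimal sketch of that pattern, not the plugin's actual code: the function name watchDeviceCount, the countDevices callback, and the interval parameter are all invented for illustration.

package sketch

import "time"

// watchDeviceCount re-checks the device count every cycle, but only signals
// an update (which makes ListAndWatch resend the device list) when the count
// differs from the previous cycle. If kubelet's copy has drifted, e.g. to 0
// after a failed stream, an unchanged count means the correction is never pushed.
func watchDeviceCount(countDevices func() int, interval time.Duration, update chan<- bool, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    last := countDevices()
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            if n := countDevices(); n != last { // push only when the value changes
                last = n
                update <- true
            }
        }
    }
}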

Therefore, I suggest changing the update mechanism to add a forced push: for example, when the value has stayed the same for a specified number of cycles, a push update is still performed so that kubelet can resynchronize. A rough sketch of this idea follows.
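
Continuing the sketch above (same package and import; forcedPushCycles is a hypothetical threshold, and watchDeviceCountForced is an invented name), the discovery loop could also push after a fixed number of unchanged cycles:

// watchDeviceCountForced pushes an update when the count changes, and also
// after forcedPushCycles consecutive unchanged cycles, so kubelet can recover
// from a stale value even though nothing actually changed on the node.
func watchDeviceCountForced(countDevices func() int, interval time.Duration, forcedPushCycles int, update chan<- bool, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    last := countDevices()
    unchanged := 0
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            n := countDevices()
            if n != last {
                last = n
                unchanged = 0
                update <- true // normal push on change
                continue
            }
            unchanged++
            if unchanged >= forcedPushCycles {
                unchanged = 0
                update <- true // forced push even though the value is unchanged
            }
        }
    }
}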
