RDMA allocatable resources changed to 0 and couldn't be updated #109

Open
913871734 opened this issue Jun 14, 2024 · 0 comments

913871734 commented Jun 14, 2024

We found that the allocatable RDMA resources reported for the node had changed to 0 devices, so I started to investigate why the value was 0.

  1. I checked the plugin's log and found that it was consistently exposing 1k devices (non-zero), which shows the plugin itself was working normally.

  2. I checked the plugin's source code and found that it re-counts the devices every period. The key logic is that the plugin compares the current device count with the value from the previous cycle, and only reports an update to kubelet when the value changes. If an error occurs while communicating with kubelet, kubelet can end up holding a value of 0; but because an update is only pushed when the actual value changes, and the real number of devices never changed, kubelet's stale value cannot be corrected. The plugin's ListAndWatch implementation is quoted below, followed by a sketch of the compare-and-push behaviour.


// ListAndWatch lists devices and update that list according to the health status
func (rs *resourceServer) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
    log.Printf("ListAndWatch called by kubelet for: %s", rs.resourceName)
    resp := new(pluginapi.ListAndWatchResponse)

    // Send initial list of devices
    if err := rs.sendDevices(resp, s); err != nil {
        return err
    }

    rs.mutex.RLock()
    err := rs.updateCDISpec()
    rs.mutex.RUnlock()
    if err != nil {
        log.Printf("cannot update CDI specs: %v", err)
        return err
    }

    for {
        select {
        case <-s.Context().Done():
            log.Printf("ListAndWatch stream close: %v", s.Context().Err())
            return nil
        case d := <-rs.health:
            // FIXME: there is no way to recover from the Unhealthy state.
            d.Health = pluginapi.Unhealthy
            _ = s.Send(&pluginapi.ListAndWatchResponse{Devices: rs.devs})
        case <-rs.updateResource:
            if err := rs.sendDevices(resp, s); err != nil {
                // The old stream may not be closed properly, return to close it
                // and pass the update event to the new stream for processing
                rs.updateResource <- true
                return err
            }
            err := rs.updateCDISpec()
            if err != nil {
                log.Printf("cannot update CDI specs: %v", err)
                return err
            }
        }
    }
}
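
The compare-and-push behaviour described in point 2 lives outside ListAndWatch, in the plugin's periodic discovery loop. The following is a minimal sketch of that pattern, not the plugin's actual code: the function name watchDeviceCount, the countDevices callback, and the interval parameter are all invented for illustration.

package sketch

import "time"

// watchDeviceCount re-checks the device count every cycle, but only signals
// an update (which makes ListAndWatch resend the device list) when the count
// differs from the previous cycle. If kubelet's copy has drifted, e.g. to 0
// after a failed stream, an unchanged count means the correction is never pushed.
func watchDeviceCount(countDevices func() int, interval time.Duration, update chan<- bool, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    last := countDevices()
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            if n := countDevices(); n != last { // push only when the value changes
                last = n
                update <- true
            }
        }
    }
}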

Therefore, I suggest changing the update mechanism to add a forced push: for example, when the value has stayed the same for a specified number of cycles, a push update is still performed so that kubelet can resynchronize. A rough sketch of this idea follows.
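
Continuing the sketch above (same package and import; forcedPushCycles is a hypothetical threshold, and watchDeviceCountForced is an invented name), the discovery loop could also push after a fixed number of unchanged cycles:

// watchDeviceCountForced pushes an update when the count changes, and also
// after forcedPushCycles consecutive unchanged cycles, so kubelet can recover
// from a stale value even though nothing actually changed on the node.
func watchDeviceCountForced(countDevices func() int, interval time.Duration, forcedPushCycles int, update chan<- bool, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    last := countDevices()
    unchanged := 0
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            n := countDevices()
            if n != last {
                last = n
                unchanged = 0
                update <- true // normal push on change
                continue
            }
            unchanged++
            if unchanged >= forcedPushCycles {
                unchanged = 0
                update <- true // forced push even though the value is unchanged
            }
        }
    }
}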
