-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Update controller logic to handle stale SriovNetworkNodeState CRs with delay #798
feat: Update controller logic to handle stale SriovNetworkNodeState CRs with delay #798
Conversation
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
Pull Request Test Coverage Report for Build 12122609914Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// keep until time annotation is not set, set a new value with default or configured offset and update the object | ||
delayMinutes, err := strconv.Atoi(os.Getenv("STALE_NODE_STATE_CLEANUP_DELAY")) | ||
if err != nil || delayMinutes <= 0 { | ||
delayMinutes = 30 // keep objects for 30 minutes by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some logging here to indicate env variable is incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we've got unused CleanupTimeout [1] variable. maybe we need to change it's value and use it here
[1]
sriov-network-operator/test/util/util.go
Line 34 in 68b6c02
CleanupTimeout = time.Second * 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not use this one because it is in "test" package. But I agree that we can create constant with default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added constant to hold the default value
@@ -27,6 +27,9 @@ operator: | |||
resourcePrefix: "openshift.io" | |||
cniBinPath: "/opt/cni/bin" | |||
clusterType: "kubernetes" | |||
# minimal amount of time (in minutes) the operator will wait before removing | |||
# stale SriovNetworkNodeState objects (objects that doesn't match node with the daemon) | |||
staleNodeStateCleanupDelay: "30" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: staleNodeStateCleanupDelayMinutes ? or is that too long in your opinion ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm from my side, added small comment.
once Ivan's comment addressed (we should use a constant IMO) im LGTM.
err := r.Delete(ctx, &ns, &client.DeleteOptions{}) | ||
if err != nil { | ||
logger.Error(err, "Fail to Delete", "SriovNetworkNodeState CR:", ns.GetName()) | ||
if err := r.handleStaleNodeState(ctx, &ns); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to self: this is being called every ~5 min (resync period) worst case or when policy is updated or when policy/node changed
CI failure is not related to the change. The same failure occurs on the PR with dummy changes #800 |
4245aa3
to
13dc502
Compare
@e0ne @adrianchiris I addressed your comments. I also changed behavior a bit to completely avoid any delay in case if |
13dc502
to
994ddbf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for addressing my comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1898df7
to
4ea6ce0
Compare
…Rs with delay - Changed the logic in the sriov-network-operator controller to handle stale SriovNetworkNodeState CRs (those with no matching Nodes with daemon). - Introduced a delay (30 minutes by default) before removing stale state CRs to manage scenarios where the user temporarily removes the daemon from the node but does not want to lose the state stored in the SriovNetworkNodeState. - Added the `STALE_NODE_STATE_CLEANUP_DELAY_MINUTES` environment variable to configure the required delay in minutes (default is 30 minutes).
4ea6ce0
to
5ad4ae9
Compare
@SchSeba PTAL on this one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
nice work!
Update controller logic to handle stale SriovNetworkNodeState CRs with delay
STALE_NODE_STATE_CLEANUP_DELAY_MINUTES
environment variable to configure the required delay in minutes (default is 30 minutes).This functionality especially useful when the OFED container is in use. As the OFED driver loads on the host, the sriov-config-daemon is removed from this node (achieved using configDaemon nodeselector). Since loading the driver can take a considerable amount of time, we want to ensure that the SriovNetworkNodeState is not lost during this process.