🐛 Fix unterminating DataTemplate deletion #2000
Conversation
Hi @pierrecregut. Thanks for your PR. I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.
/ok-to-test
/test metal3-centos-e2e-integration-test-main
/test metal3-centos-e2e-feature-test-main
@kashifest: The specified target(s) for the /test command were not found.
/test metal3-centos-e2e-feature-test-main-features
Force-pushed from 515943d to 72a464f.
There was a bug in the first version: the status.Indexes map must be updated even when the claim is removed, so that the mapping can still be accessed and then removed during delete. Even if the new version passes the tests, it in fact raises the question of a latent race that already exists during deletion: the indexes from getIndexes are recomputed to remove the ones corresponding to the DataClaims being removed, but we cannot be sure that the corresponding Data have had time to run their reconcile loop, which removes the finalizer and needs the template to remove the IpClaims. If that is the case, the only really safe solution is closer to Thomas's original proposal and the last implementation alternative:
}
m.DataTemplate.Status.Indexes[claimName] = dataObject.Spec.Index
claimName := dataObject.Spec.Claim.Name
statusIndexes[claimName] = dataObject.Spec.Index
indexes[dataObject.Spec.Index] = claimName
The indexes and statusIndexes maps hold the same information: in indexes, dataObject.Spec.Index is the key and claimName is the value, while statusIndexes has claimName as the key and dataObject.Spec.Index as the value. Can we not achieve what we are trying to do just by using indexes?
They are "co-maps" if you want to access from index or from claims you will not use the same map.
- statusIndexes "replaces" m.DataTemplate.Status.Indexes. We use it to get a filtered Indexes where we only keep those for which a claim exists
- indexes is in fact used as a set (value never really used). It is used to find which index can be used when a Data is created.
We could defer the creation of the indexes map. If there is no Data creation it would be faster but unless we are careful it may be slower if there are multiple Data created. This is an optimization (or not) unrelated to the issue.
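To make the "co-map" relationship concrete, here is a minimal, self-contained Go sketch; the types and values below are hypothetical and only illustrate how the two maps are filled and queried, not the actual getIndexes code.

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the relevant Metal3Data spec fields.
type dataSpec struct {
	Claim string // name of the DataClaim this Data belongs to
	Index int    // index allocated from the DataTemplate
}

func main() {
	dataObjects := []dataSpec{
		{Claim: "claim-a", Index: 0},
		{Claim: "claim-b", Index: 2},
	}

	// statusIndexes "replaces" Status.Indexes: claim name -> index,
	// filtered down to the claims that still exist.
	statusIndexes := map[string]int{}
	// indexes is the inverse view, index -> claim name, effectively used
	// as a set when looking for a free index for a new Data.
	indexes := map[int]string{}

	for _, d := range dataObjects {
		statusIndexes[d.Claim] = d.Index
		indexes[d.Index] = d.Claim
	}

	// Access by claim name:
	fmt.Println(statusIndexes["claim-a"]) // 0
	// Access by index (is index 1 free?):
	_, taken := indexes[1]
	fmt.Println(taken) // false
}
```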
Yes, we need to check both liveClaims and liveData before deleting the dataTemplate.
@adilGhaffarDev I will do it by modifying this PR (no need for a third one). This is the safe solution.
Force-pushed from 72a464f to 0919133.
/test metal3-centos-e2e-integration-test-main
Failure is not related to the PR, running the test again. I have run multiple tests with this PR, and I don't see any leftover metal3datas, ipclaims, or metal3datatemplates. I will do more testing. Let's merge this PR and monitor the CI. Let's also backport it to 1.8. @pierrecregut @tmmorin, thank you for working on it. Also, have you tested this on your side? Is it solving your issue? /lgtm
Let's test this several times before merging it.
Thomas has tested the patch in the Sylva setting. It corrects some of the problems, but we encounter other issues that are not related to the logic of the DataTemplate controller but rather to the Data controller.
So we propose to add the fix for the first point to this PR. We think that point 2 is rare but more general and should be addressed somewhere else. Finally, point 3 could be a more major refactoring that simplifies the logic, but it needs much more care to be implemented and can be done later.
Thanks for the recap @pierrecregut - and yes, I agree with what you propose.
I have pushed the fix for point 1 as a separate commit in the PR and changed the comment on UpdateData. The pair should solve #1994 except for corner cases (point 2 above).
I've run tests with a local build made from this branch, and I observed no resource deletion issue. 👍
/test metal3-centos-e2e-integration-test-main
I have tested it on my side, it's working fine.
@pierrecregut can you squash the commits?
Some Metal3Data and Metal3DataTemplates were not destroyed in rare circumstances due to missing reconcile events caused by two distinct problems.

DataTemplate deletion has a watch over DataClaims, but deletion can only be performed when the Data are completely removed (finalizer executed) because the Data resources require the template. We keep track of whether Data and DataClaims still exist:

* When DataClaims exist, we do not remove the finalizer and wait for reconcile events from the watch.
* If there are neither DataClaims nor Data, we can safely remove the finalizer. Deletion is complete.
* Otherwise, we have to busy-wait on Data deletion, as there is no claim left to generate events but we still need to wait until the Data finalizers have been executed.

Actively wait for DataClaim deletion: if a Metal3Data is deleted before the Metal3DataClaim, we must actively wait for the claim deletion, as no other reconciliation event will be triggered.

Fixes: metal3-io#1994

Co-authored-by: Pierre Crégut <[email protected]>
Co-authored-by: Thomas Morin <[email protected]>
Signed-off-by: Pierre Crégut <[email protected]>
@adilGhaffarDev squash done, but as the commits solve two distinct issues, I wouldn't have done it on my own. /test metal3-centos-e2e-integration-test-main
@pierrecregut: The following test failed, say /retest to rerun all failed tests.
We try to keep one commit per PR; it helps keep the release notes tidy. Can you also update the PR description and add info regarding the metal3data reconciliation fix?
Yes, this seems like an unnecessary dependency on m3datatemplate in our deletion workflow; I am not sure what the impact will be if we change this. Let's do it in a separate PR and keep it on main without back-porting to v1.8. We will test it on the main CI, and if everything goes well it will be in v1.9.
/override metal3-centos-e2e-integration-test-main
@adilGhaffarDev: Overrode contexts on behalf of adilGhaffarDev: metal3-centos-e2e-integration-test-main
/unhold
/cherry-pick release-1.8
@adilGhaffarDev: new pull request created: #2007
What this PR does / why we need it:
This PR fixes some non-terminating Data/DataTemplate deletions. Some Metal3Data and Metal3DataTemplates were not destroyed in rare circumstances due to missing reconcile events caused by two distinct problems.
DataTemplate deletion has a watch over DataClaims, but deletion can only be performed when the Data are completely removed (finalizer executed) because the Data resources require the template.
We keep track of whether Data and DataClaims still exist and wait accordingly before removing the finalizer.
The other source of problems is Metal3Data deletion: if a Metal3Data is deleted before its Metal3DataClaim, we must actively wait for the claim deletion, as no other reconciliation event will be triggered.
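For illustration only, here is a rough, self-contained Go sketch of the waiting logic on the DataTemplate side described above; reconcileDelete, liveClaims, liveData and result are hypothetical names and do not reflect the actual controller code.

```go
package main

import (
	"fmt"
	"time"
)

// result is a simplified stand-in for a reconcile outcome: either deletion
// is done (finalizer can be removed), we wait for a watch event, or we
// requeue after a delay to poll for Data deletion.
type result struct {
	done         bool
	requeueAfter time.Duration
}

// reconcileDelete sketches the three cases: liveClaims and liveData stand
// for the numbers of DataClaims and Data still referencing this template.
func reconcileDelete(liveClaims, liveData int) result {
	switch {
	case liveClaims > 0:
		// Claims still exist: their deletion will trigger a new reconcile
		// through the watch, so just wait without requeueing.
		return result{}
	case liveData == 0:
		// Neither claims nor Data left: safe to remove the finalizer.
		return result{done: true}
	default:
		// No claims left to generate events, but the Data finalizers have
		// not run yet: actively requeue until the Data are gone.
		return result{requeueAfter: 30 * time.Second}
	}
}

func main() {
	fmt.Println(reconcileDelete(1, 2)) // claims left: wait for watch event
	fmt.Println(reconcileDelete(0, 2)) // only Data left: busy-wait / requeue
	fmt.Println(reconcileDelete(0, 0)) // nothing left: remove finalizer
}
```

The Metal3Data-side fix mentioned above is analogous: when a Data is deleted before its claim, the reconciler must keep requeueing until the claim is gone, since no watch event will arrive.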
Alternative implementations
Data resources may not need the template during deletion, since we want to remove every IP claim associated with it. In that case, we would only need to wait for DataClaim deletion and could remove the busy wait from templates over Data resources.
Fixes #1994