
🐛 Fix non-terminating DataTemplate deletion #2000

Merged · 1 commit into metal3-io:main on Oct 3, 2024

Conversation

pierrecregut
Contributor

@pierrecregut commented Sep 26, 2024

What this PR does / why we need it:

This PR fixes non-terminating Metal3Data/Metal3DataTemplate deletion. In rare circumstances, some Metal3Data and Metal3DataTemplate resources were never destroyed because of missing reconcile events, caused by two distinct problems.

DataTemplate deletion watches DataClaims, but the template can only be deleted once all Data resources are completely removed (their finalizers executed), because a Data resource requires the template during its own deletion.

We therefore track whether Data and DataClaims still exist, and act accordingly (a code sketch follows this list):

  • When DataClaims still exist, we do not remove the finalizer and wait for reconcile events from the watch.
  • If neither DataClaims nor Data exist, we can safely remove the finalizer: deletion is complete.
  • Otherwise we have to busy-wait on Data deletion, as there are no more claims to generate events but we still need to wait until the Data finalizers have been executed.
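As an illustration only, here is a minimal sketch of this three-way decision, using hypothetical helper names and the usual controller-runtime types; it is not the actual controller code:

```go
package sketch

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// templateDeleter abstracts the state the decision needs; the names are
// illustrative, not the real Metal3DataTemplate manager API.
type templateDeleter interface {
	CountLiveDataClaims(ctx context.Context) (int, error) // DataClaims with no deletion timestamp
	CountLiveData(ctx context.Context) (int, error)       // Data resources whose finalizer has not run yet
	RemoveFinalizer()
}

// reconcileTemplateDelete sketches the deletion decision described above.
func reconcileTemplateDelete(ctx context.Context, t templateDeleter) (ctrl.Result, error) {
	liveClaims, err := t.CountLiveDataClaims(ctx)
	if err != nil {
		return ctrl.Result{}, err
	}
	liveData, err := t.CountLiveData(ctx)
	if err != nil {
		return ctrl.Result{}, err
	}
	switch {
	case liveClaims > 0:
		// Claims still exist: keep the finalizer and rely on the watch over
		// DataClaims to trigger the next reconcile; no requeue needed.
		return ctrl.Result{}, nil
	case liveData == 0:
		// Neither claims nor Data are left: deletion is complete.
		t.RemoveFinalizer()
		return ctrl.Result{}, nil
	default:
		// No claims left to generate events, but Data finalizers are still
		// running: requeue explicitly (busy wait) until the Data are gone.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
}
```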

The other source of problems is Metal3Data deletion: if a Metal3Data is deleted before its Metal3DataClaim, we must actively wait for the claim deletion, as no other reconciliation event will be triggered.

Alternative implementations
Data resources may not need the template during deletion, since we want to remove every IP claim associated with the Data anyway. In that case, we would only need to wait for DataClaim deletion and could remove the busy wait on Data resources from the template controller (a sketch of this alternative follows).
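A minimal sketch of that alternative, deleting IPClaims by a Data-name label only; the label key and the exact IPClaim import path are assumptions for illustration, not the values used by the real controllers:

```go
package sketch

import (
	"context"

	ipamv1 "github.com/metal3-io/ip-address-manager/api/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteIPClaimsForData deletes every IPClaim carrying the Data's name label,
// without consulting the DataTemplate to enumerate its IP pools. The label
// key below is a placeholder, not the label actually set by metal3.
func deleteIPClaimsForData(ctx context.Context, c client.Client, namespace, dataName string) error {
	return c.DeleteAllOf(ctx, &ipamv1.IPClaim{},
		client.InNamespace(namespace),
		client.MatchingLabels{"example.metal3.io/data-name": dataName},
	)
}
```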

Fixes #1994

@metal3-io-bot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test) on Sep 26, 2024
@metal3-io-bot
Contributor

Hi @pierrecregut. Thanks for your PR.

I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@metal3-io-bot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files) on Sep 26, 2024
@tuminoid
Member

/ok-to-test

@metal3-io-bot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label on Sep 27, 2024
@tuminoid
Member

/cc @kashifest @mboukhalfa

@kashifest
Member

/test metal3-centos-e2e-integration-test-main
/test metal3-ubuntu-e2e-integration-test-main

@kashifest
Member

/test metal3-centos-e2e-feature-test-main

@metal3-io-bot
Contributor

@kashifest: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test build
  • /test generate
  • /test gomod
  • /test manifestlint
  • /test markdownlint
  • /test metal3-centos-e2e-integration-test-main
  • /test metal3-ubuntu-e2e-integration-test-main
  • /test shellcheck
  • /test test
  • /test unit

The following commands are available to trigger optional jobs:

  • /test metal3-centos-e2e-basic-test-main
  • /test metal3-centos-e2e-feature-test-main-features
  • /test metal3-centos-e2e-feature-test-main-pivoting
  • /test metal3-centos-e2e-feature-test-main-remediation
  • /test metal3-e2e-1-29-1-30-upgrade-test-main
  • /test metal3-e2e-clusterctl-upgrade-test-main
  • /test metal3-ubuntu-e2e-basic-test-main
  • /test metal3-ubuntu-e2e-feature-test-main-features
  • /test metal3-ubuntu-e2e-feature-test-main-pivoting
  • /test metal3-ubuntu-e2e-feature-test-main-remediation

Use /test all to run the following jobs that were automatically triggered:

  • generate
  • gomod
  • manifestlint
  • unit

In response to this:

/test metal3-centos-e2e-feature-test-main

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kashifest
Member

/test metal3-centos-e2e-feature-test-main-features
/test metal3-centos-e2e-feature-test-main-pivoting
/test metal3-centos-e2e-feature-test-main-remediation

@metal3-io-bot added the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files) and removed the size/M label on Sep 27, 2024
@pierrecregut
Contributor Author

There was a bug in the first version: the status.Indexes map must be updated even when the claim is removed, so that the mapping can still be accessed and then removed during delete. Even if the new version passes the tests, it raises the question of a latent race that already exists during deletion: the indexes returned by getIndexes are recomputed to remove the ones corresponding to the DataClaims being removed, but we cannot be sure that the corresponding Data have had time to run the reconcile loop that removes their finalizer, since they need the template to remove the IPClaims. If that is the case, the only really safe solution is closer to Thomas's original proposal and the last implementation alternative (a sketch of the two booleans follows this list):

  • Compute a boolean liveData as len(indexes) > 0, where indexes is the result of getIndexes() without the changes applied during updateData().
  • Compute a boolean liveClaims that is true iff there is a listed claim whose deletion timestamp is not set.
  • If liveClaims is true, just return without requeue, as we wait for new events from the claims (and we may keep the template for a long time if nobody deletes the claims).
  • If liveClaims and liveData are both false, remove the finalizer: the deletion is finished.
  • If liveClaims is false and liveData is true, return with an explicit requeue. Requeue should normally be avoided, but we must actively wait until the Data have been deleted; as we are only waiting for finalizers, it should be relatively quick.
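For illustration, a minimal sketch of how the two booleans could be computed, with stand-in types; this is not the actual controller code:

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// claimMeta stands in for the metadata we need from a Metal3DataClaim.
type claimMeta struct {
	metav1.ObjectMeta
}

// liveness computes the two booleans discussed above: claims are live while
// their deletion timestamp is unset, and Data are live while getIndexes()
// still reports entries (their finalizers have not run yet).
func liveness(claims []claimMeta, indexes map[int]string) (liveClaims, liveData bool) {
	for i := range claims {
		if claims[i].DeletionTimestamp.IsZero() {
			liveClaims = true
			break
		}
	}
	liveData = len(indexes) > 0
	return liveClaims, liveData
}
```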

Code excerpt under review (diff context):

    }
    m.DataTemplate.Status.Indexes[claimName] = dataObject.Spec.Index
    claimName := dataObject.Spec.Claim.Name
    statusIndexes[claimName] = dataObject.Spec.Index
    indexes[dataObject.Spec.Index] = claimName
Member


The indexes and statusIndexes maps hold the same information: in indexes, dataObject.Spec.Index is the key and claimName is the value, while statusIndexes has claimName as the key and dataObject.Spec.Index as the value. Can we not achieve what we are trying to do with just indexes?

Contributor Author

@pierrecregut commented Sep 27, 2024


They are "co-maps": depending on whether you want to access by index or by claim, you will not use the same map.

  • statusIndexes "replaces" m.DataTemplate.Status.Indexes. We use it to get a filtered Indexes where we only keep the entries for which a claim still exists.
  • indexes is in fact used as a set (the value is never really used). It is used to find which index can be used when a Data is created.

We could defer the creation of the indexes map. If no Data is created it would be faster, but unless we are careful it may be slower when multiple Data are created. This is an optimization (or not) unrelated to the issue.
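For illustration, a minimal sketch of the two "co-maps" as described above; the names follow the discussion, and the helper is not the actual controller code:

```go
package sketch

// dataRef stands in for the fields of a Metal3Data spec used here.
type dataRef struct {
	ClaimName string // dataObject.Spec.Claim.Name
	Index     int    // dataObject.Spec.Index
}

// buildIndexMaps sketches the two co-maps:
//   - statusIndexes (claim name -> index) replaces m.DataTemplate.Status.Indexes,
//     filtered to the claims that still exist;
//   - indexes (index -> claim name) is effectively used as a set of allocated
//     indexes, to pick a free index when a new Data is created.
func buildIndexMaps(dataObjects []dataRef) (statusIndexes map[string]int, indexes map[int]string) {
	statusIndexes = make(map[string]int, len(dataObjects))
	indexes = make(map[int]string, len(dataObjects))
	for _, d := range dataObjects {
		statusIndexes[d.ClaimName] = d.Index
		indexes[d.Index] = d.ClaimName
	}
	return statusIndexes, indexes
}
```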

@adilGhaffarDev
Member

There was a bug in the first version: the status.Indexes map must be updated even when the claim is removed, so that the mapping can still be accessed and then removed during delete. Even if the new version passes the tests, it raises the question of a latent race that already exists during deletion: the indexes returned by getIndexes are recomputed to remove the ones corresponding to the DataClaims being removed, but we cannot be sure that the corresponding Data have had time to run the reconcile loop that removes their finalizer, since they need the template to remove the IPClaims. If that is the case, the only really safe solution is closer to Thomas's original proposal and the last implementation alternative:

* Compute a boolean liveData as len(indexes) > 0, where indexes is the result of getIndexes() without the changes applied during updateData().

* Compute a boolean liveClaims that is true iff there is a listed claim whose deletion timestamp is not set.

* If liveClaims is true, just return without requeue, as we wait for new events from the claims (and we may keep the template for a long time if nobody deletes the claims).

* If liveClaims and liveData are both false, remove the finalizer: the deletion is finished.

* If liveClaims is false and liveData is true, return with an explicit requeue. Requeue should normally be avoided, but we _must_ actively wait until the Data have been deleted; as we are only waiting for finalizers, it should be relatively quick.

Yes, we need to check both liveClaims and liveData before deleting the DataTemplate.
I have tested Thomas's PR, and after provisioning and deprovisioning worker nodes a few times, I started seeing leftover Metal3Datas and IPClaims.
I agree with the solution you are proposing. If you have time, please open a PR or update this one; if not, I can work on it.

@pierrecregut
Contributor Author

@adilGhaffarDev I will do it by modifying this PR (no need for a third one). This is the safe solution.

@metal3-io-bot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/S label on Sep 27, 2024
@pierrecregut
Contributor Author

/test metal3-centos-e2e-integration-test-main
/test metal3-ubuntu-e2e-integration-test-main

@adilGhaffarDev
Member

@pierrecregut: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name Commit Details Required Rerun command
metal3-centos-e2e-feature-test-main-features 515943d link false /test metal3-centos-e2e-feature-test-main-features
metal3-centos-e2e-integration-test-main 0919133 link true /test metal3-centos-e2e-integration-test-main

The failure is not related to the PR; running the test again.
/test metal3-centos-e2e-integration-test-main

I have run multiple tests with this PR, and I don't see any leftover Metal3Datas, IPClaims, or Metal3DataTemplates. I will do more testing. Let's merge this PR and monitor the CI. Let's also backport it to 1.8.

@pierrecregut @tmmorin, thank you for working on it. Also, have you tested this on your side? Is it solving your issue?

/lgtm

@metal3-io-bot added the lgtm label (Indicates that a PR is ready to be merged) on Sep 30, 2024
@adilGhaffarDev
Member

cc @kashifest @mboukhalfa

@Sunnatillo
Member

Let's test this several times before merging it.

@pierrecregut
Contributor Author

Thomas has tested the patch in the Sylva setting. It corrects some of the problems, but we encounter other issues, related not to the logic of the DataTemplate controller but rather to the Data controller.

  • For context, in the Sylva test run, the namespace of the cluster is deleted before all the CAPI resources for the cluster have been removed. In this case, all objects are deleted in random order. If a Metal3Data is deleted before the claim, the deletion reconciliation is aborted but not requeued. When the claim is deleted, garbage collection due to the owner reference may be triggered, but the Data is already marked as deleted, so nothing happens. If err is nil, a requeue must be generated.

  • On rare occasions, the deferred patch fails. Unfortunately, the error is ignored, so no further reconciliation will be attempted if the reconcile itself returned no error. The fix is simple: assign err to rerr if it is non-nil. This pattern is present in seven controllers; only the remediation controller is correct. (A sketch of this fix follows the list.)

  • Finally, we have a design question regarding how Data is deleted. The delete operation lists the IP pools used in the template to find the IPClaims based on labels, both on the Data name and on the IP pool name. However, do we need to select based on the IP pool name if we iterate over all potential values anyway? If we delete all the IPClaims carrying the correct Data-name label, then we do not need the template. Consequently, we would not need to block template deletion until the Data disappears, which would resolve the conflict of waiting for DataClaims and counting Data during template deletion.
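As an illustration of the second point, here is a generic controller-runtime sketch of propagating the deferred patch error into the reconciler's named return value, using the usual cluster-api patch helper and apimachinery error aggregation; it is a sketch under those assumptions, not the exact metal3 code:

```go
package sketch

import (
	"context"

	kerrors "k8s.io/apimachinery/pkg/util/errors"
	"sigs.k8s.io/cluster-api/util/patch"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileWithPatch shows the fix for the ignored deferred-patch error:
// if the deferred Patch call fails, its error is merged into the named
// return value rerr instead of being dropped, so the reconcile is retried.
func reconcileWithPatch(ctx context.Context, helper *patch.Helper, obj client.Object) (_ ctrl.Result, rerr error) {
	defer func() {
		if err := helper.Patch(ctx, obj); err != nil {
			// Previously this error was ignored; now it is propagated.
			rerr = kerrors.NewAggregate([]error{rerr, err})
		}
	}()

	// ... reconcile logic would go here ...
	return ctrl.Result{}, nil
}
```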

@pierrecregut
Contributor Author

So we propose to add the fix for the first point to this PR. We think that point 2 is rare but more general and should be addressed somewhere else. Finally, point 3 would be a more major refactoring that simplifies the logic, but it needs much more care to implement and can be done later.

@tmmorin
Contributor

tmmorin commented Oct 2, 2024

Thanks for the recap @pierrecregut - and yes, I agree with what you propose.

@metal3-io-bot removed the lgtm label (Indicates that a PR is ready to be merged) on Oct 2, 2024
@pierrecregut
Contributor Author

pierrecregut commented Oct 2, 2024

I have pushed the fix for point 1 as a separate commit in the PR and changed the comment on UpdateData. Together, the two commits should solve #1994 except for corner cases (point 2 above).

@pierrecregut
Contributor Author

Point 2 is filed as issue #2002 for triaging, with associated PR #2003.

@tmmorin
Contributor

tmmorin commented Oct 3, 2024

I've run tests with a local build made from this branch, and I observed no resource deletion issue. 👍

@pierrecregut
Contributor Author

/test metal3-centos-e2e-integration-test-main
/test metal3-ubuntu-e2e-integration-test-main

@adilGhaffarDev
Member

I have tested it on my side; it's working fine.
/lgtm

@metal3-io-bot added the lgtm label (Indicates that a PR is ready to be merged) on Oct 3, 2024
@adilGhaffarDev
Member

@pierrecregut can you squash the commits?

Some Metal3Data and Metal3DataTemplates were not destroyed
in rare circumstances due to missing reconcile events caused
by two distinct problems.

DataTemplate deletion has a watch over DataClaims, but deletion
can only be performed when the Data are completely removed (finalizers
executed) because the Data resources require the template.

We track whether Data and DataClaims still exist:
* When DataClaims exist, we do not remove the finalizer and wait
  for reconcile events from the watch.
* If neither DataClaims nor Data exist, we can safely
  remove the finalizer. Deletion is complete.
* Otherwise we have to busy-wait on Data deletion,
  as there are no more claims to generate events but we still
  need to wait until the Data finalizers have been executed.

Actively wait for DataClaim deletion: If Metal3Data is deleted
before the Metal3DataClaim, we must actively wait for the claim
deletion, as no other reconciliation event will be triggered.

Fixes: metal3-io#1994
Co-authored-by: Pierre Crégut <[email protected]>
Co-authored-by: Thomas Morin <[email protected]>
Signed-off-by: Pierre Crégut <[email protected]>
@metal3-io-bot removed the lgtm label (Indicates that a PR is ready to be merged) on Oct 3, 2024
@pierrecregut
Contributor Author

@adilGhaffarDev squash done, but as the commits solve two distinct issues, I wouldn't have done it on my own.

/test metal3-centos-e2e-integration-test-main
/test metal3-ubuntu-e2e-integration-test-main

@metal3-io-bot
Contributor

metal3-io-bot commented Oct 3, 2024

@pierrecregut: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
metal3-centos-e2e-feature-test-main-features 515943d link false /test metal3-centos-e2e-feature-test-main-features

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@adilGhaffarDev
Member

@adilGhaffarDev squash done, but as the commits solve two distinct issues, I wouldn't have done it on my own.

We try to keep one commit per PR; it helps keep the release notes tidy. Can you also update the PR description and add information regarding the Metal3Data reconciliation fix?

@adilGhaffarDev
Member

  • Finally, we have a design question regarding how Data is deleted. The delete operation lists the IP pools used in the template to find the IPClaims based on labels, both on the Data name and on the IP pool name. However, do we need to select based on the IP pool name if we iterate over all potential values anyway? If we delete all the IPClaims carrying the correct Data-name label, then we do not need the template. Consequently, we would not need to block template deletion until the Data disappears, which would resolve the conflict of waiting for DataClaims and counting Data during template deletion.

Yes, this seems like an unnecessary dependency on the Metal3DataTemplate in our deletion workflow; I am not sure what the impact of changing it would be. Let's do it in a separate PR and keep it on main, not back-ported to v1.8; we will test it on the main CI, and if everything goes well it will be in v1.9.
For v1.8, this PR is good to go in.

@adilGhaffarDev
Member

/override metal3-centos-e2e-integration-test-main
The test already passed before the squash; no need to run it again.
/lgtm

@metal3-io-bot added the lgtm label (Indicates that a PR is ready to be merged) on Oct 3, 2024
@metal3-io-bot
Contributor

@adilGhaffarDev: Overrode contexts on behalf of adilGhaffarDev: metal3-centos-e2e-integration-test-main

In response to this:

/override metal3-centos-e2e-integration-test-main
The test already passed before the squash; no need to run it again.
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@adilGhaffarDev
Member

/unhold
Tests passed; I am unholding.

@metal3-io-bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Oct 3, 2024
@metal3-io-bot merged commit f3d7f61 into metal3-io:main on Oct 3, 2024
18 checks passed
@adilGhaffarDev
Member

/cherry-pick release-1.8

@metal3-io-bot
Contributor

@adilGhaffarDev: new pull request created: #2007

In response to this:

/cherry-pick release-1.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Successfully merging this pull request may close these issues.

Metal3DataTemplate deletion stuck, finalizer never cleared