Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync Catalog: Add k8s endpoint health state to consul health check #3874

Merged

Conversation

jukie
Copy link
Contributor

@jukie jukie commented Apr 4, 2024

Changes proposed in this PR

I recently contributed a change here from using Endpoints to EndpointSlices.
EndpointSlices offer additional state information such as readiness information but in the current approach we aren't taking advantage of this and instead only add/remove endpoints based on creation or deletions.
We can utilize the Endpoint conditions to accurately represent a registered service's health-state instead of always registering as healthy.

How I've tested this PR

  • Added tests
  • Manual build/test in live environment

How I expect reviewers to test this PR

An easy way to test would be to launch a pod with a long terminationGracePeriod that stays up and initiate a deployment rollout or delete the pod. Once the pod enters terminating state you'll see the same state in the relevant EndpointSlice.
At this point the registered consul service instance will have its health check updated to "critical" phase.

Checklist

fixes #3898

@jukie jukie force-pushed the fix/sync-catalog-ignore-terminating-endpoints branch from 2d02bf3 to e360996 Compare April 4, 2024 05:29
@jukie jukie changed the title Sync Catalog: Ignore endpoints in terminating state Sync Catalog: Ignore non-ready endpoints Apr 4, 2024
@jukie
Copy link
Contributor Author

jukie commented Apr 20, 2024

@david-yu could you possibly help get a review on this please?

@jukie
Copy link
Contributor Author

jukie commented Apr 30, 2024

bumpety bump in case anyone's watching

@ndhanushkodi
Copy link
Contributor

@jukie taking a look at this now

@ndhanushkodi ndhanushkodi mentioned this pull request May 6, 2024
2 tasks
Copy link
Contributor

@ndhanushkodi ndhanushkodi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed that the relevant acceptance tests are passing, and the changes look good to me. Just one small comment to address!

Tomorrow I want to make sure the failing peering acceptance tests are unrelated, and if you're able to make that one small test case change then I think we can merge after.

@ndhanushkodi ndhanushkodi added pr/no-changelog PR does not need a corresponding .changelog entry pr/no-backport signals that a PR will not contain a backport label labels May 7, 2024
@ndhanushkodi ndhanushkodi removed the pr/no-changelog PR does not need a corresponding .changelog entry label May 8, 2024
@ndhanushkodi
Copy link
Contributor

Hey @jukie I was about to merge yesterday but then I was thinking more about the use cases you listed in the issue, and since Consul supports health checks, should we instead use Consul healthchecks to reflect the readiness of an endpoint, rather than ignoring it entirely if its non-ready? That way the user can configure Consul as they wish based on the health checks. I talked to the team and the consensus was that it makes more sense to use the health checks to reflect the readiness status.

If you're willing to make that change, I'll keep an eye on this PR (or a new PR if you make one, feel free to tag me) for those changes, or I can bring up the changes for prioritization within the team and work on it myself (but I'm not sure when I'll be able to commit to having this done). Thanks so much for your effort on this.

@jukie
Copy link
Contributor Author

jukie commented May 9, 2024

@ndhanushkodi could you explain how that would work? I think that may be in addition to or an extension of this change vs instead of.
Since consul-k8s is already watching for updates to the EndpointSlice objects that seems like the most efficient place to react to changes in readiness in real time. If a user wants to always publish endpoints they can also set spec.publishNotReadyAddresses and that would already be supported.

@jukie
Copy link
Contributor Author

jukie commented May 9, 2024

I'd be willing to make a new PR to support consul health checks as well but I'd like to understand the opinion that it wouldn't be a separate feature contribution. The existing behavior reacts to Service Endpoint changes and this extends that to make use of the additional state provided by EndpointSlices.

@ndhanushkodi
Copy link
Contributor

I'd like to understand the opinion that it wouldn't be a separate feature contribution. The existing behavior reacts to Service Endpoint changes and this extends that to make use of the additional state provided by EndpointSlices.

Thanks for your willingness to contribute and so sorry for the late response!

The existing behavior seems to register all endpoints in the endpoint slice, regardless of the condition of that endpoint, so if I created a service where one of the endpoints went into a non-ready state, that would still get registered as a service instance in Consul. This PR would change the behavior to no longer register service instances in Consul for endpoints that are in a non-ready state. My thought with the suggestion above was we should continue to register non-ready endpoints into Consul as we do today, but instead mark the health check unhealthy. Rather than changing the behavior to not register the non-ready endpoints, I'm proposing the health check change as the way to deal with non-ready endpoints.

For the cases that use the function registerServiceInstance, I'm imagining making a change here where we change the health based on the endpoint condition rather than always marking it as passing. For Nodeport services which doesn't use registerServiceInstance, I'm thinking we'd have to manually add a check here and line 679.

Let me know if I'm misunderstanding anything you're saying, or if this makes sense.

If a user wants to always publish endpoints they can also set spec.publishNotReadyAddresses and that would already be supported.

This is for K8s Services right? Whether the endpoint addresses should exist in the K8s Endpoint itself? This change is about whether the K8s Endpoint addresses that exist should get synced to Consul if they are non-ready so I think that feature doesn't quite match the behavior I mentioned in my suggestion.

Again, let me know if I'm misunderstanding what you mean!

@jukie
Copy link
Contributor Author

jukie commented Sep 18, 2024

@ndhanushkodi Sorry for the late reply as well but coming back to this. You are correct that it would change behavior but I think that behavior matches the intention given the way that services would already be registered with details stating that "Kubernetes health checks passing" https://github.com/hashicorp/consul-k8s/blob/main/control-plane/catalog/to-consul/resource.go#L44-L48

If you'd like to include a way that replicates prior behavior for backwards compatibility I could instead add a new option that would always register an endpoint no matter what.

This change is to achieve the desired behavior of what catalog sync says it's doing. As shown in the link above when services are registered it is already assumed that they are "ready" and skipping non-ready endpoints looks to be a limitation in how Endpoints worked since that state information wasn't available. Now that EndpointSlices are in use we have more granular control and should make use of it.

@jukie
Copy link
Contributor Author

jukie commented Sep 18, 2024

Thinking more, ignoring might not be the best path but instead adjusting the health check state that's added to the registered consul service.

@jukie jukie force-pushed the fix/sync-catalog-ignore-terminating-endpoints branch from 94b6367 to 4ab7e80 Compare September 19, 2024 00:00
@jukie
Copy link
Contributor Author

jukie commented Sep 19, 2024

@ndhanushkodi I've made that adjustment to instead include endpoint health info and retain the logic of still registering all services. Could you have another look please?

@jukie jukie changed the title Sync Catalog: Ignore non-ready endpoints Sync Catalog: Add k8s endpoint health state to consul health check Sep 19, 2024
@jukie
Copy link
Contributor Author

jukie commented Sep 26, 2024

Not sure if @ndhanushkodi is still an active maintainer so adding @zalimeni as well

Copy link
Contributor

@ndhanushkodi ndhanushkodi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response I was working on some other tickets. Thanks for making these changes, I agree with this approach. I think the resource_test.go file will have failing tests (when I pulled those changes locally, it doesn't work)-- are you able to update those? Once that's in a passing state I can run our pipelines against your changes and we can get it merged

control-plane/catalog/to-consul/resource_test.go Outdated Show resolved Hide resolved
Signed-off-by: jukie <[email protected]>
@jukie
Copy link
Contributor Author

jukie commented Oct 4, 2024

@ndhanushkodi sorry for my lack of checking beforehand! Updated the tests and re-ran acceptance ones as well.

@jukie
Copy link
Contributor Author

jukie commented Oct 9, 2024

@ndhanushkodi was there any other changes you'd like to see here or is it ready to merge?

@ndhanushkodi
Copy link
Contributor

@jukie Running our pipelines in #4381

@poblahblahblah
Copy link

We are very interested in having this functionality at my company!

@ndhanushkodi ndhanushkodi merged commit 124e38b into hashicorp:main Oct 14, 2024
61 of 66 checks passed
@jukie jukie deleted the fix/sync-catalog-ignore-terminating-endpoints branch October 14, 2024 23:09
@blake blake linked an issue Oct 16, 2024 that may be closed by this pull request
jm96441n pushed a commit that referenced this pull request Oct 22, 2024
jm96441n pushed a commit that referenced this pull request Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/no-backport signals that a PR will not contain a backport label
Projects
None yet
3 participants