Fix: Do not emit health_status event for each health check attempt. #24005

Open · wants to merge 2 commits into base: main

Conversation

harish2704

Emit health_status event only if there is a change in health_status
Fixes #24003

Does this PR introduce a user-facing change?

Emit the `health_status` event only when there is a status change for the `healthy` flag

Contributor

openshift-ci bot commented Sep 18, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: harish2704
Once this PR has been reviewed and has the lgtm label, please assign baude for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Cockpit tests failed for commit 2771216d55d7fec0e86642e774fdc73f59d4a409. @martinpitt, @jelly, @mvollmer please check.

@martinpitt
Contributor

This "breaks" cockpit-podman's testHealthcheckSystem, which assumes the current behavior of getting regular "health check ran and passed" events. I don't have a good gut feeling whether the regular "pings of life" are expected/by design or considered noise. So I'll defer to the review of the podman developers, and if this change is approved, I'll adjust cockpit-podman's tests.

Member

@Luap99 Luap99 left a comment


I don't have a good gut feeling whether the regular "pings of life" are expected/by design or considered noise

I don't think there is a formal design for how it should work, but the way it works today is how current users will expect it to work. The fact that it breaks cockpit testing is a good sign that we have users depending on it. As such, I consider this a breaking change that is not suitable for a minor version, so it would have to wait for Podman 6.0 if we want to do it at all. I think reducing the event spam is a good idea in general.

One thing we should consider: if Docker doesn't behave this way, our Docker compat API should not behave this way either. One option would be to add a new field to the health_status event that is set when the status changes; then we can filter out the events that do not have it set, so that Docker clients work correctly at least.

cc @mheon @Honny1 In case you have opinions as you have been working on other healthcheck events work

Review thread on libpod/healthcheck.go (outdated, resolved)
Member

@Honny1 Honny1 left a comment


Why not; reducing the noise in the logs is fine, but I would wait for Podman 6.0. I'd also be in favor of adding a flag to enable log "saving" for each run, for podman healthcheck run (probably also for podman run), for debugging purposes or if something goes wrong.

@harish2704
Author

@Luap99

I don't think there is a formal design on how it should work but the way it works today is how the current users will expect it to work.

I have to agree on this. I searched for such documentation ( https://docs.docker.com/reference/api/engine/version/v1.41/#tag/System/operation/SystemEvents ) and there is no description of the exact behavior of those events; only the list of valid events is provided.

The fact that it breaks cockpit testing is a good sign that we have users depending on it. As such I consider this a breaking change that is not suitable for a minor version so this would have to wait for podman 6.0 if we want to do that at all

IMHO, Podman has a huge opportunity as a secure Docker replacement, and in that sense most users (and applications targeting Docker, e.g. Traefik) will expect it to work like Docker. For the user base looking to switch to Podman, this should be considered a bug. Please note that I am not referring to the podman CLI tool here; I am talking about the Docker-compatible API that podman provides.

One thing we should consider: if Docker doesn't behave this way, our Docker compat API should not behave this way either. One option would be to add a new field to the health_status event that is set when the status changes; then we can filter out the events that do not have it set, so that Docker clients work correctly at least.

Exactly, this is the point I was trying to convey. I checked how podman-docker works, but it turned out to be a simple bash wrapper script around the podman CLI (at least on my Fedora 40).
So, if my understanding is correct (i.e., podman-docker is just an alias/wrapper for podman), then podman-docker does not expose a separate Docker-compatible API (either via a unix socket or via TCP).
All we have is the API exposed by podman, and it is expected to be Docker-compatible.

In short, my bug report is not about the behavior of the podman CLI tool; it is about the Docker-compatible API that podman exposes.
If any of you can provide hints on how to fix this issue without changing the behavior of the podman CLI, I am happy to try that. To be specific: at which point is the Go event converted to HTTP API data?
I will try to add the additional flag you mentioned in the comment.

@mheon
Member

mheon commented Sep 19, 2024

IMO, this should not be the default. I would not mind adding a config field in containers.conf to enable this more minimal events output, but we should not (and, as @Luap99 pointed out, cannot without a major version) do this by default.

@Luap99
Member

Luap99 commented Sep 19, 2024

@harish2704 Our API socket is split into two parts: the normal Docker API endpoints and our libpod endpoints, which all start with /version/libpod/..., so all the other Docker-compatible endpoints can and should be changed to match the Docker API as closely as possible.

Look into pkg/api/handlers/compat/events.go; we use that code for both endpoints, but if you look at the logic there you will find a !utils.IsLibpodRequest(r) check, so you can change the behavior based on that. Of course, because the event stream cannot know when such a health check state change happens, you need to add this info into the event itself so you can filter on it there, which is why I suggested adding this field or attribute to the event type.
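A minimal sketch of the filtering described above, using simplified stand-ins for podman's event type and the request check (the real code lives in libpod/events and pkg/api/handlers/compat/events.go; names here are illustrative, not podman's actual definitions):

```go
package main

import "fmt"

// Event is a simplified stand-in for podman's event type.
// IsHealthStatusChanged is the proposed new field.
type Event struct {
	Status                string
	IsHealthStatusChanged bool
}

// shouldEmit models the proposed check: libpod clients keep the full
// stream, while Docker-compat clients only see health_status events
// that carry an actual state change.
func shouldEmit(isLibpodRequest bool, evt Event) bool {
	if isLibpodRequest {
		return true
	}
	if evt.Status == "health_status" && !evt.IsHealthStatusChanged {
		return false
	}
	return true
}

func main() {
	unchanged := Event{Status: "health_status", IsHealthStatusChanged: false}
	changed := Event{Status: "health_status", IsHealthStatusChanged: true}
	fmt.Println(shouldEmit(true, unchanged))  // libpod endpoint keeps every event
	fmt.Println(shouldEmit(false, unchanged)) // compat endpoint drops it
	fmt.Println(shouldEmit(false, changed))   // compat endpoint keeps state changes
}
```

The key point is that the decision needs both the request origin and the per-event flag, which is why the flag has to travel inside the event itself.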


Cockpit tests failed for commit 5b8d32d26bc1eebc45d82b40e9866a0459a6a500. @martinpitt, @jelly, @mvollmer please check.

@github-actions github-actions bot added the kind/api-change Change to remote API; merits scrutiny label Sep 20, 2024

Cockpit tests failed for commit 21b6aa68c0f65c4385fad0941c08fa14e9de8dfb. @martinpitt, @jelly, @mvollmer please check.

@harish2704 harish2704 force-pushed the fix-reduce-duplicate-health-check-events branch from 21b6aa6 to 3ec4a74 Compare September 20, 2024 19:11
Emit event only if there is a change in health_status
Fixes containers#24003

Resolves containers#24005 (comment)

Pass additional isChanged flag to event creation function

Fix health check events for docker api
Signed-off-by: Harish Karumuthil <[email protected]>
@harish2704 harish2704 force-pushed the fix-reduce-duplicate-health-check-events branch from 3ec4a74 to cecdca7 Compare September 20, 2024 19:12
@harish2704
Author

@Luap99
I made the changes as you suggested. In short:

  1. Added an extra flag to the Event struct to indicate whether the health status changed
  2. Marshal and parse that flag when writing to and reading from the journald log
  3. Check that flag before publishing an event in the Docker-compatible API; other APIs are not touched

That is, if we run `podman events` we still see the old behavior with an event for every health check run, but if we run `env DOCKER_HOST=unix://<path-to-podman.sock> docker events` the noise is gone: events are emitted only when there is a state change.
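A rough sketch of step 2 above. Journald stores every field as a string, so the boolean flag has to be serialized on write and parsed back on read; the field names below are hypothetical stand-ins, not podman's actual journald keys (those live in libpod/events):

```go
package main

import (
	"fmt"
	"strconv"
)

// writeFields models serializing the event into journald string fields.
func writeFields(healthStatus string, isChanged bool) map[string]string {
	return map[string]string{
		"PODMAN_HEALTH_STATUS":         healthStatus,
		"PODMAN_HEALTH_STATUS_CHANGED": strconv.FormatBool(isChanged),
	}
}

// readIsChanged models parsing the flag back out of a journal entry.
func readIsChanged(fields map[string]string) (bool, error) {
	raw, ok := fields["PODMAN_HEALTH_STATUS_CHANGED"]
	if !ok {
		// Entries written before the flag existed: treat them as changed
		// so no event is silently dropped for old journal data.
		return true, nil
	}
	return strconv.ParseBool(raw)
}

func main() {
	fields := writeFields("unhealthy", true)
	changed, err := readIsChanged(fields)
	if err != nil {
		panic(err)
	}
	fmt.Println(changed) // true
}
```

The fallback for entries missing the field is one way to stay backward compatible with journals written by older podman versions.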

Signed-off-by: Harish Karumuthil <[email protected]>

Cockpit tests failed for commit 3ec4a744dbd9f95e5667b242f58ec110dc4049bc. @martinpitt, @jelly, @mvollmer please check.


Cockpit tests failed for commit cecdca7. @martinpitt, @jelly, @mvollmer please check.


Cockpit tests failed for commit 4597408. @martinpitt, @jelly, @mvollmer please check.

Member

@Luap99 Luap99 left a comment


This will need some API-level tests to ensure it is working; have a look at test/apiv2.

Comment on lines +62 to +63
// Whether state change happened for HealthStatus or not
IsHealthStatusChanged bool
Member


This needs to be at least omitempty like the other fields so it doesn't show up on non-healthcheck events. Maybe it would be better to add this into the Attributes map instead; @mheon WDYT?

Author


By reading the code, I understood that the Attributes map is fully dedicated to storing the container's labels.
That is why I decided not to use it. (Initially I had thought to store this in the Attributes map.)

Author


@Luap99, @mheon: Please let me know the correct place to store this data. I will rebase, make the changes, and test accordingly.

Before:

    func (c *Container) newContainerHealthCheckEvent(healthStatus string) {
        if err := c.newContainerEventWithInspectData(events.HealthStatus, healthStatus, false); err != nil {

After:

    func (c *Container) newContainerHealthCheckEvent(healthStatus string, isHcStatusChanged bool) {
        if err := c.newContainerEventWithInspectData(events.HealthStatus, events.EventMetadata{HealthStatus: healthStatus, IsHealthStatusChanged: isHcStatusChanged}, false); err != nil {
Member


This will cause conflicts with #23900, not sure what is best here and which one should be merged first.

Member


I would merge #23900 first. I think this PR will need more care.

    @@ -93,6 +93,10 @@ func GetEvents(w http.ResponseWriter, r *http.Request) {
            if evt == nil {
                continue
            }
    +       if evt.Status == events.HealthStatus && !evt.IsHealthStatusChanged {
Member


This needs a !utils.IsLibpodRequest(r) check like the other Docker compat changes below, as this function is used for both the Docker and libpod endpoints.

@openshift-merge-robot
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2024

A friendly reminder that this PR had no activity for 30 days.

Labels
kind/api-change Change to remote API; merits scrutiny needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note stale-pr
Successfully merging this pull request may close these issues.

health_status events are too noisy/redundant
6 participants