
add liveness/readiness probes #140

Open
goern opened this issue Sep 25, 2020 · 29 comments
Labels
good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
hacktoberfest: Issues targeting the hacktoberfest participants.
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/backlog: Higher priority than priority/awaiting-more-evidence.
sig/devsecops: Categorizes an issue or PR as relevant to SIG DevSecOps.
thoth/group-programming: This issue could be used for group programming, offer or request.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@goern
Member

goern commented Sep 25, 2020

Is your feature request related to a problem? Please describe.
Add a liveness/readiness probe to the Faust producer deployment, so that we can check whether package-update is working.

Describe the solution you'd like

  • add a readiness probe that succeeds if package-update is exposing metrics
  • add a liveness probe that succeeds if package-update is connected to the graph database and the Kafka broker
  • rethink whether this is a good approach
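A rough sketch of what such probes might look like on the producer Deployment. This is only illustrative: the port, endpoint paths, and timings are assumptions rather than values from the actual deployment, and a /health endpoint reporting graph and Kafka connectivity would have to exist (or be added) in package-update first:

```yaml
# Hypothetical probe configuration for the package-update producer container.
# Port, paths, and timings are assumptions, not the real deployment values.
readinessProbe:
  httpGet:
    path: /metrics        # ready once the metrics endpoint responds
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health         # hypothetical endpoint checking graph + Kafka connectivity
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 30
  failureThreshold: 3     # restart only after 3 consecutive failures
```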

Describe alternatives you've considered
no probes

Additional context
We should be able to do basic testing of whether a new version of package-update is deployable and runnable.

@goern goern added enhancement good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. hacktoberfest Issues targeting the hacktoberfest participants. thoth/group-programming This issue could be used for group programming, offer or request. kind/feature Categorizes issue or PR as related to a new feature. labels Sep 25, 2020
@fridex
Contributor

fridex commented Sep 25, 2020

This can be generalized to any message producer in deployment.

@fridex
Contributor

fridex commented Sep 25, 2020

https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html

@fridex
Contributor

fridex commented Sep 25, 2020

This can be generalized to any message producer in deployment.

Except for the user-facing API.

@eagleusb

Hi there 👋, I'm willing to help you with adding liveness and readiness probes to the package-update-job CronJob in your OKD cluster.

Do you have 5 minutes to onboard me on your release management process?
Additionally, where does the package-update-job Kubernetes recipe live?

Thanks,
Leslie

@fridex
Contributor

fridex commented Sep 25, 2020

Hi!

thanks for your interest!

We deploy from the https://github.com/thoth-station/thoth-application repo; you can find the package-update-related bits in https://github.com/thoth-station/thoth-application/blob/master/package-update/base/cronjob.yaml

There is already an existing liveness probe that could be changed:

https://github.com/thoth-station/thoth-application/blob/c29bfd2334bd9d1c63137938932699b954fd47d9/package-update/base/cronjob.yaml#L91-L96

It's worth considering whether the liveness probe should be more sophisticated. See also the related discussion at robinhood/faust#286.

Sadly, we do not have any publicly accessible instance where you could test your changes, but we can definitely cooperate.

F.

CC @KPostOffice @saisankargochhayat @pacospace

@sesheta
Member

sesheta commented Apr 29, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2021
@fridex
Contributor

fridex commented Apr 30, 2021

/remove-lifecycle stale

@sesheta sesheta removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2021
@KPostOffice
Member

@fridex package-update is a cronjob, do probes make sense here?

@fridex
Copy link
Contributor

fridex commented Jun 9, 2021

@fridex package-update is a cronjob, do probes make sense here?

It makes sense to have a mechanism to kill the pod if it fails for some reason. Previously, we occasionally had issues where a pod was stuck in a Pending state, or in a Running state with the Python interpreter not actually running, due to some cluster issue. To prevent that, it might be a good idea to configure activeDeadlineSeconds, also for the other CronJobs we have.
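For reference, a minimal sketch of where activeDeadlineSeconds would go in a CronJob spec. The name, schedule, image, and the one-hour deadline are placeholder values, not recommendations for the actual package-update CronJob:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: package-update             # placeholder name
spec:
  schedule: "0 * * * *"            # placeholder schedule
  jobTemplate:
    spec:
      # Terminate the whole Job (and its pods) if it is active for more than
      # one hour, covering pods stuck in Pending or hung while Running.
      activeDeadlineSeconds: 3600
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: package-update
              image: package-update:latest   # placeholder image
```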

@sesheta
Member

sesheta commented Jul 15, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 15, 2021
@sesheta
Member

sesheta commented Aug 24, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta sesheta closed this as completed Aug 24, 2021
@sesheta
Member

sesheta commented Aug 24, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fridex
Contributor

fridex commented Aug 24, 2021

/remove-lifecycle rotten
/reopen
/triage accepted

@sesheta sesheta added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Aug 24, 2021
@sesheta sesheta reopened this Aug 24, 2021
@sesheta
Member

sesheta commented Aug 24, 2021

@fridex: Reopened this issue.

In response to this:

/remove-lifecycle rotten
/reopen
/triage accepted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sesheta sesheta removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 24, 2021
@sesheta
Member

sesheta commented Sep 23, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta
Member

sesheta commented Sep 23, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta sesheta added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 23, 2021
@sesheta sesheta closed this as completed Sep 23, 2021
@sesheta
Member

sesheta commented Sep 23, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@codificat
Member

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

The bot was misbehaving at that time, as the issue was not flagged as rotten at that point. Fixing that, and adding a priority:

/reopen
/remove-lifecycle rotten
/priority backlog

@sesheta
Member

sesheta commented Oct 29, 2021

@codificat: Reopened this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

The bot was misbehaving at that time, as the issue was not flagged as rotten at that point. Fixing that, and adding a priority:

/reopen
/remove-lifecycle rotten
/priority backlog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sesheta sesheta added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Oct 29, 2021
@sesheta sesheta reopened this Oct 29, 2021
@sesheta sesheta removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 29, 2021
@sesheta
Member

sesheta commented Jan 27, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2022
@fridex
Contributor

fridex commented Jan 27, 2022

/remove-lifecycle stale

@sesheta sesheta removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2022
@sesheta
Member

sesheta commented Apr 27, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2022
@sesheta
Member

sesheta commented May 27, 2022

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2022
@codificat
Member

/lifecycle frozen

@sesheta sesheta added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 8, 2022
@VannTen
Member

VannTen commented Aug 30, 2022

/sig devsecops
I don't think we receive traffic in the pod, so a readiness probe would not really make sense.
For a liveness probe, isn't it simpler to simply exit with and error code while logging the error ? Instead of adding code to check, since it's only to SIGTERM anyway...

@sesheta sesheta added the sig/devsecops Categorizes an issue or PR as relevant to SIG DevSecOps. label Aug 30, 2022
@KPostOffice
Member

/sig devsecops I don't think we receive traffic in the pod, so a readiness probe would not really make sense. For a liveness probe, isn't it simpler to just exit with an error code while logging the error, instead of adding probe-handling code, since a failing probe only results in a SIGTERM anyway?

See the comment above about using activeDeadlineSeconds instead. This issue could probably do with an edit.

@VannTen
Member

VannTen commented Sep 16, 2022

It makes sense to have a mechanism to kill the pod if it fails for some reason. Previously, we occasionally had issues where a pod was stuck in a Pending state, or in a Running state with the Python interpreter not actually running, due to some cluster issue. To prevent that, it might be a good idea to configure activeDeadlineSeconds, also for the other CronJobs we have.

Not at the same level, though. activeDeadlineSeconds fails the whole Job, regardless of the pod(s). Excerpt from the docs: "The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded."
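To make the distinction concrete, Kubernetes has both a Job-level and a Pod-level activeDeadlineSeconds, at different places in the spec (the values below are arbitrary examples):

```yaml
jobTemplate:
  spec:
    # Job-level: caps the whole Job, across all pods and retries; on expiry all
    # running pods are terminated and the Job becomes Failed / DeadlineExceeded.
    activeDeadlineSeconds: 3600
    template:
      spec:
        # Pod-level: caps each individual pod's runtime.
        activeDeadlineSeconds: 1800
```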

This issue could probably do with an edit.

Agreed. @goern, could you clarify what the end goal is? In particular, re-reading the description:

add a liveness/readiness probe to faust producer deployment, so that we can check if package-update is working.

we should be able to do a basic testing if a new version of package-update is deployable and runnable.

These seem like two different things (operations vs. testing).

@goern
Member Author

goern commented Sep 22, 2022

From my point of view, it is an operational problem: package-update is a critical component, therefore we need to observe whether it is working correctly. activeDeadlineSeconds seems to be a technical solution for 'an auto-healing attempt', but it does not help with observing this service.

Maybe we should close this and #49 and restate it as:

As a Thoth Operator,
I want to observe the package-update-job,
so that I can figure out if it is being executed,
and so that in the case of its failure a support issue is opened.

wdygt?
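If the issue is restated around observing the CronJob, one possible approach (assuming kube-state-metrics and the Prometheus Operator are available in the cluster, which this thread does not confirm) would be an alerting rule on the Job metrics rather than probes:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: package-update-alerts        # illustrative name
spec:
  groups:
    - name: package-update
      rules:
        - alert: PackageUpdateJobFailed
          # kube-state-metrics exposes kube_job_status_failed per Job
          expr: kube_job_status_failed{job_name=~"package-update.*"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "package-update job failed; a support issue may be needed"
```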

@VannTen
Member

VannTen commented Sep 22, 2022 via email
