add liveness/readiness probes #140
Comments
This can be generalized to any message producer in a deployment.
Except for the user-facing API.
Hi there 👋, I'm willing to help you with adding liveness and readiness probes. Do you have 5 minutes to onboard me on your release management process? Thanks,
Hi! Thanks for your interest! We deploy from the https://github.com/thoth-station/thoth-application repo; you can find the package-update related bits in https://github.com/thoth-station/thoth-application/blob/master/package-update/base/cronjob.yaml There is an already existing liveness probe that could be changed. It's worth considering whether the liveness probe should be more sophisticated. See also the related discussion at robinhood/faust#286. Sadly we do not have any public instance accessible to let you test your changes, but we can definitely cooperate. F.
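As an illustration of what such a probe might look like on the CronJob's pod template, here is a rough sketch; the heartbeat file path `/tmp/liveness`, the timings, and the image name are assumptions, not the actual contents of the cronjob.yaml linked above.

```yaml
# Sketch only: the image name, file path, and timings are illustrative
# assumptions, not the actual contents of package-update/base/cronjob.yaml.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: package-update
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: package-update
              image: package-update:latest  # placeholder image reference
              livenessProbe:
                # Fail if the worker has not touched its heartbeat file in
                # the last 5 minutes, i.e. the Python process is likely hung.
                exec:
                  command:
                    - /bin/sh
                    - -c
                    - test -n "$(find /tmp/liveness -mmin -5)"
                initialDelaySeconds: 60
                periodSeconds: 60
```

This assumes the worker periodically touches the heartbeat file; a probe checking Kafka/faust connectivity would be a more sophisticated variant of the same idea.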
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
/remove-lifecycle stale
@fridex package-update is a cronjob, do probes make sense here?
It makes sense to have a mechanism to kill the pod if it failed for some reason. Previously, we had issues from time to time where a pod was stuck in pending state or in running state (but the Python interpreter was not running) due to some cluster issue. To prevent that, it might be a good idea to configure `activeDeadlineSeconds`.
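For illustration only, `activeDeadlineSeconds` could be set on the Job template roughly as in the sketch below; the schedule, durations, and image name are assumptions, not agreed settings.

```yaml
# Sketch only: all durations and names here are illustrative assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: package-update
spec:
  schedule: "0 * * * *"
  # Skip a run entirely if it could not start within 10 minutes of schedule.
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      # Terminate the Job (and its pod, even one stuck or hung) after 1 hour.
      activeDeadlineSeconds: 3600
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: package-update
              image: package-update:latest  # placeholder image reference
```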
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten
Rotten issues close after 30d of inactivity. /close
@sesheta: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten
@fridex: Reopened this issue. In response to this:
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten
Rotten issues close after 30d of inactivity. /close
@sesheta: Closing this issue. In response to this:
The bot was misbehaving at that time, as the issue was not flagged as rotten at that point. Fixing and adding a priority.
/reopen
@codificat: Reopened this issue. In response to this:
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten
/lifecycle frozen
/sig devsecops
See the comment above about using `activeDeadlineSeconds`.
Not the same level though.
Agreed. @goern could you clarify what the end goal is? In particular, re-reading the description:
-> it seems like two different things (Operations vs Testing).
from my point of view, it is an operational problem: package-update is a critical component, therefore we need to observe if it is working correctly. `activeDeadlineSeconds` seems to be a technical solution to 'an auto-healing attempt', but it does not help with observing this service. Maybe we should close this and #49 and restate it as: "As a Thoth Operator, I want to observe the package-update-job, so that I can figure out if it is being executed, and so that in the case of its failure a support issue is opened." wdygt?
On Wed, Sep 21, 2022 at 11:49:42PM -0700, Christoph Görn wrote:
> from my point of view, it is an operational problem: package-update is a critical component, therefore we need to observe if it is working correctly. `activeDeadlineSeconds` seems to be a technical solution to 'an auto-healing attempt', but it does not help with observing this service.
That sums it up. `activeDeadlineSeconds` would prevent a deadlock/livelock from going undetected for too long.
> Maybe we should close this and #49 and restate
> As a Thoth Operator,
> I want to observe the package-update-job,
> so that I can figure out if it is being executed,
> and so that in the case of its failure a support issue is opened.
> wdygt?
Yes, I think we should rephrase the issue.
The metrics are already there for observing job state (kube-state-metrics has `kube_job_failed` it seems), so we would create an alert and maybe use something like https://github.com/m-lab/alertmanager-github-receiver for hooking it into GitHub issues.
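A rough sketch of such an alerting rule (the alert name, `job_name` matcher, and thresholds are assumptions):

```yaml
# Sketch only: alert name, job_name matcher, and severity label are assumptions.
groups:
  - name: package-update.rules
    rules:
      - alert: PackageUpdateJobFailed
        # kube_job_failed comes from kube-state-metrics; it reports whether a
        # Job currently has a Failed condition.
        expr: kube_job_failed{condition="true", job_name=~"package-update.*"} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "package-update job {{ $labels.job_name }} failed"
          description: "The package-update CronJob did not complete successfully."
```

An Alertmanager route could then forward this alert to alertmanager-github-receiver so that a support issue is opened automatically.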
By the way, that approach (`activeDeadlineSeconds` or, more broadly, a timeout, which I suppose can be expressed in different ways for argo/tekton etc.) would scale to other jobs we have. For example, thoth-station/thoth-application#2604 was an instance of a job not finishing (well, the pod was crashing but was restarted).
Is your feature request related to a problem? Please describe.
add a liveness/readiness probe to the faust producer deployment, so that we can check if package-update is working.
Describe the solution you'd like
Describe alternatives you've considered
no probes
Additional context
we should be able to do basic testing of whether a new version of package-update is deployable and runnable.