Failures of default liveness / readiness probes #306

Open
yorugac opened this issue Oct 10, 2023 · 6 comments · May be fixed by #438
Labels: help wanted (Extra attention is needed), question (Further information is requested)

Comments

@yorugac
Collaborator

yorugac commented Oct 10, 2023

Several times we have received reports of failing default liveness and readiness probes which leads to runners not reaching ready state:

  Normal   Created    3m2s                   kubelet            Created container k6
  Normal   Started    3m2s                   kubelet            Started container k6
  Warning  Unhealthy  2m34s (x7 over 3m1s)   kubelet            Readiness probe failed: Get "http://10.1.0.75:6565/v1/status": dial tcp 10.1.0.75:6565: connect: connection refused
  Warning  Unhealthy  2m34s (x3 over 2m54s)  kubelet            Liveness probe failed: Get "http://10.1.0.75:6565/v1/status": dial tcp 10.1.0.75:6565: connect: connection refused
  Normal   Killing    2m34s                  kubelet            Stopping container k6

Such an error prevents the test from starting and leaves it stuck (at the moment; this part will likely change as a result of #260).

The default probes are the same defaults that Kubernetes sets, i.e. k6-operator does not add anything by itself. The k6 HTTP server starts quickly for most tests and should normally be well reachable with the default settings. The most confounding part is that these failures seem to have been reported while using simple and basic Kubernetes setups, even local clusters.

So far it is not clear how to reproduce this case. If someone knows a way to reproduce it, please share it in this issue.

cc @dgzlopes

@yorugac added the help wanted and question labels on Oct 10, 2023
@Re3v3s

Re3v3s commented Jan 21, 2024

I'm facing this issue as well; can anyone help, please?

@frittentheke
Contributor

  1. These probes are added here:

    LivenessProbe: generateProbe(k6.GetSpec().Runner.LivenessProbe),
    ReadinessProbe: generateProbe(k6.GetSpec().Runner.ReadinessProbe),

  2. Using the helper

    func generateProbe(configuredProbe *corev1.Probe) *corev1.Probe {

  which targets the status endpoint of the k6 REST API - https://k6.io/docs/misc/k6-rest-api/#get-status

  3. Since there are no options set for the probe (apart from the HTTP destination to be used) the default timeouts are applied:

[...]
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /v1/status
        port: 6565
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1

    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /v1/status
        port: 6565
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
[...]

which are very likely a little too aggressive?

  4. I am wondering what good the readiness probe currently does here anyway, since the operator is actually using the Pod phase of Running (

    if pod.Status.Phase != "Running" {

     ) to determine when to send the start command. But since a running pod is not necessarily a ready application, it might make sense to switch to using the readiness condition instead.

  5. Regarding fixing this issue here, I shall push a PR to increase the timeouts a little; see the sketch below for the rough direction.
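For illustration only, less aggressive runner probes could look roughly like this (a sketch; the concrete values are assumptions for the example, not proposed defaults):

    livenessProbe:
      httpGet:
        path: /v1/status
        port: 6565
        scheme: HTTP
      initialDelaySeconds: 5   # give the k6 REST API some time to come up
      timeoutSeconds: 3        # allow slower responses than the 1s default
      periodSeconds: 10
      failureThreshold: 5

    readinessProbe:
      httpGet:
        path: /v1/status
        port: 6565
        scheme: HTTP
      initialDelaySeconds: 5
      timeoutSeconds: 3
      periodSeconds: 10
      failureThreshold: 5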

frittentheke added a commit to frittentheke/k6-operator that referenced this issue Aug 12, 2024
…delay

Currently the default timeout of 1 second and no initial delay is applied
to the probes of the runner pods. Depending on the startup time this
can cause random Pod errors causing a whole TestRun to fail.

At some point it might also make sense to introduce a startupProbe to cover
the longer initial startup time a K6 instance (Pod) might need instead of ever
increasing the runtime liveness and readiness checks.

Fixes grafana#306

Signed-off-by: Christian Rohmann <[email protected]>
frittentheke added a commit to frittentheke/k6-operator that referenced this issue Aug 13, 2024
…delay

Currently the default timeout of 1 second and no initial delay is applied
to the probes of the runner pods. Depending on the startup time this
can cause random Pod errors causing a whole TestRun to fail.

At some point it might also make sense to introduce a startupProbe to cover
the longer initial startup time a K6 instance / pod might need instead of ever
increasing the runtime liveness and readiness checks.

Since having the Liveness and Readiness checks be exactly the same makes not much
sense, as a liveness check failure will cause the container to be restarted,
this change also splits up those two probes, to allow for more individual
configuration, be it timings or what is actually checked.

Fixes grafana#306

Signed-off-by: Christian Rohmann <[email protected]>
@yorugac
Collaborator Author

yorugac commented Aug 20, 2024

@frittentheke, thank you for looking into it and the PR 🙌

Actually, IIRC, in the forum's discussion, this issue couldn't be mitigated even with other values (at least for some users?) 🤔 For context, one can set another set of values in TestRun with .spec.runner.readinessProbe and .spec.runner.livenessProbe.
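For example, something along these lines (a sketch only; the metadata, ConfigMap name and all probe values here are purely illustrative and would need adjusting to your setup):

    apiVersion: k6.io/v1alpha1
    kind: TestRun
    metadata:
      name: probe-tuning-example     # hypothetical name
    spec:
      parallelism: 2
      script:
        configMap:
          name: my-test              # hypothetical ConfigMap holding the script
          file: test.js
      runner:
        readinessProbe:
          httpGet:
            path: /v1/status
            port: 6565
          initialDelaySeconds: 5
          timeoutSeconds: 3
        livenessProbe:
          httpGet:
            path: /v1/status
            port: 6565
          initialDelaySeconds: 5
          timeoutSeconds: 3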

which are very likely a little too aggressive?

Maybe but then I wonder why Kubernetes is using those values 😄

TBH, I don't much like the solution of an increased timeout, as in some other setup one would need an even larger timeout: we cannot keep changing the timeout in code each time. That's why the options for the runner's probes are exposed, after all.

OTOH, this issue was strongly upvoted, so it is clearly causing lots of problems 😞 Part of the issue might actually be that we still don't have proper docs for the CRD's options, so it's hard for people to find the available options for those probes.

@frittentheke, let me ponder this topic a bit - I'll get back to you in the PR!

@frittentheke
Contributor

frittentheke commented Aug 20, 2024

Thanks for your response, @yorugac!

TBH, I don't much like the solution of an increased timeout, as in some other setup one would need an even larger timeout: we cannot keep changing the timeout in code each time. That's why the options for the runner's probes are exposed, after all.

OTOH, this issue was strongly upvoted, so it is clearly causing lots of problems 😞 Part of the issue might actually be that we still don't have proper docs for the CRD's options, so it's hard for people to find the available options for those probes.

Yes, but this is the k6 operator after all. As an operator that is starting instances of k6 (jobs -> pods), it has to be able to do this without any manual fiddling with options such as startup, liveness and readiness probes.

To me, the operator should NOT expose all possible resource specs via the TestRun CR, but a carefully curated few that are required to adapt TestRuns to different environments, and work automagically for everything else. Otherwise the operator degrades to just being a template engine for other Kubernetes resources and betrays the operator concept.

To get back to the topic: yes, the resource requirements and limits might differ depending on the test case, VUs and targets, so they need adjustment. But the sheer monitoring of the k6 instances (pods) with regard to their health and readiness to run a TestRun is something that an operator should totally take care of.

which are very likely a little too aggressive?
Maybe but then I wonder why Kubernetes is using those values 😄

Liveness and readiness probes are simple monitoring mechanisms that certainly require adaptation to each application. This covers the endpoint that is queried, but also the frequency and timeouts. If the k6 startup time is highly dependent on e.g. the tests that are loaded, and the liveness check might need a somewhat longer backoff, a startup probe (phase) might be a better approach. But again, to me this should be something the operator can handle, as it knows k6 best.
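To sketch the idea (the values here are assumptions, not proposed defaults), a startup probe on the runner container could look like this; the kubelet only starts running the liveness and readiness probes once the startup probe has succeeded, so the regular checks can stay tight:

    startupProbe:
      httpGet:
        path: /v1/status
        port: 6565
      periodSeconds: 2
      failureThreshold: 30     # allows up to ~60s for k6 to bring up its REST API
    livenessProbe:
      httpGet:
        path: /v1/status
        port: 6565
      periodSeconds: 10
      timeoutSeconds: 3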

Let me also state that the API endpoint used to determine liveness might not have been intended to be used as a liveness check. So maybe there could also be an improvement to the k6 binary to bring up the API and a health endpoint quicker, prior to loading all of the tests? Or maybe the k6 operator should use something else to test for k6 being alive? ...

Please also see my comment #438 (comment)

@yorugac
Collaborator Author

yorugac commented Aug 22, 2024

To me, the operator should NOT expose all possible resource specs via the TestRun CR, but a carefully curated few that are required to adapt TestRuns to different environments, and work automagically for everything else.

It is certainly one way to look at this domain problem. I had similar doubts myself in the past. But at this point in time, I have to disagree here, as the TestRun CRD has naturally evolved to allow as much flexibility as possible; it was never meant to be an "auto-magical" interface. There were multiple reasons why it evolved this way. We are quite unlikely to change this now, esp. when there hasn't been a strong enough request about it from the community, as far as we know.
IMO, the main value of TestRun actually lies in coordinating distributed test runs, not in making "smart" decisions about the user's infrastructure.

If you are interested in an opinionated interface, then we are currently working on one such interface -- PrivateLoadZone, as part of the Cloud solution.

Hope that answers your questions. Let's keep the thread on the main topic 🙂


Perhaps my joke about Kubernetes' defaults was easy to misinterpret, my apologies; I'll clarify. The default values for the probes weren't passed as-is in k6-operator simply because they are the defaults, but because the specifics of k6 startup were taken into account at the time: you're correct that the k6 binary itself dictates certain conditions. Perhaps something has changed since then, or the estimations were wrong: as mentioned above, I'll grok the topic more and let you know in the PR. (Of course, you're very welcome to make those estimations yourself!)

However, currently there is very little, except vague guesses, that explains why the default values work on one local cluster and fail on another (as in the initial description of this issue). And nothing at all ATM explains the cases of users who failed to start a test even when configuring the probes manually, as I described in my previous comment.

Btw, did you experience such failures with default values as well? If so, which setup exhibited those?

@frittentheke
Contributor

Perhaps something has changed since then, or the estimations were wrong: as mentioned above, I'll grok the topic more and let you know in the PR. (Of course, you're very welcome to make those estimations yourself!)
[...]
However, currently there is very little, except vague guesses, that explains why the default values work on one local cluster and fail on another (as in the initial description of this issue). And nothing at all ATM explains the cases of users who failed to start a test even when configuring the probes manually, as I described in my previous comment.

Btw, did you experience such failures with default values as well? If so, which setup exhibited those?

Let's break down what the config currently says:

    delay=0s timeout=1s period=10s #success=1 #failure=3

k6 has to respond within one second, otherwise the liveness check is already considered failed.

I shall try and debug / profile the response behavior a little more to provide more details and to make an educated decision on changes to the timeouts.
