Failures of default liveness / readiness probes #306
Comments
I face this issue also, anyone help please.
Could this be down to the default probe settings, which are very likely a little too aggressive?
…delay Currently the default timeout of 1 second and no initial delay are applied to the probes of the runner pods. Depending on the startup time this can cause random pod errors, causing a whole TestRun to fail. At some point it might also make sense to introduce a startupProbe to cover the longer initial startup time a k6 instance (pod) might need, instead of ever increasing the runtime liveness and readiness checks. Fixes grafana#306 Signed-off-by: Christian Rohmann <[email protected]>
…delay Currently the default timeout of 1 second and no initial delay are applied to the probes of the runner pods. Depending on the startup time this can cause random pod errors, causing a whole TestRun to fail. At some point it might also make sense to introduce a startupProbe to cover the longer initial startup time a k6 instance / pod might need, instead of ever increasing the runtime liveness and readiness checks. Since having the liveness and readiness checks be exactly the same does not make much sense (a liveness check failure causes the container to be restarted), this change also splits up those two probes to allow for more individual configuration, be it timings or what is actually checked. Fixes grafana#306 Signed-off-by: Christian Rohmann <[email protected]>
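As a rough illustration of that idea, a startupProbe for the runner container could look something like the sketch below; the endpoint, port, and timings are assumptions for the example, not the values from the PR:

```yaml
# Hypothetical startupProbe for a k6 runner container.
# Path, port, and timings are illustrative assumptions only.
startupProbe:
  httpGet:
    path: /v1/status      # assumed probe target: the k6 REST API status endpoint
    port: 6565
  periodSeconds: 2
  failureThreshold: 30    # tolerate up to ~60s of startup before liveness takes over
livenessProbe:
  httpGet:
    path: /v1/status
    port: 6565
  periodSeconds: 10
  timeoutSeconds: 3       # more forgiving than the 1s default discussed in this issue
```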
@frittentheke, thank you for looking into it and the PR 🙌 Actually, IIRC, in the forum's discussion, this issue couldn't be mitigated even with other values (at least for some users?) 🤔 For context, one can set another set of values in
Maybe, but then I wonder why Kubernetes is using those values 😄 TBH, I don't much like the solution of an increased timeout: in some other setup one would need an even larger timeout, and we cannot keep changing the timeout in code each time. That's why the options for the runner's probes are exposed, after all. OTOH, this issue was strongly upvoted, so this is clearly causing lots of issues 😞 Part of the issue actually might be that we still don't have proper docs for the CRD's options, so it's hard for people to find the available options for those probes. @frittentheke, let me ponder this topic a bit - I'll get back to you in the PR!
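For anyone looking for those options in the meantime, a rough sketch of overriding the runner probes in a TestRun manifest; the field names (runner.livenessProbe / runner.readinessProbe) and values here are assumptions for illustration, so please verify against the actual CRD:

```yaml
# Illustrative sketch only: assumed field names and values for tuning runner probes.
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: probe-tuning-example
spec:
  parallelism: 4
  script:
    configMap:
      name: my-test        # hypothetical ConfigMap holding the test script
      file: test.js
  runner:
    livenessProbe:         # assumed field mirroring a corev1.Probe
      httpGet:
        path: /v1/status
        port: 6565
      initialDelaySeconds: 10
      timeoutSeconds: 3
    readinessProbe:        # assumed field mirroring a corev1.Probe
      httpGet:
        path: /v1/status
        port: 6565
      initialDelaySeconds: 10
      timeoutSeconds: 3
```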
Thanks for your response, @yorugac!
Yes, but this is the k6 operator after all. As an operator that is starting instances of k6 (jobs -> pods), it has to be able to do this without any manual fiddling with options such as startup, liveness, and readiness probes. To me the operator should NOT expose all possible resource specs via the TestRun CR, but only a carefully curated few that are required to adapt TestRuns to different environments, and work automagically for everything else. Otherwise the operator degrades to just being a template engine for other Kubernetes resources and betrays the operator concept. To get back on topic: yes, the resource requirements and limits might differ depending on the test case, VUs, and targets, so they need adjustment. But the sheer monitoring of the k6 instances (pods) in regards to their health and readiness to run a TestRun is something the operator should totally take care of.
Let me also state that the API endpoint used to determine liveness might not have been intended to be used as a liveness check. So maybe there could also be an improvement to the k6 binary to bring up the API and a health endpoint quicker, prior to loading all of the tests? Or maybe the k6 operator should use something else to test for k6 being alive? Please also see my comment #438 (comment)
It is certainly one way to look at this domain problem. I had similar doubts myself in the past, but at this point in time I have to disagree here. If you are interested in an opinionated interface, we are currently working on one such interface -- hope that answers your questions. Let's keep the thread on the main topic 🙂
Perhaps my joke about Kubernetes' defaults was easy to misinterpret, my apologies; I'll clarify. The default values for probes weren't passed as-is in k6-operator simply because they are the defaults, but because the specifics of k6 startup were taken into account at the time: you're correct that the k6 binary itself dictates certain conditions. Perhaps something has changed since then or the estimations were wrong: as mentioned above, I'll grok the topic more and let you know in the PR. (Of course, you're very welcome to make those estimations yourself!)
However, currently there is very little, except vague guesses, that explains why the default values work on one local cluster and fail on another (as in the initial description of this issue). And nothing at all ATM explains the cases of users who failed to start a test even when configuring probes manually, as I described in my previous comment. Btw, did you experience such failures with the default values as well? If so, which setup exhibited those?
Let's break down what the config currently says:
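Assuming the probes target the k6 REST API status endpoint, the effective settings on the runner pods would be just the Kubernetes defaults, roughly:

```yaml
# Kubernetes' default probe parameters (the operator does not override them);
# the httpGet target is an assumption for this sketch.
livenessProbe:
  httpGet:
    path: /v1/status       # assumed target: k6 REST API status endpoint
    port: 6565
  initialDelaySeconds: 0   # probing starts immediately after the container starts
  periodSeconds: 10
  timeoutSeconds: 1        # a single slow response already counts as a failure
  successThreshold: 1
  failureThreshold: 3      # three consecutive failures restart the container
readinessProbe:            # same defaults; failures only mark the pod NotReady
  httpGet:
    path: /v1/status
    port: 6565
  timeoutSeconds: 1
```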
I shall try and debug / profile the response behavior a little more to provide more details and to make an educated decision on changes to the timeouts.
Several times we have received reports of the default liveness and readiness probes failing, which leads to runners not reaching the ready state:
Such an error does not allow the test to start, and the test gets stuck (at the moment, but this part will likely change as a result of #260).
The default probes are the same defaults that are set by Kubernetes, i.e. k6-operator does not add anything by itself. The k6 HTTP API starts pretty quickly for most tests, and normally it should be well reachable with the default settings. The most confounding part is that these failures seem to have been reported while using simple and basic Kubernetes setups, even local clusters.
So far it is not clear how to reproduce this case. If someone knows a trick to reproduce it, please share it in this issue.
cc @dgzlopes