Failures of default liveness / readiness probes #306
Comments
I face this issue also, anyone help please.
Could this be down to the default probe settings, which are very likely a little too aggressive?
…delay Currently the default timeout of 1 second and no initial delay are applied to the probes of the runner pods. Depending on the startup time this can cause random pod errors, causing a whole TestRun to fail. At some point it might also make sense to introduce a startupProbe to cover the longer initial startup time a k6 instance (pod) might need, instead of ever increasing the runtime liveness and readiness checks. Fixes grafana#306 Signed-off-by: Christian Rohmann <[email protected]>
…delay Currently the default timeout of 1 second and no initial delay are applied to the probes of the runner pods. Depending on the startup time this can cause random pod errors, causing a whole TestRun to fail. At some point it might also make sense to introduce a startupProbe to cover the longer initial startup time a k6 instance / pod might need, instead of ever increasing the runtime liveness and readiness checks. Since having the liveness and readiness checks be exactly the same does not make much sense (a liveness check failure causes the container to be restarted), this change also splits up those two probes to allow for more individual configuration, be it timings or what is actually checked. Fixes grafana#306 Signed-off-by: Christian Rohmann <[email protected]>
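As a rough illustration of that idea, a startupProbe for the runner container could look something like the sketch below; the endpoint, port, and timings are assumptions for the example, not the values from the PR:

```yaml
# Hypothetical startupProbe for a k6 runner container.
# Path, port, and timings are illustrative assumptions only.
startupProbe:
  httpGet:
    path: /v1/status      # assumed probe target: the k6 REST API status endpoint
    port: 6565
  periodSeconds: 2
  failureThreshold: 30    # tolerate up to ~60s of startup before liveness takes over
livenessProbe:
  httpGet:
    path: /v1/status
    port: 6565
  periodSeconds: 10
  timeoutSeconds: 3       # more forgiving than the 1s default discussed in this issue
```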
@frittentheke, thank you for looking into it and the PR 🙌 Actually, IIRC, in the forum's discussion, this issue couldn't be mitigated even with other values (at least for some users?) 🤔 For context, one can set another set of values in
Maybe, but then I wonder why Kubernetes is using those values 😄 TBH, I don't much like the solution of an increased timeout: in some other setup one would need an even larger timeout, and we cannot keep changing the timeout in code each time. That's why the options for the runner's probes are exposed, after all. OTOH, this issue was strongly upvoted, so this is clearly causing lots of issues 😞 Part of the issue actually might be that we still don't have proper docs for the CRD's options, so it's hard for people to find the available options for those probes. @frittentheke, let me ponder this topic a bit - I'll get back to you in the PR!
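For anyone looking for those options in the meantime, a rough sketch of overriding the runner probes in a TestRun manifest; the field names (runner.livenessProbe / runner.readinessProbe) and values here are assumptions for illustration, so please verify against the actual CRD:

```yaml
# Illustrative sketch only: assumed field names and values for tuning runner probes.
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: probe-tuning-example
spec:
  parallelism: 4
  script:
    configMap:
      name: my-test        # hypothetical ConfigMap holding the test script
      file: test.js
  runner:
    livenessProbe:         # assumed field mirroring a corev1.Probe
      httpGet:
        path: /v1/status
        port: 6565
      initialDelaySeconds: 10
      timeoutSeconds: 3
    readinessProbe:        # assumed field mirroring a corev1.Probe
      httpGet:
        path: /v1/status
        port: 6565
      initialDelaySeconds: 10
      timeoutSeconds: 3
```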
Thanks for your response, @yorugac!
Yes, but this is the k6 operator after all. As an operator that is starting instances of k6 (jobs -> pods), it has to be able to do this without any manual fiddling with options such as startup, liveness, and readiness probes. To me the operator should NOT expose all possible resource specs via the TestRun CR, but only a carefully curated few that are required to adapt TestRuns to different environments, and work automagically for everything else. Otherwise the operator degrades to just being a template engine for other Kubernetes resources and betrays the operator concept. To get back on topic: yes, the resource requirements and limits might differ depending on the test case, VUs, and targets, so they need adjustment. But the sheer monitoring of the k6 instances (pods) in regards to their health and readiness to run a TestRun is something the operator should totally take care of.
Let me also state that the API endpoint used to determine liveness might not have been intended to be used as a liveness check. So maybe there could also be an improvement to the k6 binary to bring up the API and a health endpoint quicker, prior to loading all of the tests? Or maybe the k6 operator should use something else to test for k6 being alive? Please also see my comment #438 (comment)
It is certainly one way to look at this domain problem. I had similar doubts myself in the past, but at this point in time I have to disagree here. If you are interested in an opinionated interface, we are currently working on one such interface -- hope that answers your questions. Let's keep the thread on the main topic 🙂
Perhaps my joke about Kubernetes' defaults was easy to misinterpret, my apologies; I'll clarify. The default values for probes weren't passed as-is in k6-operator simply because they are the defaults, but because the specifics of k6 startup were taken into account at the time: you're correct that the k6 binary itself dictates certain conditions. Perhaps something has changed since then or the estimations were wrong: as mentioned above, I'll grok the topic more and let you know in the PR. (Of course, you're very welcome to make those estimations yourself!)
However, currently there is very little, except vague guesses, that explains why the default values work on one local cluster and fail on another (as in the initial description of this issue). And nothing at all ATM explains the cases of users who failed to start a test even when configuring probes manually, as I described in my previous comment. Btw, did you experience such failures with the default values as well? If so, which setup exhibited those?
Let's break down what the config currently says:
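Assuming the probes target the k6 REST API status endpoint, the effective settings on the runner pods would be just the Kubernetes defaults, roughly:

```yaml
# Kubernetes' default probe parameters (the operator does not override them);
# the httpGet target is an assumption for this sketch.
livenessProbe:
  httpGet:
    path: /v1/status       # assumed target: k6 REST API status endpoint
    port: 6565
  initialDelaySeconds: 0   # probing starts immediately after the container starts
  periodSeconds: 10
  timeoutSeconds: 1        # a single slow response already counts as a failure
  successThreshold: 1
  failureThreshold: 3      # three consecutive failures restart the container
readinessProbe:            # same defaults; failures only mark the pod NotReady
  httpGet:
    path: /v1/status
    port: 6565
  timeoutSeconds: 1
```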
I shall try and debug / profile the response behavior a little more to provide more details and to make an educated decision on changes to the timeouts.
Several times we have received reports of the default liveness and readiness probes failing, which leads to runners not reaching the ready state:
Such an error does not allow the test to start, and the test gets stuck (at the moment, but this part will likely change as a result of #260).
The default probes are the same defaults that are set by Kubernetes, i.e. k6-operator does not add anything by itself. The k6 HTTP API starts pretty quickly for most tests, and normally it should be well reachable with the default settings. The most confounding part is that these failures seem to have been reported while using simple and basic Kubernetes setups, even local clusters.
So far it is not clear how to reproduce this case. If someone knows a trick to reproduce it, please share it in this issue.
cc @dgzlopes