Allow resuming and suspending tests by changing a live `TestRun` resource #457

LCaparelli · 2024-08-29T18:10:04Z

This feature request was initially described as a bugfix due to a misunderstanding of the spec.paused field.

Initial report

Brief summary

While the operator correctly propagates the configuration as part of the flags for the runner:

k6-operator/pkg/resources/jobs/runner.go

Lines 61 to 68 in 0f3957c

    
           paused := true 
        
           if k6.GetSpec().Paused != "" { 
        
           	paused, _ = strconv.ParseBool(k6.GetSpec().Paused) 
        
           } 
        
           if paused { 
        
           	command = append(command, "--paused") 
        
           }

It makes no checks prior to creating the starter:

k6-operator/controllers/testrun_controller.go

Lines 193 to 197 in 0f3957c

    
           case "initialized": 
        
           	return CreateJobs(ctx, log, k6, r) 
        
           case "created": 
        
           	return StartJobs(ctx, log, k6, r)

Neither functions above make any checks regarding spec.paused AFAICT.

Since the starter job always unpauses the the runner:

k6-operator/pkg/resources/containers/curl_start.go

Lines 15 to 30 in 0f3957c

    
           func NewStartContainer(hostnames []string, image string, imagePullPolicy corev1.PullPolicy, command []string, env []corev1.EnvVar, securityContext corev1.SecurityContext) corev1.Container { 
        
           	req, _ := json.Marshal( 
        
           		types.StatusAPIRequest{ 
        
           			Data: types.StatusAPIRequestData{ 
        
           				Attributes: types.StatusAPIRequestDataAttributes{ 
        
           					Paused: false, 
        
           				}, 
        
           				ID:   "default", 
        
           				Type: "status", 
        
           			}, 
        
           		}) 
        
           	var parts []string 
        
           	for _, hostname := range hostnames { 
        
           		parts = append(parts, fmt.Sprintf("curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://%s:6565/v1/status -d '%s'", hostname, req)) 
        
           	}

We end up with spec.paused being ignored (technically, partially ignored because the --paused flag does get added to the runner arguments, but in practice it always unpauses due to how the starter behaves).

Suggested implementation

We might solve this by just adding an if statement when the status is "created", guarding the StartJobs function:

	case "created":
                if k6.IsPaused() {
                        // nothing to do. When the user updates spec.paused a new reconciliation will trigger and we'll check again
                        return ctrl.Result{}, nil 
                }
		return StartJobs(ctx, log, k6, r)

Of course, this implies that changing spec.paused after the test has started continues to have no effect, but that seems sensible to me.

k6-operator version or image

0.0.16

Helm chart version (if applicable)

No response

TestRun / PrivateLoadZone YAML

Anything with spec.paused set to 'true'.

Other environment details (if applicable)

No response

Steps to reproduce the problem

Create a TestRun with spec.paused set to 'true'
Wait for jobs to run

Expected behaviour

Runner does not start the test before I set spec.paused to 'false' or null.

Actual behaviour

Starter behaves exactly as if spec.paused was set to 'false' or null, and immediately starts the runner.

Feature Description

It would be nice if we could edit a live TestRun resource to suspend or resume it, similarly to what Kubernetes does with Jobs.

Already existing or connected issues / PRs (optional)

No response

The text was updated successfully, but these errors were encountered:

LCaparelli · 2024-08-29T18:12:14Z

I am willing to work on this fix, this feature is important to us as we'd like our developers to be able to declare their tests beforehand via GitOps and only start them when appropriate (which we'll achieve using ArgoCD's Custom Resource Actions, adding a resume button next to the resource that can trigger the test imperatively).

yorugac · 2024-09-05T10:31:06Z

Hi @LCaparelli, thanks for the issue! I think there is a misunderstanding about expected behaviour. spec.paused is true by default in k6-operator. So you setting it to true in TestRun spec indeed would change nothing: that's expected.

In standalone k6, --paused is false by default: not so in k6-operator. Because unlike standalone k6, k6-operator targets logic of distributed tests by default. So paused is true by default for all test runs, because that's how we ensure that all runners start at the same time via starter. IOW, it's the default logic for any distributed test run.

If someone wants to change that logic, they can set spec.paused: false: then, runners would start immediately upon creation. Given that there are delays between scheduling of the pods, the runners will end up starting at different time points. In general, that is not recommended for distributed test runs, because it skews the results of the test. It's an "advanced" option, one could say.

So it seems to me like there's a lack of clarify about expected behaviour here 🤔 Obviously, spec.paused is 1-1 mapping to k6 option. k6 docs for the relevant option:
https://grafana.com/docs/k6/latest/using-k6/k6-options/reference/#paused
Perhaps, we could improve UX here, to make it clearer that spec.paused is true by default.

Could you please describe in greater details your use case? I.e. what you were trying to achieve exactly and why changing paused seemed like the right solution?

yorugac · 2024-09-05T10:35:49Z

Just for completeness sake. If one sets spec.paused: false, it does make starter logic redundant. Starter pod is executed but runners are already running on their own. So we could optimize that for this specific case, as you suggested, IIUC. IOW, there is no need to schedule starter pod with spec.paused: false. But that would be an optimization of resources we spend per TestRun for this specific case: it will not change the default logic.

LCaparelli · 2024-09-05T16:52:05Z

Hey @yorugac thanks for taking some time to go over this, I appreciate it!

You're correct that I initially misunderstood the default, I realized that while implementing #458.

I think this misunderstanding doesn't change my report though: regardless of what I set to .spec.paused, the starter job ALWAYS runs, and always unpauses the test. Hence my suggested fix: don't run the starter job if spec.paused is set to "false". Am I missing something here?

LCaparelli · 2024-09-05T16:54:16Z

In an attempt to further clarify: we do start the runner with the --paused flag, so indeed it starts paused. The problem is that in the steps for unpausing, which is carried out by the starter job, there is nothing (I believe) checking .spec.paused, so it always forces the test to start.

yorugac · 2024-09-06T09:22:41Z

Hi @LCaparelli,

the starter job ALWAYS runs, and always unpauses the test.

Yes, right now starter always runs and unpauses the test, if there is anything to unpause in the first place. For simplicity's sake, let's disregard starter pod for now and focus only on the expected behaviour of the test run.

In case of .spec.paused: false, k6 command will not have --paused: the runners start as soon as they can.
In case of .spec.paused: true (default behaviour), k6 command will have --paused: the runners wait until all runners are up and ready to start.

But I have an impression that your use case might need something totally different. For example, you mentioned ArgoCD before: that implies some kind of manual control over test execution from the outside app. I.e. do you need to delay the execution of the test until some external action? If so, how should that look like from your perspective? Via modification of TestRun object ? (we have an issue for that actually)

LCaparelli · 2024-09-09T20:32:45Z

I.e. do you need to delay the execution of the test until some external action?

That's exactly it! I think my misunderstanding of this field is deeper than I thought. My understanding is that if spec.paused == "true" || spec.paused == null, no test runner will ever run any tests (if they haven't started yet).

In light of that, what I thought was a bugfix might actually be a feat request.

if so, how should that look like from your perspective? Via modification of TestRun object ?

Precisely, we would like to optionally start the test as paused, in which case spec.paused would need to be changed after creation.

(we have an #305 for that actually)

That's much more generic than we need for our use-case right now, but thanks for pointing it out! There seems to be a few can of worms there, as to "How to deal with a change in this field consistently? Should we even allow modifying this field? If so, when should we allow it to be modified?".

One of these problems also appear here: what happens if the user changes spec.paused after the test has begun? Which, IMO, we should just ignore or throw an error/warning "hey, you changed this field when the test had already started, and it had no effect".

At the risk of giving unwelcome advice, I think it might be a good idea to map what fields might support updating based on use-cases as they appear, such as this one. Trying to tackle update support for the entire API all at once sounds very complex, and concrete use-cases add much needed clarification to the decision-making.

Back to the matter at hand though, do you think this feat request would be a welcome one? Should I further illustrate our use-case to justify it? I'm also open to a quick, sync call or even just chatting on Slack to iron out the noise.

Thanks again for your time!

LCaparelli · 2024-09-09T20:39:18Z

Anecdotally, I was not the only one to assume spec.paused == "true" would prevent all runners from ever starting: https://community.grafana.com/t/v0-0-14-testrun-property-spec-paused-not-behaving-as-expected/124255

yorugac · 2024-09-10T11:14:31Z

@LCaparelli, indeed, this is a feature request 😄

do you think this feat request would be a welcome one?

I think it is a valid feature request, at this point of time. As you noted, this is not the first time we receive a similar request recently. OTOH, a move to allow edits of CRD is a big one for this project, and I don't think it'll be possible to add this in the near future. I.e. other issues have higher priority, IMHO.

Could you please adjust the description of the issue to reflect the core problem (starting test run by edit of CRD) better? Then we can just refer to this issue with our discussion as a "feature request" instead of writing a new one.

On implementation. Someone might be relying on current behaviour of .spec.paused: what you're suggesting would break their workflows in that case. I also think that "paused" itself is a bit ambiguous, esp. given its usage in k6: you can only set it as "true", not as "false". So I imagine we'll need to introduce a new field like .spec.suspend.

Another potential point of contention is adding mutability requirement in general. Can there be GitOps setups that would need some other type of control over TestRun start, which does not rely on mutability of the resource? I.e. ~ is this truly solving the right problem.

LCaparelli · 2024-09-25T20:05:33Z

Hey @yorugac, sorry for the long pause in discussion, PTO.

On implementation. Someone might be relying on current behaviour of .spec.paused: what you're suggesting would break their workflows in that case. I also think that "paused" itself is a bit ambiguous, esp. given its usage in k6: you can only set it as "true", not as "false". So I imagine we'll need to introduce a new field like .spec.suspend.

Agreed, and it nicely fits into the spec.suspend semantics of Job.

Which brings me to a new suggestion of implementation: just propagate spec.suspended from TestRun into runner and starter jobs, but not into initializer jobs: we're ok with everything being initialized, we just want to be able to resume/suspend the tests themselves.

Because we're delegating to Job, whenever the run gets suspended while it's already running, the starter and runner pods will be terminated (useful if you accidentally triggered costly tests).

Any thoughts on this approach?

Another potential point of contention is adding mutability requirement in general. Can there be GitOps setups that would need some other type of control over TestRun start, which does not rely on mutability of the resource? I.e. ~ is this truly solving the right problem.

This point isn't very clear to me. I couldn't think of an example setup where we can declarative configure whether a TestRun is suspended without mutating the resource the resource itself.

I think I have enough to start changing my PR, but this final point needs some elaboration for me to grok it, thanks!

yorugac · 2024-09-30T13:43:07Z

Hi @LCaparelli, NP, it'd be great if you're able to work on this!

Which brings me to a new suggestion of implementation: just propagate spec.suspended from TestRun into runner and starter jobs, but not into initializer jobs: we're ok with everything being initialized, we just want to be able to resume/suspend the tests themselves.

The Job's suspend does not even schedule pods until it changes. And it also allows Job controller to remove the pods mid-execution. This is definitely not the logic we want for TestRun: the .suspend here should allow to schedule pods (runners) and services but delay the starter pod until .suspend becomes false. And if someone changes .suspend after the test has started, I think it should be ignored by TestRun controller, at least in this first version.

Regarding accidental costly tests: it's possible to terminate them with the deletion of TestRun CR.

Another potential point of contention is adding mutability requirement in general. Can there be GitOps setups that would need some other type of control over TestRun start, which does not rely on mutability of the resource? I.e. ~ is this truly solving the right problem.

This point isn't very clear to me. I couldn't think of an example setup where we can declarative configure whether a TestRun is suspended without mutating the resource the resource itself.

I was basically thinking about setups where mutation of a resource is forbidden 😄 Then it is possible that the solution is not in suspend boolean field but in something like timeToStart or delay field with duration, which specifies when the test should be started. But this is a hypothesis at this point: we received no feedback like that, yet, so I guess we can safely assume that the main use cases would be covered with mutability and .suspend boolean field 👍

LCaparelli · 2024-10-04T13:30:20Z

I was basically thinking about setups where mutation of a resource is forbidden 😄 Then it is possible that the solution is not in suspend boolean field but in something like timeToStart or delay field with duration, which specifies when the test should be started. But this is a hypothesis at this point: we received no feedback like that, yet, so I guess we can safely assume that the main use cases would be covered with mutability and .suspend boolean field 👍

Funny how things happen. Folks at our org just requested this feature. They would like both:

setting a one-shot time where tests would start
setting a schedule for recurrent executions (Cronjob-ish)

I'll open a separate issue for that, but there are still folks who would rather rely on a more interactive approach, so it sounds like the .suspend field we've been discussing would still be useful.

yorugac · 2024-11-04T14:01:55Z

Hi @LCaparelli, apologies for the delay; I missed the last message in my notifications 😞

I'll open a separate issue for that, but there are still folks who would rather rely on a more interactive approach, so it sounds like the .suspend field we've been discussing would still be useful.

I agree with you: the feature request for delays should be a separate issue (btw, a "pure" CronJob-like scheduling should be possible with a CronJob, it's described here).

This issue is then about adding spec.suspend as the first mutable field in TestRun 👍
There is your open PR: not to put pressure on you, but would you like to continue working on that? 🙂

LCaparelli · 2024-11-04T14:27:57Z

Yes, I would! Chaotic few weeks, I'll adjust it soon. Thanks for checking in @yorugac

LCaparelli added the bug Something isn't working label Aug 29, 2024

LCaparelli linked a pull request Aug 29, 2024 that will close this issue

fix: don't run starter job if TestRun/K6 is paused #458

Open

yorugac added enhancement New feature or request and removed bug Something isn't working labels Sep 10, 2024

LCaparelli changed the title ~~spec.paused produces --paused flag on runner, but initializer always unpause it regardless~~ Allow resuming and suspending tests by changing a live TestRun resource Sep 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow resuming and suspending tests by changing a live `TestRun` resource #457

Allow resuming and suspending tests by changing a live `TestRun` resource #457

LCaparelli commented Aug 29, 2024 •

edited

Loading

Brief summary

Suggested implementation

k6-operator version or image

Helm chart version (if applicable)

TestRun / PrivateLoadZone YAML

Other environment details (if applicable)

Steps to reproduce the problem

Expected behaviour

Actual behaviour

LCaparelli commented Aug 29, 2024 •

edited

Loading

yorugac commented Sep 5, 2024 •

edited

Loading

yorugac commented Sep 5, 2024

LCaparelli commented Sep 5, 2024 •

edited

Loading

LCaparelli commented Sep 5, 2024

yorugac commented Sep 6, 2024

LCaparelli commented Sep 9, 2024 •

edited

Loading

LCaparelli commented Sep 9, 2024

yorugac commented Sep 10, 2024

LCaparelli commented Sep 25, 2024

yorugac commented Sep 30, 2024

LCaparelli commented Oct 4, 2024

yorugac commented Nov 4, 2024

LCaparelli commented Nov 4, 2024

Allow resuming and suspending tests by changing a live TestRun resource #457

Allow resuming and suspending tests by changing a live TestRun resource #457

Comments

LCaparelli commented Aug 29, 2024 • edited Loading

Brief summary

Suggested implementation

k6-operator version or image

Helm chart version (if applicable)

TestRun / PrivateLoadZone YAML

Other environment details (if applicable)

Steps to reproduce the problem

Expected behaviour

Actual behaviour

Feature Description

Suggested Solution (optional)

Already existing or connected issues / PRs (optional)

LCaparelli commented Aug 29, 2024 • edited Loading

yorugac commented Sep 5, 2024 • edited Loading

yorugac commented Sep 5, 2024

LCaparelli commented Sep 5, 2024 • edited Loading

LCaparelli commented Sep 5, 2024

yorugac commented Sep 6, 2024

LCaparelli commented Sep 9, 2024 • edited Loading

LCaparelli commented Sep 9, 2024

yorugac commented Sep 10, 2024

LCaparelli commented Sep 25, 2024

yorugac commented Sep 30, 2024

LCaparelli commented Oct 4, 2024

yorugac commented Nov 4, 2024

LCaparelli commented Nov 4, 2024

Allow resuming and suspending tests by changing a live `TestRun` resource #457

Allow resuming and suspending tests by changing a live `TestRun` resource #457

LCaparelli commented Aug 29, 2024 •

edited

Loading

LCaparelli commented Aug 29, 2024 •

edited

Loading

yorugac commented Sep 5, 2024 •

edited

Loading

LCaparelli commented Sep 5, 2024 •

edited

Loading

LCaparelli commented Sep 9, 2024 •

edited

Loading