apmsoak: do not stop on errors, continue until SIGINT #56

Merged
inge4pres merged 8 commits into elastic:main from apmsoak/run-until-shutdown on Mar 5, 2024

Conversation

inge4pres
Contributor

@inge4pres inge4pres commented Feb 15, 2024

Reason for this PR

When ingesting traffic, apmsoak stops running on the first error returned by the server.

Details

  • add tests for parsing the event rate
  • include a signal trap for SIGINT to shut down all workers (see the sketch below)

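A minimal sketch of the signal-trap approach (the worker loop is hypothetical and only illustrates the shutdown mechanism, not the PR's actual code):

package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	// Cancelling this context on SIGINT/SIGTERM tells every worker to stop.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-ctx.Done() // a real worker would generate load until this fires
			log.Printf("worker %d: shutting down", id)
		}(i)
	}
	wg.Wait()
}
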
Further work

We should work out with @elastic/observablt-robots whether this change impacts the quality gates.

@inge4pres inge4pres added the enhancement New feature or request label Feb 15, 2024
@inge4pres inge4pres self-assigned this Feb 15, 2024
@inge4pres inge4pres requested a review from a team as a code owner February 15, 2024 17:53
@inge4pres inge4pres requested a review from a team February 15, 2024 18:09
cmd/apmsoak/run.go (outdated review thread)
if s.burst > 0 {
s.sent = s.sent % s.burst
}
case err := <-sendErrs:
Member

Will a race happen and accidentally enter this branch when we call close(sendErrs)?

Contributor Author

No, because the channel is closed by the producer above on context cancellation, so at the same time the first case branch in this select is chosen. Also consider that select has had case ordering since 1.15 IIRC, which means it is more likely to enter the first branch if it finds both conditions can be satisfied.
I can write a test for it if you think it's something we want to ensure.

Member

select has case ordering since 1.15 IIRC

There is no case ordering. It is non-deterministic; see the playground: https://go.dev/play/p/Ic9i8MMKMQF
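
For reference, a minimal, self-contained sketch (not the linked playground snippet, just an equivalent demonstration) showing that when several cases are ready, select picks a branch uniformly at pseudo-random rather than by source order:

package main

import "fmt"

func main() {
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		a := make(chan struct{}, 1)
		b := make(chan struct{}, 1)
		a <- struct{}{}
		b <- struct{}{}
		// Both cases are ready on every iteration; the runtime chooses one
		// at random, so neither branch is preferred by its position.
		select {
		case <-a:
			counts["first"]++
		case <-b:
			counts["second"]++
		}
	}
	fmt.Println(counts) // roughly a 50/50 split
}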

Contributor Author

I am pretty sure there is a preference for the default case, and IIRC also when 2 select cases are ready simultaneously - it used to be more random, but they changed it some time ago.

I am not able to point at the source right now, but the spec says
"the channel and right-hand-side expressions of send statements are evaluated exactly once, in source order, upon entering the "select" statement.".
Anyway, we may be able to remove this block, see #56 (comment).
If we use ignoreErrors as originally planned, we can discard this.

Member

My understanding is that we can discard this as we are proceeding as mentioned in the linked comment?

Member

When ignoreErrors is true and a signal is received, the sendBatch goroutine at line 264 returns because of <-ctx.Done() and calls close(sendErrs) at line 265. Then in the main goroutine, both cases in the select are ready, and there's a chance that line 291 runs with err == nil. WDYT?

Member

^ more like a nit, since this code is not critical and only logs a non-error during shutdown

Contributor Author

when ignoreErrors is true

Yeah, I see I created confusion here, apologies - when IgnoreErrors is true, sendBatch never returns an error, so the if condition is redundant; see Transport.SendEvents.

the sendBatch goroutine in line 264 returns because of <-ctx.Done()

When the context is canceled, the error may end up in sendErrs, yes - err is not nil, it's context.Canceled - but the goroutine either takes the first branch of the select (and exits) or the message ends up in the channel, and then in the next iteration the channel is closed after return.
We may end up logging a spurious context canceled error, but note that at the same time the context is canceled, the outer goroutine with the select at L#286 also returns - so it's very unlikely we hit that log line.
That's also the reason why the channel has a buffer of 1.
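
To make the shape under discussion concrete, here is a minimal, self-contained sketch of the producer/consumer pattern described above (the sendBatch and sendErrs names follow the thread, but the details are illustrative and not the PR's exact code):

package main

import (
	"context"
	"errors"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// sendBatch stands in for the real send loop body (hypothetical).
func sendBatch(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(100 * time.Millisecond):
		return nil // pretend the batch was sent successfully
	}
}

func main() {
	// Cancel the context on SIGINT/SIGTERM so the producer shuts down.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	// Buffer of 1 so the producer can hand off its last error without blocking.
	sendErrs := make(chan error, 1)

	go func() {
		defer close(sendErrs)
		for {
			if err := sendBatch(ctx); err != nil {
				sendErrs <- err
				return
			}
		}
	}()

	for {
		select {
		case <-ctx.Done():
			log.Println("shutting down")
			return
		case err, ok := <-sendErrs:
			// A closed channel yields ok == false and a nil err; guard against
			// logging that zero value (or a spurious context.Canceled) on shutdown.
			if !ok || errors.Is(err, context.Canceled) {
				return
			}
			log.Printf("send error: %v", err)
		}
	}
}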

@adam-stokes

adam-stokes commented Feb 15, 2024

I misread this description; yes, this is the behavior @elastic/observablt-pf want to keep. Thanks

@inge4pres
Contributor Author

this is the behavior @elastic/observablt-pf want to keep,

You mean failing on the first error is the behavior you want to keep?
If yes, would it be OK for you to have the default behavior not stop on errors, and add a --fail-on-error flag that would fail on the first error?
Credit to @endorama for suggesting this ⬆️

@inge4pres
Contributor Author

inge4pres commented Feb 16, 2024

The PR has been updated with 2 commits, but I am not 100% sure we have clearly defined the scope of the changes.
The concept of ignoring errors was already defined in Transport.
I hooked up the apmsoak CLI flags to set this flag and basically propagate it down to here:

if !ignoreErrs {
switch res.StatusCode / 100 {
case 4:
return fmt.Errorf("unexpected client error: %d", res.StatusCode)
case 5:
return fmt.Errorf("unexpected server error: %d", res.StatusCode)
}
}

This does not mean, though, that the application will exit on the first error when --ignore-errors is false.

I would like to clarify with the reviewers if we need to keep the existing behavior of failing on the first HTTP error received from the server.
If yes, we need to adjust the code a bit more.

@endorama
Member

@inge4pres I think there are 2 sets of errors that we want to address:

  1. HTTP response errors to APM ingestion requests
  2. transient errors in communication between this tool and the ingest endpoint

To my understanding this PR addresses (1); I think it makes sense to expose it through a CLI flag (as --ignore-errors, or maybe more specifically --ignore-request-errors).

I think we also need to address (2), which is more detrimental to our testing (e.g. a transient client connection timeout) and which this PR does not address. I would take care to select a flag name that does not create confusion between suppressing/ignoring case (1) or (2).

Overall, for stress testing we need both: (1) because we want to overwhelm the system, so blocking on errors is not useful; (2) because transient network errors must not interrupt the entire benchmark and require manual intervention.

@inge4pres
Contributor Author

@endorama good points 👍🏼

Point 1 is addressed in the latest commit, with a new --ignore-errors flag added.
Point 2 is also what I worked on initially; it's implemented in 94eaf26 by logging transient errors (such as context canceled) instead of returning them.

If you think both points are good to have in this PR, I'll leave it as is.

@endorama
Member

@inge4pres I think it's ok to have both in this PR.

Member

@endorama endorama left a comment

The changes look correct to me; the different behavior also aligns with the expectations of this tool's users. 👍 from me

Contributor

@amannocci amannocci left a comment

If I understand correctly, the behavior doesn't change if we don't pass the --ignore-errors option?

@inge4pres
Contributor Author

If I understand correctly, the behavior doesn't change if we don't pass the --ignore-errors option?

No, this is not correct: this PR does change the default behavior.
With this PR we never stop on a server error and always wait for SIGINT or SIGTERM to shut down the load generation.

Let me know if this is a problem, and we can discuss if we should have a flag to restore the original behavior.

@amannocci
Contributor

amannocci commented Feb 20, 2024

If I understand correctly, the behavior doesn't change if we don't pass the --ignore-errors option?

No, this is not correct: this PR does change the default behavior. With this PR we never stop on a server error and always wait for SIGINT or SIGTERM to shut down the load generation.

Let me know if this is a problem, and we can discuss if we should have a flag to restore the original behavior.

The apmsoak tool is used to validate a service deployment in an environment before promotion.
Currently, we can detect if a failure occurs in the loop and fail the promotion.
If I understand correctly, this change breaks the failure detection.
Is there another way to detect whether an environment is unstable?

@endorama
Member

I'm sorry, I approved this under the understanding that the default behavior was staying the same and we were adding a flag to change it when needed, but that's not the case.

One thing I'd like to point out, with respect to points (1) and (2) I made above:

  • changing behavior of (1) means changing how the tests run by the robots team work; I think we need to maintain the default behavior and put our desired behavior under a flag (so we can still use apmsoak to overwhelm an ingestion endpoint)
  • changing behavior of (2) also helps the tests run by the robots team, as it safeguards against transient issues not directly relevant to the test, since those errors do not come from the ingestion

@inge4pres
Contributor Author

The apmsoak tool is used to validate a service deployment in an environment before promotion.
Currently, we can detect if a failure occurs in the loop and fail the promotion.

Aren't we basing the promotion quality gate on SLOs, rather than success/failure of apmsoak?
AFAIU, a reason for a flaky quality gate is that on the first non-200 HTTP response, apmsoak would quit and thus the QG would fail.
This flakiness can be addressed with this PR through the addition of the --ignore-errors flag, which allows apmsoak to keep running even in case of transient failures.

@amannocci do you think we should then add --ignore-errors to QG buildkite jobs to prevent the flakiness?

The other change, related to logging the errors instead of returning them, becomes redundant I think, and we can reserve it for a future PR - I would still keep the graceful shutdown implementation, but would revert the error handling in SendBatchesInLoop to what it used to be.

I think the addition of the --ignore-errors flag covers both points, functionally speaking.

@amannocci
Contributor

The apmsoak tool is used to validate a service deployment in an environment before promotion.
Currently, we can detect if a failure occurs in the loop and fail the promotion.

Aren't we basing the promotion quality gate on SLOs, rather than success/failure of apmsoak? AFAIU, a reason for a flaky quality gate is that on the first non-200 HTTP response, apmsoak would quit and thus the QG would fail. This flakiness can be addressed with this PR through the addition of the --ignore-errors flag, which allows apmsoak to keep running even in case of transient failures.

@amannocci do you think we should then add --ignore-errors to QG buildkite jobs to prevent the flakiness?

The other change, related to logging the errors instead of returning them, becomes redundant I think, and we can reserve it for a future PR - I would still keep the graceful shutdown implementation, but would revert the error handling in SendBatchesInLoop to what it used to be.

I think the addition of the --ignore-errors flag covers both points, functionally speaking.

The flakiness was due to transient connection timeouts on the client side. In that case, it is probably worth retrying or ignoring those errors.
However, I don't expect many errors on the server side while using apmsoak.
For instance, I don't expect 4xx errors to occur.
The 5xx errors will be caught by the QG after the run.

@simitt

simitt commented Feb 21, 2024

Aren't we basing the promotion quality gate on SLOs, rather than success/failure of apmsoak?

This is also my understanding. It would be concerning if that is not the case.

However, I don't expect many errors on the server side while using apmsoak.
For instance, I don't expect 4xx errors to occur.
The 5xx errors will be caught by the QG after the run.

Let's not build anything based on assumptions about errors; this would probably bite us in the future.
+1 on having a simple solution here, which IMO is achieved with the additional config option.

@v1v
Member

v1v commented Feb 21, 2024

This is also my understanding. It would be concerning if that is not the case.

I'd like to step back and review how things work at a higher level and the expectations; please bear with me 👼

The APM Managed Gatekeeper tool was built to handle preparing the context in which apmsoak runs.

Any errors related to preparing the context should not fail the promotion but should be self-healed and retried.

The context preparation consists of:

  1. Use a serverless project or create one.
  2. Use a K8s cluster
  3. Deploy the apmsoak that targets the above environments (k8s cluster and serverless project)

Afterwards, the apmsoak runs and gets stopped after a specific execution time (1 hour or 8 hours).

@amannocci and anyone else, please feel free to correct me if I said something not accurate enough

Then I've got a few questions about what to do if apmsoak stalls and doesn't produce any load.

  • Shall we assume that the monitoring for the apmsoak is something we don't need to care about in the APM Managed Gatekeeper?
  • If so, will the SLO machinery detect that particular case?
  • Otherwise, shall we report the failure and let the promotion do the rest (error and reporting)?

Let's not build anything based on assumptions about errors, this probably would bite us in the future.

For clarity and IIUC, those errors are related to the context preparation and the reliability of where the apmsoak load runs, rather than the apmsoak execution itself.

Again @amannocci, please correct me if I missed something else.


@amannocci
Contributor

Then I've got a few questions about what to do if apmsoak stalls and doesn't produce any load.
Shall we assume that the monitoring for the apmsoak is something we don't need to care about in the APM Managed Gatekeeper?

Since we are running the apmsoak tool, we probably need to take care of it within gatekeeper.
That's currently the case, and we fail if something goes wrong.

If so, will the SLO machinery detect that particular case?

If we aren't able to produce a sustained workload, then it depends on the case.
If apmsoak times out after 10m and stops working properly, the quality gate will pass because we don't have any errors on the server side.
It's probably the same if the proxy in front of the ingest service fails.

So, letting the apmsoak tool fail in a scenario where it can't recover is a good thing.
However, I'm not sure we want to ignore all errors while interacting with an endpoint.

@inge4pres
Contributor Author

Thanks for your inputs folks 🙏🏼
Given the importance apmsoak has in the ecosystem, I think it's sensible to proceed by not changing the existing behavior, which is used today in production by multiple entities.

What I will do with this PR is adapt it to the load-testing needs we have, by adding the new behavior behind flags that will have to be explicitly set to affect how apmsoak runs.

One flag is already added, --ignore-errors; it defaults to false and can be used to discard non-200 HTTP errors returned by ES (we're going to make use of this in our load tests).
The second flag, yet to be added, will be named --force-shutdown. Set to false by default, it will allow the process to keep running even when connection errors are returned by the ES client, stopping the load generation only when the process receives SIGINT/SIGTERM signals.

In this way, the teams consuming apmsoak functionalities today will remain unaffected, and we'll be able to improve our load testing scenarios with the added behaviors.

Member

@endorama endorama left a comment

Thank you, looks good to me!

I'm approving to speed things up but I'd have a nitpick: can we call it wait or forever instead of force-shutdown? The name sounds confusing to me, while the purpose is clear: do not terminate execution.

@inge4pres
Contributor Author

can we call it wait or forever instead of force-shutdown?

Can do that! Like --run-forever?

cmd/apmsoak/run.go (outdated review thread)
@@ -82,6 +87,8 @@ func NewCmdRun() *cobra.Command {
cmd.Flags().StringVar(&options.APIKeys, "api-keys", "", "API keys for managed service. Specify key value pairs as `project_id_1:my_api_key,project_id_2:my_key`")
cmd.Flags().BoolVar(&options.BypassProxy, "bypass-proxy", false, "Detach from proxy dependency and provide projectID via header. Useful when testing locally")
cmd.Flags().StringVar(&options.Loglevel, "log-level", "info", "Specify the log level to use when running this command. Supported values: debug, info, warn, error")
cmd.Flags().BoolVar(&options.IgnoreErrors, "ignore-errors", false, "Do not report as a failure HTTP responses with status code different than 200")
cmd.Flags().BoolVar(&options.ForceShutdown, "force-shutdown", false, "Continue running the soak test until a signal is received to stop it")
Member

nit: the force-shutdown naming isn't intuitive. I'd expect "force shutdown" to do exactly the opposite of what it actually does. So this is the same comment as Edo's: can we have something like "no-exit-on-error"?

Member

"run-forever" is quite good

Member

@carsonip carsonip left a comment

lgtm

@inge4pres inge4pres merged commit 4c43a86 into elastic:main Mar 5, 2024
3 checks passed
@inge4pres inge4pres deleted the apmsoak/run-until-shutdown branch March 5, 2024 18:13