apmsoak: do not stop on errors, continue until SIGINT #56
Conversation
if s.burst > 0 {
	s.sent = s.sent % s.burst
}
case err := <-sendErrs:
Will a race happen, and could we accidentally enter this branch, when we call close(sendErrs)?
No, because the channel is closed by the producer above on context cancellation, so at the same time the first case branch in this select is chosen. Also consider that select has had case ordering since 1.15 IIRC, meaning it is more likely to enter the first branch if it finds both conditions can be satisfied.
I can write a test for it if you think it's something we want to ensure.
select has case ordering since 1.15 IIRC
There is no case ordering. It is non-deterministic; see this playground: https://go.dev/play/p/Ic9i8MMKMQF
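To illustrate: when several cases are ready, the spec mandates a uniform pseudo-random choice among them. A minimal, self-contained sketch (the names here are mine, not from this PR) that counts how often each of two always-ready cases wins:

```go
package main

import "fmt"

// pickCounts runs a select n times over two channels that are both
// always ready, counting how often each case is chosen. Per the Go
// spec, the winning case is picked via uniform pseudo-random
// selection, so both counters end up well above zero.
func pickCounts(n int) (a, b int) {
	ch1 := make(chan int, 1)
	ch2 := make(chan int, 1)
	for i := 0; i < n; i++ {
		ch1 <- 1
		ch2 <- 1
		select {
		case <-ch1:
			a++
			<-ch2 // drain the other channel before the next iteration
		case <-ch2:
			b++
			<-ch1
		}
	}
	return a, b
}

func main() {
	a, b := pickCounts(10000)
	fmt.Printf("case1: %d, case2: %d\n", a, b)
}
```

Running it shows both counters in the same ballpark, never one of them at zero.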
I am pretty sure there is a preference on the default case, and IIRC also when 2 select cases are ready simultaneously - it used to be more random but they changed it some time ago. I am not able to point at the source right now, but the spec says: "the channel and right-hand-side expressions of send statements are evaluated exactly once, in source order, upon entering the "select" statement."
Anyways, we may be able to remove this block, see #56 (comment). If we use ignoreErrors as originally planned, we can discard this.
My understanding is that we can discard this as we are proceeding as mentioned in the linked comment?
When ignoreErrors is true and a signal is received, the sendBatch goroutine at line 264 returns because of <-ctx.Done() and calls close(sendErrs) at line 265. Then, in the main goroutine, both cases in the select are ready, and there's a chance that line 291 runs with err == nil. WDYT?
^ more like a nit, since this code is not critical and only logs a non-error during shutdown
when ignoreErrors is true
Yeah, I see I created confusion here, apologies - when IgnoreErrors is true, sendBatch never returns an error, so the if condition is redundant, see Transport.SendEvents.
the sendBatch goroutine in line 264 returns because of <-ctx.Done()
When the context is canceled, the error may end up in sendErrs, yes - err is not nil, it's context.Canceled - but the goroutine either takes the first branch of the select (and exits) or the message ends up in the channel, and then in the next iteration the channel is closed after return.
We may end up logging a spurious context canceled error, but note that at the same time the context is canceled, the outer goroutine with the select at L#286 also returns - so it's very unlikely we hit that log line.
That's also the reason why the channel has a buffer of 1.
Misread this description - yes, this is the behavior @elastic/observablt-pf want to keep. Thanks!
You mean failing on the first error is the behavior you want to keep?
The PR has been updated with 2 commits, but I am not 100% sure we have clearly defined the scope of the changes. apm-perf/internal/loadgen/eventhandler/apm.go Lines 69 to 76 in ef2ed71
This does not mean though that when … I would like to clarify with the reviewers if we need to keep the existing behavior of failing on the first HTTP error received from the server.
@inge4pres I think there are 2 sets of errors that we want to address:
To my understanding this PR addresses (1). I think it makes sense to expose it through a CLI flag, as I think we also need to address (2), which is more detrimental to our testing (e.g. a transient client connection timeout) and which this PR does not address. I would take care of selecting a flag name that does not create confusion between suppressing/ignoring case (1) or (2). Overall, for stress testing we need both: (1) because we want to overwhelm the system, so blocking on errors is not useful; (2) because transient network errors must not interrupt the entire benchmark and require manual intervention.
@endorama good points 👍🏼 Point 1 is added in the latest commit, with a new flag. If you think both points are good to have in this PR, I'll leave it as is.
@inge4pres I think it's ok to have both in this PR.
The changes look correct to me; the different behavior also aligns with the expectations of this tool's users. 👍 from me
If I understand correctly, the behavior doesn't change if we don't pass the --ignore-errors option?
No, this is not correct: this PR does change the default behavior. Let me know if this is a problem, and we can discuss whether we should have a flag to restore the original behavior.
The apmsoak tool is used to validate a service deployment in an environment before promotion.
I'm sorry, I approved this under the understanding that the default behavior was staying the same and we were adding a flag to change it when needed, but that's not the case. One thing I'd like to point out with respect to points (1) and (2) I made above:
Aren't we basing the promotion quality gate on SLOs, rather than the success/failure of apmsoak? @amannocci, do you think we should then add …? The other change, related to logging the errors instead of returning them, becomes redundant I think, and we can reserve it for a future PR - I would still keep the graceful shutdown implementation, but would revert the error handling in … I think the addition of the …
The flakiness was due to a transient connection timeout on the client side. In that case, it is probably worth retrying or ignoring those errors.
This is also my understanding. It would be concerning if that were not the case.
Let's not build anything based on assumptions about errors; this would probably bite us in the future.
I'd like to step back and review how things work at a higher level, and the expectations; please bear with me 👼 The APM Managed Gatekeeper tool was built to accommodate the context preparation of where to run the apmsoak load. Any errors related to preparing the context should not fail the promotion, but should be self-healing and retried. The context preparation consists of:
Afterwards, the … @amannocci and anyone else, please feel free to correct me if I said something not accurate enough. Then I've got a few questions about what to do if the …
For clarity, and IIUC, those errors are related to the context preparation and reliability of where to run the apmsoak load, rather than the apmsoak execution itself. Again @amannocci, please correct me if I missed something else.
Since we are running the …
If we aren't able to produce a sustained workload, then it depends on the case. So letting the apmsoak tool fail in a scenario where it can't recover is a good thing.
Thanks for your inputs folks 🙏🏼 What I will do with this PR is adapt it to the load testing needs we have, by adding the new behavior behind flags that will have to be explicitly set to affect how apmsoak behaves. One flag is already added. In this way, the teams consuming …
Signed-off-by: inge4pres <[email protected]>
Thank you, looks good to me!
I'm approving to speed things up, but I'd have a nitpick: can we call it wait or forever instead of force-shutdown? The name sounds confusing to me, while the purpose is clear: do not terminate execution.
Can do that! Like …
cmd/apmsoak/run.go
Outdated
@@ -82,6 +87,8 @@ func NewCmdRun() *cobra.Command {
	cmd.Flags().StringVar(&options.APIKeys, "api-keys", "", "API keys for managed service. Specify key value pairs as `project_id_1:my_api_key,project_id_2:my_key`")
	cmd.Flags().BoolVar(&options.BypassProxy, "bypass-proxy", false, "Detach from proxy dependency and provide projectID via header. Useful when testing locally")
	cmd.Flags().StringVar(&options.Loglevel, "log-level", "info", "Specify the log level to use when running this command. Supported values: debug, info, warn, error")
	cmd.Flags().BoolVar(&options.IgnoreErrors, "ignore-errors", false, "Do not report as a failure HTTP responses with status code different than 200")
	cmd.Flags().BoolVar(&options.ForceShutdown, "force-shutdown", false, "Continue running the soak test until a signal is received to stop it")
nit: the force-shutdown naming isn't intuitive. I'd expect "force shutdown" to do exactly the opposite of what it actually does. So this is the same comment as Edo's: can we have something like "no-exit-on-error"?
"run-forever" is quite good
Co-authored-by: Carson Ip <[email protected]>
Signed-off-by: inge4pres <[email protected]>
lgtm
Reason for this PR
When ingesting traffic, apmsoak will stop running on the first error returned by the server.
Details
Further work
We should understand with @elastic/observablt-robots if there's an impact to quality gates with this change.