apmsoak: do not stop on errors, continue until SIGINT #56

Merged
inge4pres merged 8 commits into elastic:main from apmsoak/run-until-shutdown on Mar 5, 2024

Conversation

inge4pres
Contributor

@inge4pres inge4pres commented Feb 15, 2024

Reason for this PR

When ingesting traffic, apmsoak stops running on the first error returned by the server.

Details

  • add tests for parsing the event rate
  • include a signal trap for SIGINT to shut down all workers (see the sketch below)

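A minimal sketch of the signal-trap approach (the worker loop is hypothetical and only illustrates the shutdown mechanism, not the PR's actual code):

package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	// Cancelling this context on SIGINT/SIGTERM tells every worker to stop.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-ctx.Done() // a real worker would generate load until this fires
			log.Printf("worker %d: shutting down", id)
		}(i)
	}
	wg.Wait()
}
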
Further work

We should work out with @elastic/observablt-robots whether this change impacts the quality gates.

@inge4pres inge4pres added the enhancement New feature or request label Feb 15, 2024
@inge4pres inge4pres self-assigned this Feb 15, 2024
@inge4pres inge4pres requested a review from a team as a code owner February 15, 2024 17:53
@inge4pres inge4pres requested a review from a team February 15, 2024 18:09
cmd/apmsoak/run.go (outdated review thread)
if s.burst > 0 {
s.sent = s.sent % s.burst
}
case err := <-sendErrs:
Member

Will a race happen and accidentally enter this branch when we call close(sendErrs)?

Contributor Author

No, because the channel is closed by the producer above on context cancellation, so at the same time the first case branch in this select is chosen. Also consider that select has had case ordering since 1.15 IIRC, which means it is more likely to enter the first branch if it finds both conditions can be satisfied.
I can write a test for it if you think it's something we want to ensure.

Member

select has case ordering since 1.15 IIRC

There is no case ordering. It is non-deterministic; see the playground: https://go.dev/play/p/Ic9i8MMKMQF
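
For reference, a minimal, self-contained sketch (not the linked playground snippet, just an equivalent demonstration) showing that when several cases are ready, select picks a branch uniformly at pseudo-random rather than by source order:

package main

import "fmt"

func main() {
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		a := make(chan struct{}, 1)
		b := make(chan struct{}, 1)
		a <- struct{}{}
		b <- struct{}{}
		// Both cases are ready on every iteration; the runtime chooses one
		// at random, so neither branch is preferred by its position.
		select {
		case <-a:
			counts["first"]++
		case <-b:
			counts["second"]++
		}
	}
	fmt.Println(counts) // roughly a 50/50 split
}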

Contributor Author

I am pretty sure there is a preference for the default case, and IIRC also when 2 select cases are ready simultaneously - it used to be more random, but they changed it some time ago.

I am not able to point at the source right now, but the spec says
"the channel and right-hand-side expressions of send statements are evaluated exactly once, in source order, upon entering the "select" statement.".
Anyway, we may be able to remove this block, see #56 (comment).
If we use ignoreErrors as originally planned, we can discard this.

Member

My understanding is that we can discard this as we are proceeding as mentioned in the linked comment?

Member

When ignoreErrors is true and a signal is received, the sendBatch goroutine at line 264 returns because of <-ctx.Done() and calls close(sendErrs) at line 265. Then in the main goroutine, both cases in the select are ready, and there's a chance that line 291 runs with err == nil. WDYT?

Member

^ more like a nit, since this code is not critical and only logs a non-error during shutdown

Contributor Author

when ignoreErrors is true

Yeah, I see I created confusion here, apologies - when IgnoreErrors is true, sendBatch never returns an error, so the if condition is redundant; see Transport.SendEvents.

the sendBatch goroutine in line 264 returns because of <-ctx.Done()

When the context is canceled, the error may end up in sendErrs, yes - err is not nil, it's context.Canceled - but the goroutine either takes the first branch of the select (and exits) or the message ends up in the channel, and then in the next iteration the channel is closed after return.
We may end up logging a spurious context canceled error, but note that at the same time the context is canceled, the outer goroutine with the select at L#286 also returns - so it's very unlikely we hit that log line.
That's also the reason why the channel has a buffer of 1.
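
To make the shape under discussion concrete, here is a minimal, self-contained sketch of the producer/consumer pattern described above (the sendBatch and sendErrs names follow the thread, but the details are illustrative and not the PR's exact code):

package main

import (
	"context"
	"errors"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// sendBatch stands in for the real send loop body (hypothetical).
func sendBatch(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(100 * time.Millisecond):
		return nil // pretend the batch was sent successfully
	}
}

func main() {
	// Cancel the context on SIGINT/SIGTERM so the producer shuts down.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	// Buffer of 1 so the producer can hand off its last error without blocking.
	sendErrs := make(chan error, 1)

	go func() {
		defer close(sendErrs)
		for {
			if err := sendBatch(ctx); err != nil {
				sendErrs <- err
				return
			}
		}
	}()

	for {
		select {
		case <-ctx.Done():
			log.Println("shutting down")
			return
		case err, ok := <-sendErrs:
			// A closed channel yields ok == false and a nil err; guard against
			// logging that zero value (or a spurious context.Canceled) on shutdown.
			if !ok || errors.Is(err, context.Canceled) {
				return
			}
			log.Printf("send error: %v", err)
		}
	}
}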

@adam-stokes

adam-stokes commented Feb 15, 2024

I misread this description; yes, this is the behavior @elastic/observablt-pf want to keep. Thanks

@inge4pres
Contributor Author

this is the behavior @elastic/observablt-pf want to keep,

You mean failing on the first error is the behavior you want to keep?
If yes, would it be OK for you to have the default behavior not stop on errors, and add a --fail-on-error flag that would fail on the first error?
Credit to @endorama for suggesting this ⬆️

@inge4pres
Contributor Author

inge4pres commented Feb 16, 2024

The PR has been updated with 2 commits, but I am not 100% sure we have clearly defined the scope of the changes.
The concept of ignoring errors was already defined in Transport.
I hooked up the apmsoak CLI flags to set this flag and basically propagate it down to here:

if !ignoreErrs {
switch res.StatusCode / 100 {
case 4:
return fmt.Errorf("unexpected client error: %d", res.StatusCode)
case 5:
return fmt.Errorf("unexpected server error: %d", res.StatusCode)
}
}

This does not mean, though, that the application will exit on the first error when --ignore-errors is false.

I would like to clarify with the reviewers if we need to keep the existing behavior of failing on the first HTTP error received from the server.
If yes, we need to adjust the code a bit more.

@endorama
Member

@inge4pres I think there are 2 sets of errors that we want to address:

  1. HTTP response errors to APM ingestion requests
  2. transient errors in communication between this tool and the ingest endpoint

To my understanding this PR addresses (1); I think it makes sense to expose it through a CLI flag (as --ignore-errors, or maybe more specifically --ignore-request-errors).

I think we also need to address (2), which is more detrimental to our testing (e.g. a transient client connection timeout) and which this PR does not address. I would take care to select a flag name that does not create confusion between suppressing/ignoring case (1) or (2).

Overall, for stress testing we need both: (1) because we want to overwhelm the system, so blocking on errors is not useful; (2) because transient network errors must not interrupt the entire benchmark and require manual intervention.

@inge4pres
Contributor Author

@endorama good points 👍🏼

Point 1 is addressed in the latest commit, with a new --ignore-errors flag added.
Point 2 is also what I worked on initially; it's implemented in 94eaf26 by logging transient errors (such as context canceled) instead of returning them.

If you think both points are good to have in this PR, I'll leave it as is.

@endorama
Member

@inge4pres I think it's ok to have both in this PR.

Member

@endorama endorama left a comment

The changes look correct to me; the different behavior also aligns with the expectations of this tool's users. 👍 from me

Contributor

@amannocci amannocci left a comment

If I understand correctly, the behavior doesn't change if we don't pass the --ignore-errors option?

@inge4pres
Contributor Author

If I understand correctly, the behavior doesn't change if we don't pass the --ignore-errors option?

No, this is not correct: this PR does change the default behavior.
With this PR we never stop on a server error and always wait for SIGINT or SIGTERM to shut down the load generation.

Let me know if this is a problem, and we can discuss if we should have a flag to restore the original behavior.

@amannocci
Contributor

amannocci commented Feb 20, 2024

If I understand correctly, the behavior doesn't change if we don't pass the --ignore-errors option?

No, this is not correct: this PR does change the default behavior. With this PR we never stop on a server error and always wait for SIGINT or SIGTERM to shut down the load generation.

Let me know if this is a problem, and we can discuss if we should have a flag to restore the original behavior.

The apmsoak tool is used to validate a service deployment in an environment before promotion.
Currently, we can detect if a failure occurs in the loop and fail the promotion.
If I understand correctly, this change breaks the failure detection.
Is there another way to detect whether an environment is unstable?

@endorama
Member

I'm sorry, I approved this under the understanding that the default behavior was staying the same and we were adding a flag to change it when needed, but that's not the case.

One thing I'd like to point out, with respect to points (1) and (2) I made above:

  • changing behavior of (1) means changing how the tests run by the robots team work; I think we need to maintain the default behavior and put our desired behavior under a flag (so we can still use apmsoak to overwhelm an ingestion endpoint)
  • changing behavior of (2) also helps the tests run by the robots team, as it safeguards against transient issues not directly relevant to the test, since those errors do not come from the ingestion

@inge4pres
Contributor Author

The apmsoak tool is used to validate a service deployment in an environment before promotion.
Currently, we can detect if a failure occurs in the loop and fail the promotion.

Aren't we basing the promotion quality gate on SLOs, rather than success/failure of apmsoak?
AFAIU, a reason for a flaky quality gate is that on the first non-200 HTTP response, apmsoak would quit and thus the QG would fail.
This flakiness can be addressed with this PR through the addition of the --ignore-errors flag, which allows apmsoak to keep running even in case of transient failures.

@amannocci do you think we should then add --ignore-errors to QG buildkite jobs to prevent the flakiness?

The other change, related to logging the errors instead of returning them, becomes redundant I think, and we can reserve it for a future PR - I would still keep the graceful shutdown implementation, but would revert the error handling in SendBatchesInLoop to what it used to be.

I think the addition of the --ignore-errors flag covers both points, functionally speaking.

@amannocci
Contributor

The apmsoak tool is used to validate a service deployment in an environment before promotion.
Currently, we can detect if a failure occurs in the loop and fail the promotion.

Aren't we basing the promotion quality gate on SLOs, rather than success/failure of apmsoak? AFAIU, a reason for a flaky quality gate is that on the first non-200 HTTP response, apmsoak would quit and thus the QG would fail. This flakiness can be addressed with this PR through the addition of the --ignore-errors flag, which allows apmsoak to keep running even in case of transient failures.

@amannocci do you think we should then add --ignore-errors to QG buildkite jobs to prevent the flakiness?

The other change, related to logging the errors instead of returning them, becomes redundant I think, and we can reserve it for a future PR - I would still keep the graceful shutdown implementation, but would revert the error handling in SendBatchesInLoop to what it used to be.

I think the addition of the --ignore-errors flag covers both points, functionally speaking.

The flakiness was due to transient connection timeouts on the client side. In that case, it is probably worth retrying or ignoring those errors.
However, I don't expect many errors on the server side while using apmsoak.
For instance, I don't expect 4xx errors to occur.
The 5xx errors will be caught by the QG after the run.

@simitt

simitt commented Feb 21, 2024

Aren't we basing the promotion quality gate on SLOs, rather than success/failure of apmsoak?

This is also my understanding. It would be concerning if that is not the case.

However, I don't expect many errors on the server side while using apmsoak.
For instance, I don't expect 4xx errors to occur.
The 5xx errors will be caught by the QG after the run.

Let's not build anything based on assumptions about errors; this would probably bite us in the future.
+1 on having a simple solution here, which IMO is achieved with the additional config option.

@v1v
Member

v1v commented Feb 21, 2024

This is also my understanding. It would be concerning if that is not the case.

I'd like to step back and review how things work at a higher level and the expectations; please bear with me 👼

The APM Managed Gatekeeper tool was built to handle preparing the context in which apmsoak runs.

Any errors related to preparing the context should not fail the promotion but should be self-healed and retried.

The context preparation consists of:

  1. Use a serverless project or create one.
  2. Use a K8s cluster
  3. Deploy the apmsoak that targets the above environments (k8s cluster and serverless project)

Afterwards, the apmsoak runs and gets stopped after a specific execution time (1 hour or 8 hours).

@amannocci and anyone else, please feel free to correct me if I said something not accurate enough

Then I've got a few questions about what to do if apmsoak stalls and doesn't produce any load.

  • Shall we assume that the monitoring for the apmsoak is something we don't need to care about in the APM Managed Gatekeeper?
  • If so, will the SLO machinery detect that particular case?
  • Otherwise, shall we report the failure and let the promotion do the rest (error and reporting)?

Let's not build anything based on assumptions about errors, this probably would bite us in the future.

For clarity and IIUC, those errors are related to the context preparation and the reliability of where the apmsoak load runs, rather than the apmsoak execution itself.

Again @amannocci, please correct me if I missed something else.


@amannocci
Contributor

Then I've got a few questions about what to do if apmsoak stalls and doesn't produce any load.
Shall we assume that the monitoring for the apmsoak is something we don't need to care about in the APM Managed Gatekeeper?

Since we are running the apmsoak tool, we probably need to take care of it within gatekeeper.
That's currently the case, and we fail if something goes wrong.

If so, will the SLO machinery detect that particular case?

If we aren't able to produce a sustained workload, then it depends on the case.
If apmsoak times out after 10m and stops working properly, the quality gate will pass because we don't have any errors on the server side.
It's probably the same if the proxy in front of the ingest service fails.

So, letting the apmsoak tool fail in a scenario where it can't recover is a good thing.
However, I'm not sure we want to ignore all errors while interacting with an endpoint.

@inge4pres
Contributor Author

Thanks for your inputs folks 🙏🏼
Given the importance apmsoak has in the ecosystem, I think it's sensible to proceed by not changing the existing behavior, which is used today in production by multiple entities.

What I will do with this PR is adapt it to the load-testing needs we have, by adding the new behavior behind flags that will have to be explicitly set to affect how apmsoak runs.

One flag is already added, --ignore-errors; it defaults to false and can be used to discard non-200 HTTP errors returned by ES (we're going to make use of this in our load tests).
The second flag, yet to be added, will be named --force-shutdown. Set to false by default, it will allow the process to keep running even when connection errors are returned by the ES client, stopping the load generation only when the process receives SIGINT/SIGTERM signals.

In this way, the teams consuming apmsoak functionalities today will remain unaffected, and we'll be able to improve our load testing scenarios with the added behaviors.

Member

@endorama endorama left a comment

Thank you, looks good to me!

I'm approving to speed things up but I'd have a nitpick: can we call it wait or forever instead of force-shutdown? The name sounds confusing to me, while the purpose is clear: do not terminate execution.

@inge4pres
Contributor Author

can we call it wait or forever instead of force-shutdown?

Can do that! Like --run-forever?

cmd/apmsoak/run.go (outdated review thread)
@@ -82,6 +87,8 @@ func NewCmdRun() *cobra.Command {
cmd.Flags().StringVar(&options.APIKeys, "api-keys", "", "API keys for managed service. Specify key value pairs as `project_id_1:my_api_key,project_id_2:my_key`")
cmd.Flags().BoolVar(&options.BypassProxy, "bypass-proxy", false, "Detach from proxy dependency and provide projectID via header. Useful when testing locally")
cmd.Flags().StringVar(&options.Loglevel, "log-level", "info", "Specify the log level to use when running this command. Supported values: debug, info, warn, error")
cmd.Flags().BoolVar(&options.IgnoreErrors, "ignore-errors", false, "Do not report as a failure HTTP responses with status code different than 200")
cmd.Flags().BoolVar(&options.ForceShutdown, "force-shutdown", false, "Continue running the soak test until a signal is received to stop it")
Member

nit: the force-shutdown naming isn't intuitive. I'd expect "force shutdown" to do exactly the opposite of what it actually does. So this is the same comment as Edo's: can we have something like "no-exit-on-error"?

Member

"run-forever" is quite good

Member

@carsonip carsonip left a comment

lgtm

@inge4pres inge4pres merged commit 4c43a86 into elastic:main Mar 5, 2024
3 checks passed
@inge4pres inge4pres deleted the apmsoak/run-until-shutdown branch March 5, 2024 18:13