Load Tests Against Azure Environment #1122

Open
14 of 22 tasks
JohnNKing opened this issue May 30, 2024 · 13 comments

Labels: devex/opex (A development excellence or operational excellence backlog item.), Stream 1

@JohnNKing (Contributor) commented May 30, 2024

DevEx/OpEx

Currently, load tests are only run locally. Obtain more realistic data on TI service performance by running the tests within Azure.

Additional context from the eng-tasks channel:
Set up load tests to run periodically in Azure, possibly in the Flexion environment first for cost-estimation purposes. Eventually, have a plan to move them where needed.

Tasks

  • Sign Flexion and CDC up for the Locust preview program.
  • Implement load test in Internal
    • Get a basic load test working (see the Locust sketch after this list).
    • Successfully authenticate against the token endpoint.
    • Load test the other endpoints.
    • Make sure the load test continues to work for longer durations.
    • Ensure the load test continues to work locally.
  • Set up a GH Action to run tests on a schedule. - @jherrflexion
  • Update the existing ADR and/or create a new one. - @jherrflexion
  • Update the readme with how to run and modify load tests in deployed environments.
  • Update the readme about setting the per-thread RS implementation when a new API call is created that calls RS.
  • Configure test criteria (failure thresholds) in Azure (could be different from the existing Locust file's thresholds). - @hal
  • Add some additional comments to the application context and places where we use .getImplementation instead of @Inject. - @halprin
  • Figure out why the metadata endpoint is failing in the load test.
  • Look into why failing to link metadata results in 5XXs. This was caused by a manual column rename just in the Internal env as part of another card, and the change had not yet been undone. Reverting the change fixed the 5xx errors.
  • Optional: look into why our deployed environments are slow for the orders and results endpoint (is it really due to RS staging being slow?). We removed the RS URL env var in Internal (causing it to use a mock RS endpoint), and responses were then fast. This suggests the latency is largely around interacting with RS.
  • Brainstorm on how we want to handle sending large amounts of messages to ReportStream. Because we send large amounts of data to TI, we will send large amounts of data to RS. Should we let RS know before we do this? Should we just hope it isn't a problem until they yell at us? Should we edit our code so it doesn't actually send data to RS while load testing? (This is extra problematic because we want to load test our application in a realistic manner.) Since RS appears to be the major source of latency and variability in our load tests, we intend to use a mock RS endpoint so we can identify/isolate issues within TI.
    • Set up mock RS endpoints for load testing so we can get a consistent, predictable delay (see the stub sketch after this list). Maybe something like a half second, so it's not huge but would still point out if, e.g., things are running in sequence that should be in parallel and the delays are stacking up.
    • If we do this, we need to do it in a way that still allows regular traffic to work - maybe a header or flag of some kind? We don't want to need a manual deploy and to shut down all other work in an env just to do load testing.
  • Since we are relegated to click-ops (no Terraform), rename the internal load testing to be better than jeff-load-test. - @halprin
  • Look at why load tests cause 499s again.
  • Set up load test for Staging.
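
For the "get a basic load test working" and "load test the other endpoints" tasks above, a minimal Locust sketch might look like the following. The endpoint paths, environment variable name, payloads, and wait times are placeholder assumptions, not the project's actual values.

```python
# Minimal Locust sketch (assumed paths, env vars, and payloads).
import os

from locust import HttpUser, between, task


class IntermediaryUser(HttpUser):
    # self.host comes from LOCUST_HOST / --host and must include the protocol
    # (e.g. https://...), per the comment later in this thread.
    wait_time = between(1, 5)

    def on_start(self):
        # Hypothetical token fetch; the JWT-signing sketch later in the thread
        # shows one way the client assertion could be built from a secret.
        response = self.client.post(
            "/v1/auth/token",
            data={"client_assertion": os.environ.get("TI_CLIENT_ASSERTION", "")},
        )
        self.token = response.json().get("access_token", "")

    @task(3)
    def post_order(self):
        self.client.post(
            "/v1/etor/orders",
            headers={"Authorization": f"Bearer {self.token}"},
            data="<order payload>",
        )

    @task(1)
    def get_metadata(self):
        self.client.get(
            "/v1/etor/metadata/some-id",
            headers={"Authorization": f"Bearer {self.token}"},
            name="/v1/etor/metadata/{id}",  # group per-id URLs in the stats
        )
```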

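To make the mock RS idea in the list above concrete, here is one possible shape of a delay stub, purely as a sketch: Flask, the route, the response body, and the half-second delay are all assumptions for illustration and do not reflect ReportStream's real API.

```python
# Hypothetical ReportStream stand-in with a fixed, predictable delay.
# The path and response shape are placeholders, not ReportStream's real API.
import time

from flask import Flask, jsonify

app = Flask(__name__)

MOCK_DELAY_SECONDS = 0.5  # big enough to expose sequential calls stacking up


@app.route("/api/reports", methods=["POST"])  # placeholder path
def mock_report_submission():
    time.sleep(MOCK_DELAY_SECONDS)
    return jsonify({"submissionId": "mock-submission"}), 201


if __name__ == "__main__":
    app.run(port=8081)
```

The per-request toggle discussed above (e.g. a header or flag checked inside TI) would decide whether a given call goes to this stub or to the real ReportStream, so regular traffic keeps working and no manual redeploy is needed just to load test.
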
Additional Context

Load testing Internal is fine, but the real benefit is load testing Staging because it is the most like Production. However, testing Staging may be "hard" given the access restrictions. We should investigate what it takes to get around those restrictions for load testing. Alternatively, if it takes a sufficiently long time to figure that out without any progress, we could consider changing Internal's configuration to be more Production-like when doing load testing.

@JohnNKing added the devex/opex (A development excellence or operational excellence backlog item.) label on May 30, 2024
@jcrichlake (Contributor) commented:

In order to utilize our existing load tests, we need to sign up for the Azure Locust preview program. This will take a few days to complete.

Adding the blocked tag to this until we hear back.

@brick-green commented:

@jcrichlake Who are we waiting to hear back from?

@halprin self-assigned this on Dec 16, 2024
@halprin (Member) commented Dec 16, 2024

We got past the current issue with the load tests. The problem was that we were missing https:// (i.e., the protocol) in LOCUST_HOST. The load test successfully hit Internal.

We can probably start adding back the other parts of the load test.

We also need to figure out how to authenticate against the token endpoint.

There seems to be a section for secrets in the load test config, so perhaps we could use it to fill in a JWT, or perhaps the private key that is used to sign JWTs (a sketch of that idea is included below).

Updated this ticket with additional tasks and context.
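
As a sketch of that idea: if the secrets section can surface the signing key to the locustfile as an environment variable, the JWT could be built on the fly with PyJWT. The variable name, claims, and algorithm below are assumptions, not the service's actual requirements.

```python
# Hypothetical locustfile helper: sign a short-lived JWT from a private key
# supplied via the load test's secrets configuration (assumed to be exposed
# as an environment variable). Requires PyJWT with RSA support installed.
import os
import time

import jwt  # PyJWT


def build_client_assertion() -> str:
    private_key = os.environ["TI_PRIVATE_KEY_PEM"]  # assumed variable name
    now = int(time.time())
    claims = {
        "iss": "load-test-client",  # placeholder issuer/subject
        "sub": "load-test-client",
        "aud": os.environ.get("LOCUST_HOST", ""),  # must include https://
        "iat": now,
        "exp": now + 300,  # five-minute lifetime
    }
    return jwt.encode(claims, private_key, algorithm="RS256")
```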

@halprin (Member) commented Dec 17, 2024

I got the original load test file working in Azure (with some modifications), which includes authenticating against the token endpoint and hitting the other endpoints. I got the secrets working with the JWT.

We've only ever done short load tests, so I will experiment with longer load tests and make sure the Azure load test framework continues to work.

@halprin (Member) commented Dec 17, 2024

@somesylvie, is the task "Look into why failing to link metadata results in 5XXs." complete since you fixed our database in Internal? Your load test and my subsequent larger load test seem to have 0 errors.

@halprin (Member) commented Dec 17, 2024

@somesylvie, I ran a larger test after your latest one (the one described as "Bigger test"). I ran 2 engines, each with 50 users, for 10 minutes. It passed with flying colors.

What are your thoughts on marking the task "Optional: look into why our deployed environments are slow for the orders and results endpoint (is it really due to RS staging being slow?)" complete? Based on your last test and my larger test, the slowness seems to be due to RS. Thoughts?

@halprin (Member) commented Dec 17, 2024

@somesylvie, I made an additional commit to our branch azure-load-tests that messes with a thread-local ApplicationContext. I haven't tested it. You can see the differences when comparing against main.

@somesylvie (Contributor) commented:

> @somesylvie, is the task "Look into why failing to link metadata results in 5XXs." complete since you fixed our database in Internal? Your load test and my subsequent larger load test seem to have 0 errors.

Yep, I think this task is completed/resolved

@somesylvie (Contributor) commented:

> @somesylvie, I ran a larger test after your latest one (the one described as "Bigger test"). I ran 2 engines, each with 50 users, for 10 minutes. It passed with flying colors.
>
> What are your thoughts on marking the task "Optional: look into why our deployed environments are slow for the orders and results endpoint (is it really due to RS staging being slow?)" complete? Based on your last test and my larger test, the slowness seems to be due to RS. Thoughts?

Yeah, this seems plausible. If we start running into timing issues in the future, we can assess them then.

@somesylvie mentioned this issue on Dec 18, 2024
@pluckyswan (Contributor) commented:

We are closing out an existing card regarding load tests against staging. Can this be included in this card's scope? We may need to re-examine our load test assumptions. We are currently ramping up to 1k users and then hammering the various endpoints; this may be an amount of load that is either far off in the future or will require multiple instances of the intermediary running to handle it.

@halprin (Member) commented Jan 2, 2025

> We are closing out an existing card regarding load tests against staging. Can this be included in this card's scope? We may need to re-examine our load test assumptions. We are currently ramping up to 1k users and then hammering the various endpoints; this may be an amount of load that is either far off in the future or will require multiple instances of the intermediary running to handle it.

Definitely, it will be included. We'll be figuring out what a good level of load is now that we are hitting real environments. Whenever you're changing the load test to hit a different environment (in this case, Staging versus local), you should always re-evaluate what a good, realistic traffic load is. One possible shape for a configurable ramp is sketched below.
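
One way to make that re-evaluation easy to iterate on is a Locust LoadTestShape, which ramps users in stages instead of jumping straight to the full count. The stage numbers below are placeholders, not agreed-upon targets.

```python
# Sketch of a staged ramp using Locust's LoadTestShape.
# The user counts, spawn rates, and durations are placeholders to tune.
from locust import LoadTestShape


class StagedRamp(LoadTestShape):
    # (seconds from test start, target users, spawn rate per second)
    stages = [
        (120, 10, 1),    # warm up gently
        (480, 100, 5),   # hold a moderate load
        (600, 250, 10),  # brief push toward a higher target
    ]

    def tick(self):
        run_time = self.get_run_time()
        for duration, users, spawn_rate in self.stages:
            if run_time < duration:
                return users, spawn_rate
        return None  # stop the test after the last stage
```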

@jcrichlake (Contributor) commented:

Two questions I have on this card.

Does Flexion need to research a way to charge the CDC for the cost associated with the load testing? Pricing can be found at the link below:

https://azure.microsoft.com/en-us/pricing/details/load-testing/

@brick-green commented:

We cannot charge the CDC, as this is in Flexion's Internal infrastructure.
