Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(performance): make tests more deterministic by relying more on system counts #5786

Merged
merged 39 commits into from
Oct 23, 2024

Conversation

Hweinstock
Copy link
Contributor

@Hweinstock Hweinstock commented Oct 15, 2024

Problem

Performance tests are currently flaky, making them a poor signal for performance regressions within the code base. Additionally, false alarms slow down development of unrelated features as test failures keep CI from being "green".

Therefore, rather than relying only on system usage thresholds for performance regressions, we can count the number of high-risk / potentially slow operations made by the code as a deterministic measure of its performance. However, directly counting each individual potentially slow operation used within the performance tests highly couples the test to the implementation details, making the tests less effective if the implementation changes.

Therefore, the goal of this PR is the following:

  1. decrease performance test flakiness by increasing thresholds.
  2. increase performance test effectiveness by relying on deterministic measures.
  3. avoid coupling the tests to the implementation details.

Solution

To meet goal (1), we increase the thresholds of the tests to decrease the changes of a false alarm.

To meet goal (2), we count expensive operations. But, to avoid tying it to the implementation details, we count the expensive operations using somewhat-loose upper bounds. Thus, implementation changes modifying the exact number of expensive operations by a small constant do not set a false alarm. However, if they increase the number of expensive operations by a multiplicative factor, the upper bound will alert us.

As an example, we don't want the test to fail if it makes 5-10 more file system calls when working with a few hundred files, but we do want the test to fail if it makes 2x the number of files system calls. Therefore, in the code the bounds are often described as "operations per file", since it is the multiplicative increase we are concerned about. This allows us to achieve goal (3).

Implementation Details

  • The most common "expensive operation" we count is file system calls (through our fs module). Some other examples include the use of zip libraries or loading large files into memory.
  • We separate the upper bounds for the file system into read and write bounds. This granularity allows us to assert that specific code paths do not modify any files.

Notes


License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Hweinstock Hweinstock changed the title tests(performance): make tests more deterministic by relying more on system counts test(performance): make tests more deterministic by relying more on system counts Oct 15, 2024
@justinmk3
Copy link
Contributor

  • The most common "expensive operation" we count is file system calls (through our fs module).

We also have a fetch (http/network calls) abstraction which could be useful in the future.

@@ -28,17 +29,17 @@ export async function prepareRepoData(
repoRootPaths: string[],
workspaceFolders: CurrentWsFolders,
telemetry: TelemetryHelper,
span: Span<AmazonqCreateUpload>
span: Span<AmazonqCreateUpload>,
zip: AdmZip = new AdmZip()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: AdmZip will be replaced by zip.js #4769

Copy link
Contributor Author

@Hweinstock Hweinstock Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see. Do you recommend I remove the AdmZip call counting in the meantime or leave it until that other PR is done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went ahead and deleted the AdmZip work. Once a new zip is implemented, this can be augmented to spy on it as needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#4769 is merged now

@Hweinstock
Copy link
Contributor Author

/runIntegrationTests

@Hweinstock Hweinstock marked this pull request as ready for review October 16, 2024 17:34
@Hweinstock Hweinstock requested review from a team as code owners October 16, 2024 17:34
return performanceTest(
{
darwin: {
userCpuUsage: 100,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like a lot of these values are just declared once and then used in darwin/linux/win32. Does it make sense to just have a field called default or something that can be passed in and used as the default for everything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, now that the thresholds are looser I think it makes sense to factor this out to reduce clutter. I wrote a helper function to reduce this.

assert.ok(result)
assert.strictEqual(Buffer.isBuffer(result.zipFileBuffer), true)
assert.strictEqual(telemetry.repositorySize, expectedSize)
assert.strictEqual(result.zipFileChecksum.length, 44)

assert.ok(getFsCallsUpperBound(setup.fsSpy) <= setup.numFiles * 4, 'total system calls should be under 4 per file')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a lot of different numbers that we are comparing the upper bound to. Out of curiosity how did you determine the number to use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If t is the upper bound method (getFsCallsUpperBound), and n is the number of files its working with, I noticed that t(n) is always linear (at least with these tests). So we have something like t(n) = an + c for each test for some values a and c.

To decide on the upper bound for the test, I found the true values of a and c for each test, then basically just took (a+1)n as the upper bound. Usually c is pretty small (<5), so we would really have to increase the number of fs calls on a per file basis to break this bound.

@Hweinstock
Copy link
Contributor Author

/runIntegrationTests

Copy link
Contributor

@jpinkney-aws jpinkney-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to seeing ci green again 😎

afterEach(function () {
sinon.restore()
})
performanceTestWrapper(10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach of simplifying it this way!

@Hweinstock Hweinstock merged commit 16d19c1 into aws:master Oct 23, 2024
30 checks passed
@Hweinstock Hweinstock deleted the pTests/systemSpy branch October 23, 2024 18:09
*/
export function getFsReadsUpperBound(fsSpy: sinon.SinonSpiedInstance<FileSystem>): number {
return (
fsSpy.readFileBytes.callCount +
Copy link
Contributor

@justinmk3 justinmk3 Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought: do we actually need spies or could our fs.ts module just increment a count? Would that save code, and posssibly be more reliable (no test setup required)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants