-
Notifications
You must be signed in to change notification settings - Fork 492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test(performance): make tests more deterministic by relying more on system counts #5786
Conversation
We also have a fetch (http/network calls) abstraction which could be useful in the future. |
@@ -28,17 +29,17 @@ export async function prepareRepoData( | |||
repoRootPaths: string[], | |||
workspaceFolders: CurrentWsFolders, | |||
telemetry: TelemetryHelper, | |||
span: Span<AmazonqCreateUpload> | |||
span: Span<AmazonqCreateUpload>, | |||
zip: AdmZip = new AdmZip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: AdmZip will be replaced by zip.js #4769
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see. Do you recommend I remove the AdmZip call counting in the meantime or leave it until that other PR is done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went ahead and deleted the AdmZip work. Once a new zip is implemented, this can be augmented to spy on it as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#4769 is merged now
/runIntegrationTests |
return performanceTest( | ||
{ | ||
darwin: { | ||
userCpuUsage: 100, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like a lot of these values are just declared once and then used in darwin/linux/win32. Does it make sense to just have a field called default or something that can be passed in and used as the default for everything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, now that the thresholds are looser I think it makes sense to factor this out to reduce clutter. I wrote a helper function to reduce this.
assert.ok(result) | ||
assert.strictEqual(Buffer.isBuffer(result.zipFileBuffer), true) | ||
assert.strictEqual(telemetry.repositorySize, expectedSize) | ||
assert.strictEqual(result.zipFileChecksum.length, 44) | ||
|
||
assert.ok(getFsCallsUpperBound(setup.fsSpy) <= setup.numFiles * 4, 'total system calls should be under 4 per file') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a lot of different numbers that we are comparing the upper bound to. Out of curiosity how did you determine the number to use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If t
is the upper bound method (getFsCallsUpperBound
), and n
is the number of files its working with, I noticed that t(n)
is always linear (at least with these tests). So we have something like t(n) = an + c
for each test for some values a
and c
.
To decide on the upper bound for the test, I found the true values of a
and c
for each test, then basically just took (a+1)n
as the upper bound. Usually c
is pretty small (<5), so we would really have to increase the number of fs
calls on a per file basis to break this bound.
/runIntegrationTests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking forward to seeing ci green again 😎
afterEach(function () { | ||
sinon.restore() | ||
}) | ||
performanceTestWrapper(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach of simplifying it this way!
*/ | ||
export function getFsReadsUpperBound(fsSpy: sinon.SinonSpiedInstance<FileSystem>): number { | ||
return ( | ||
fsSpy.readFileBytes.callCount + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought: do we actually need spies or could our fs.ts module just increment a count? Would that save code, and posssibly be more reliable (no test setup required)?
Problem
Performance tests are currently flaky, making them a poor signal for performance regressions within the code base. Additionally, false alarms slow down development of unrelated features as test failures keep CI from being "green".
Therefore, rather than relying only on system usage thresholds for performance regressions, we can count the number of high-risk / potentially slow operations made by the code as a deterministic measure of its performance. However, directly counting each individual potentially slow operation used within the performance tests highly couples the test to the implementation details, making the tests less effective if the implementation changes.
Therefore, the goal of this PR is the following:
Solution
To meet goal (1), we increase the thresholds of the tests to decrease the changes of a false alarm.
To meet goal (2), we count expensive operations. But, to avoid tying it to the implementation details, we count the expensive operations using somewhat-loose upper bounds. Thus, implementation changes modifying the exact number of expensive operations by a small constant do not set a false alarm. However, if they increase the number of expensive operations by a multiplicative factor, the upper bound will alert us.
As an example, we don't want the test to fail if it makes 5-10 more file system calls when working with a few hundred files, but we do want the test to fail if it makes 2x the number of files system calls. Therefore, in the code the bounds are often described as "operations per file", since it is the multiplicative increase we are concerned about. This allows us to achieve goal (3).
Implementation Details
fs
module). Some other examples include the use ofzip
libraries or loading large files into memory.read
andwrite
bounds. This granularity allows us to assert that specific code paths do not modify any files.Notes
AdmZip
removed in refactor: replace archiver with @zip.js/zip.js #4769 , so we don't address spying on it here. Oncezip.js
is implemented, we can spy on it in a follow up.License: I confirm that my contribution is made under the terms of the Apache 2.0 license.