
Infra Flake Detection Built on Wrong Assumptions #447

Open · antter opened this issue Nov 16, 2021 · 3 comments

antter commented Nov 16, 2021

Is your feature request related to a problem? Please describe.
Currently, we are trying to detect "infra flakes" by looking for a waterfall pattern. This might not be as well-motivated as we originally thought: the tests are ordered by most recent failure, which means a waterfall pattern could only appear at the beginning of the ordering, and an infra flake in the middle would just look like random tests failing.

Describe the solution you'd like
An updated model that uses statistical techniques to find infra flakes without relying on the order the tests come in.

Additional context
A detailed explanation of infra flakes is here: #1

@antter antter self-assigned this Nov 16, 2021

antter commented Nov 17, 2021

Explanation of a possible solution I'm interested in exploring:

Basically, an infra flake happens when a handful of tests fail close together in time, unexpectedly. The issue lies in the word "unexpected": if a test is already failing once every 5 runs, no single failure of it can be considered "unexpected". I think we can only deduce what is "unexpected" from a single test's history. That history is a time series, so I am thinking of building an autoregressive or moving-average model to capture the fact that recent failures make another failure more likely for that test. This way we can quantify unexpectedness.

If we have a baseline for when a test failure counts as "unexpected", all that's left is to analyze how well that baseline works and to find a way to identify several unexpected failures happening at once.
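
To make the idea concrete, here's a minimal sketch using an exponentially weighted moving average as the per-test failure-probability estimate (a crude stand-in for the AR/MA model above). Everything here is hypothetical, not an actual implementation: the `{test: [0/1 outcomes]}` data shape, the names (`ewma_failure_probs`, `suspected_infra_flakes`), and all three thresholds are illustration-only assumptions.

```python
# Sketch only: assumes each test's history is a chronological list of
# 0/1 outcomes (1 = failure), aligned across tests by run index.
from collections import defaultdict

ALPHA = 0.2              # EWMA weight on the most recent outcome
SURPRISE_CUTOFF = 0.05   # failure is "unexpected" if predicted p(fail) < 5%
MIN_COINCIDENT = 3       # flag a run when this many unexpected failures co-occur

def ewma_failure_probs(history):
    """Failure probability predicted *before* each run, from an EWMA
    of the outcomes seen so far."""
    p, probs = 0.5, []  # start from an uninformative prior
    for outcome in history:
        probs.append(p)                        # prediction precedes the outcome
        p = ALPHA * outcome + (1 - ALPHA) * p
    return probs

def suspected_infra_flakes(histories):
    """histories: {test_name: [0/1 outcomes, oldest first]}.
    Returns {run_index: [tests whose failure there was unexpected]}."""
    flagged = defaultdict(list)
    for test, history in histories.items():
        for i, (p, outcome) in enumerate(zip(ewma_failure_probs(history), history)):
            if outcome == 1 and p < SURPRISE_CUTOFF:
                flagged[i].append(test)
    return {i: tests for i, tests in flagged.items() if len(tests) >= MIN_COINCIDENT}

# Toy data: three normally stable tests all fail at run 12.
histories = {f"test_{k}": [0] * 12 + [1] + [0] * 3 for k in range(3)}
print(suspected_infra_flakes(histories))  # -> {12: ['test_0', 'test_1', 'test_2']}
```

The EWMA gives exactly the recency behavior described above: a long stretch of passes drives the predicted failure probability down, so a sudden failure scores as unexpected, while a test that fails often never drops below the cutoff.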

All of the above has a decent chance of failing entirely, though; this is a tough dataset.

One issue that keeps coming up while pondering how to classify infra flakes: it is hard to decide whether a test fails as a direct result of another test failing, or whether both fail as a result of an infra flake. The distinction is tough, and maybe not possible with this kind of dataset. I'm going to ignore this problem for now.

antter commented Nov 17, 2021

Also, a time series model may not be necessary at all. I think we could get decent results by first simply taking # failures / # attempts as the metric. I do expect to end up trying both, building off the simple model first; the time series model has the potential to be much stronger, so I'll have to do some sort of comparison at the end.
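
For comparison's sake, a sketch of that baseline, assuming the same hypothetical `{test: [0/1 outcomes]}` shape as the sketch above (names here are again illustration-only):

```python
def failure_rate(history):
    """Overall failure rate: # failures / # attempts."""
    return sum(history) / len(history) if history else 0.0

def baseline_unexpected(histories, run_index, rate_cutoff=0.05):
    """Tests that failed at run_index despite a low failure rate on every
    *other* run (excluding run_index avoids counting the failure we are
    judging against itself)."""
    flagged = []
    for test, history in histories.items():
        if run_index < len(history) and history[run_index] == 1:
            rest = history[:run_index] + history[run_index + 1:]
            if failure_rate(rest) < rate_cutoff:
                flagged.append(test)
    return flagged
```

The obvious weakness relative to the time series idea is that an overall rate ignores recency: a test that was stable for months but has been failing heavily this week still looks reliable on average.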

antter commented Nov 17, 2021

And FWIW, I don't believe any left-to-right (waterfall) pattern is necessary for an infra flake. Infra flakes seem to happen because the infrastructure is flaky in a dynamic way: a test can pass and then fail an hour later because of it. But it is also possible for the infrastructure to have an issue for just one hour, failing many tests, and then everything is fine the next time the tests come around.
