Systematic error in how the rolling average of sequences is calculated at beginning of timeline #352

Open
corneliusroemer opened this issue May 4, 2021 · 5 comments · May be fixed by outbreak-info/outbreak.api#12

@corneliusroemer

You make a systematic error when calculating rolling averages in the variant graph at the beginning of the time window.

You always overestimate the variant share at the beginning.

Why? The window starts off with a length of only 1 at the beginning, so it does not take into account the sequences from before the first occurrence, none of which contained the variant. This biases the variant share systematically upwards, by quite a lot, causing misleading graphs. This is a real methodological problem.
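
To make the effect concrete, here is a minimal sketch with made-up daily counts. The function name and the numbers are illustrative assumptions, not the actual outbreak.info code or data. It compares a rolling share computed on data that starts at the first occurrence with one computed on the full series.

```python
def truncated_rolling_share(variant, total, window=7):
    """Rolling variant share; the window is cut short at the start of the series."""
    shares = []
    for i in range(len(total)):
        lo = max(0, i - window + 1)  # window shrinks when fewer than `window` days exist
        v = sum(variant[lo:i + 1])
        t = sum(total[lo:i + 1])
        shares.append(v / t if t else None)
    return shares


# Hypothetical counts: 100 sequences per day, variant first seen on day index 6.
total = [100] * 9
variant = [0, 0, 0, 0, 0, 0, 5, 1, 1]

first = next(i for i, v in enumerate(variant) if v > 0)

# Data sent from the first occurrence onwards: the window has length 1 on that day.
biased = truncated_rolling_share(variant[first:], total[first:])

# Same days, but the window also sees the variant-free days before the first occurrence.
unbiased = truncated_rolling_share(variant, total)[first:]

print(biased)    # starts at 5/100 and falls
print(unbiased)  # starts at 5/700 and rises
```

With these counts the truncated version starts high and drops, while the full-window version starts low and rises, which is the pattern visible in the graphs.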

I can understand why you're doing this, because you only send the data starting from the first occurrence, but this is not ok.

Two ways to fix it: start the rolling average only once the window is of full length (7 days), or include the 6 days before the first occurrence in the calculation of the rolling average.
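
Both options can be sketched roughly as follows. This is an illustration under assumptions: `rolling_share_fixed` and its `include_prior_days` flag are hypothetical names, the counts are invented, and the real API may structure its data differently.

```python
def rolling_share_fixed(variant, total, window=7, include_prior_days=True):
    """Rolling variant share from the first detection onwards.

    include_prior_days=True: the window may reach back before the first detection.
    include_prior_days=False: only data from the first detection is used, and no
    value is reported until the window is full.
    """
    first = next(i for i, v in enumerate(variant) if v > 0)
    start = max(0, first - (window - 1)) if include_prior_days else first
    out = []
    for i in range(first, len(total)):
        lo = max(start, i - window + 1)
        if i - lo + 1 < window:
            out.append(None)  # window not yet full: show no value
            continue
        t = sum(total[lo:i + 1])
        out.append(sum(variant[lo:i + 1]) / t if t else None)
    return out


# Hypothetical counts: 100 sequences per day, variant first seen on day index 6.
total = [100] * 14
variant = [0] * 6 + [5, 1, 1, 2, 3, 4, 6, 8]

option_a = rolling_share_fixed(variant, total, include_prior_days=False)
option_b = rolling_share_fixed(variant, total, include_prior_days=True)
```

In this sketch the first option leaves the first 6 days blank, and the second shows a value from the day of first detection. The two agree from the point where the window is full in both.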

Either way, this is a high priority issue in my eyes because it is causing a real systematic error that leads to misinterpretation of the data that the naive viewer is not aware of (at least I wasn't until now, having looked at probably a hundred of your graphs, which I love by the way, don't get me wrong!).

Here you can see the problem documented. Look at a couple of graphs and you'll notice that the line always starts high and goes down: every single black line on every graph. Looking at the numbers shows why: the window starts only on day 1, not on day -6 as it should.

[4 images attached]

@gkarthik gkarthik self-assigned this May 4, 2021
@gkarthik
Member

gkarthik commented May 4, 2021

Thank you for raising this issue. I will look into it this week.
As you correctly pointed out, we would need to change the way that window is computed for the first day of detection.

@gkarthik gkarthik added the API label May 4, 2021
@corneliusroemer
Author

Great, I'll also have a think, happy to discuss.

It'd be 6 extra days you'd need to pull, or you could just start the line on day 7 instead of day 1, which would be the obvious quick fix.

I'd say it may be worth applying the quick fix (not showing the line for the first 6 days) until a permanent solution is found. Otherwise the graphs are systematically biased, which is not good for science, trust, etc.

@gkarthik
Member

gkarthik commented May 4, 2021

Yes, I am currently leaning towards 6 extra days before since we want to show the initial date of detection.

@corneliusroemer
Author

Any progress on this? It's a real bug that makes the graphs systematically wrong and makes people draw wrong conclusions - therefore in my view high priority.

I've already submitted a PR that has a fairly high chance of fixing the issue immediately.

@gkarthik
Member

Unfortunately there are a lot of things in the pipeline 😓, but this should be resolved in the next release.

@flaneuse flaneuse added the bjorn label Jun 17, 2021