Systematic error in how the rolling average of sequences is calculated at beginning of timeline #352

corneliusroemer · 2021-05-04T19:12:35Z

You make a systematic error when calculating rolling averages in the variant graph at the beginning of the time-window.

You always overestimate the variant share at the beginning.

Why? The window starts off with a length of only 1 at the beginning, so it ignores the sequences from before the first occurrence that didn't contain the variant. This biases the variant share systematically upward, by quite a lot, causing misleading graphs. This is a real methodological problem.
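To make the size of the bias concrete, here is a small arithmetic sketch. The counts are invented for illustration (a steady 100 sequences per day, one variant sequence on the day of first detection) and are not taken from any real graph.

```python
# Hypothetical counts, for illustration only.
# By definition the variant count is 0 on every day before first detection.
total_before = [100] * 6   # total sequences on the 6 days before first detection
variant_first_day = 1      # variant sequences on the day of first detection
total_first_day = 100      # total sequences on that same day

# Window of length 1: only the first-detection day is counted.
truncated_share = variant_first_day / total_first_day

# Full 7-day window: the 6 earlier days add sequences but no variant hits.
full_window_share = variant_first_day / (sum(total_before) + total_first_day)

print(f"window of 1 day:  {truncated_share:.4%}")
print(f"window of 7 days: {full_window_share:.4%}")
```

With these assumed numbers the one-day window reports a share about seven times larger than the full window does, which matches the "starts high, then drops" shape.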

I can understand why you're doing this, because you only send the data starting from the first occurrence, but this is not ok.

Two ways to fix it: start the rolling average only once the window is at full length (7 days), or include the 7 days before the first occurrence in the calculation of the rolling average.
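Both options can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the actual outbreak.info API code: the function names `rolling_share` and `rolling_share_padded` and the argument `total_before` are hypothetical, and the padding uses the 6 days that a 7-day window ending on the first-detection day needs.

```python
from typing import List, Optional, Sequence

WINDOW = 7  # rolling window length in days


def rolling_share(variant: Sequence[int], total: Sequence[int],
                  window: int = WINDOW) -> List[Optional[float]]:
    """Option 1: variant share over a trailing window.

    Returns None for days on which the window is not yet full,
    so no line is drawn for the first window - 1 days.
    """
    shares: List[Optional[float]] = []
    for i in range(len(variant)):
        if i + 1 < window:
            shares.append(None)  # window not full yet: show nothing
            continue
        v = sum(variant[i - window + 1:i + 1])
        t = sum(total[i - window + 1:i + 1])
        shares.append(v / t if t else None)
    return shares


def rolling_share_padded(variant: Sequence[int], total: Sequence[int],
                         total_before: Sequence[int],
                         window: int = WINDOW) -> List[Optional[float]]:
    """Option 2: prepend the days before first detection.

    total_before holds total sequence counts for the window - 1 days
    before the first occurrence; the variant count on those days is 0.
    """
    pad = window - 1
    if len(total_before) != pad:
        raise ValueError(f"expected {pad} days of totals before first detection")
    padded_variant = [0] * pad + list(variant)
    padded_total = list(total_before) + list(total)
    # Drop the padded days so the result starts on the first-detection day.
    return rolling_share(padded_variant, padded_total, window)[pad:]


# Hypothetical data: index 0 is the day of first detection.
variant_counts = [1, 2, 3, 4, 5, 6, 7, 8]
total_counts = [100] * 8
totals_before = [100] * 6

option1 = rolling_share(variant_counts, total_counts)
option2 = rolling_share_padded(variant_counts, total_counts, totals_before)
```

Option 1 loses the first 6 days of the line, while option 2 keeps the date of first detection visible at the cost of fetching 6 extra days of totals. From day 7 onward the two agree.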

Either way, this is a high priority issue in my eyes because it is causing a real systematic error that leads to misinterpretation of the data that the naive viewer is not aware of (at least I wasn't until now, having looked at probably a hundred of your graphs, which I love by the way, don't get me wrong!).

Here you can see the problem documented. Look at a couple of graphs and you'll notice that it always starts high and goes down, for every single black line on every graph. Looking at the numbers shows why: the window starts only on day 1, not on day -6 as it should.

gkarthik · 2021-05-04T19:27:55Z

Thank you for raising this issue. I will look into it this week.
As you correctly pointed out, we would need to change the way the window is computed for the first day of detection.

corneliusroemer · 2021-05-04T19:36:25Z

Great, I'll also have a think, happy to discuss.

It'd be 6 extra days you'd need to pull, or you could just start the line on day 7 as opposed to day 1, which would be the obvious quick fix.

I'd say it may be worth considering the quick fix of not showing the line for the first 6 days until a permanent solution is found. Otherwise the graphs are systematically biased, which is not good for the sake of science, trust etc.

gkarthik · 2021-05-04T20:03:34Z

Yes, I am currently leaning towards 6 extra days before since we want to show the initial date of detection.

corneliusroemer · 2021-05-25T11:16:27Z

Any progress on this? It's a real bug that makes the graphs systematically wrong and makes people draw wrong conclusions - therefore in my view high priority.

I've already submitted a PR that, with a fairly high chance, should fix the issue immediately.

gkarthik · 2021-05-25T19:51:52Z

Unfortunately, there are a lot of things in the pipeline 😓 but this should be resolved in the next release.

gkarthik self-assigned this May 4, 2021

gkarthik added the API label May 4, 2021

corneliusroemer linked a pull request May 6, 2021 that will close this issue

WIP: Started rolling mean for lineage and total count 6 days before outbreak-info/outbreak.api#12

Open

flaneuse added the bjorn label Jun 17, 2021
