Negative submission delay #185

szhan · 2024-05-21T08:55:25Z

When looking at the metadata of the samples (reportedly collected in 2020) in the Hunt et al. (2024) dataset (version 0.4), I found some instances of negative submission delays.

I then noticed that such cases are not being filtered out before inference (see the code here).

Perhaps we should change the filtering condition to:

0 <= sample.submission_delay < max_submission_delay
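To make the proposed condition concrete, here is a minimal sketch of how it might behave. The `Sample` class and `filter_by_submission_delay` function are hypothetical stand-ins, not the project's actual code; only `submission_delay` and `max_submission_delay` come from the condition above, and the delay is assumed to be an integer number of days.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical stand-in for the project's sample metadata record.
    strain: str
    submission_delay: int  # assumed: days between collection and submission


def filter_by_submission_delay(samples, max_submission_delay):
    """Keep samples whose delay is non-negative and below the maximum."""
    return [
        s for s in samples
        if 0 <= s.submission_delay < max_submission_delay
    ]


samples = [
    Sample("A", -3),  # submitted "before" collection: dropped
    Sample("B", 0),   # same-day submission: kept
    Sample("C", 12),  # kept
    Sample("D", 30),  # at the threshold: dropped (upper bound is exclusive)
]
kept = filter_by_submission_delay(samples, max_submission_delay=30)
print([s.strain for s in kept])  # ['B', 'C']
```

Note that the lower bound is inclusive, so a sample submitted on its collection date would still pass.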


szhan · 2024-05-21T09:02:07Z

It's not entirely clear to me what the submission dates mean. The submission dates are aggregated from three sources: INSDC, GISAID, and COG UK. I'm not sure if the submission date is the date on which the submitter created a submission entry or the date on which the submitter actually hit the submit button.

szhan · 2024-05-21T09:13:41Z

Just checked some emails: the entries in ENA have first_created and first_public dates. Here we are taking the first_created dates as the submission dates.
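As a rough illustration of how a negative delay can arise, the sketch below assumes the delay is the ENA first_created date minus the collection date, in days. That definition, the function name, and the example dates are all assumptions made for illustration; they are not taken from the project's code or from real records.

```python
from datetime import date


def submission_delay_days(collection_date: str, first_created: str) -> int:
    """Days from collection to submission (assumed definition).

    Both arguments are ISO 8601 date strings (YYYY-MM-DD).
    """
    collected = date.fromisoformat(collection_date)
    created = date.fromisoformat(first_created)
    return (created - collected).days


# Illustrative dates only, not real records.
print(submission_delay_days("2020-03-10", "2020-03-25"))  # 15
print(submission_delay_days("2020-03-10", "2020-03-01"))  # -9
```

A negative value here means the recorded first_created date precedes the recorded collection date, which suggests that one of the two dates is wrong.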

jeromekelleher · 2024-05-21T09:32:21Z

Probably best to filter out negative delays for now

szhan · 2024-05-21T11:25:57Z

The 2020 trees built with and without filtering out samples with negative submission delay (n = 633) look very similar overall. I don't see a dramatic decrease in the number of reversions or immediate reversions. Also, the number of recombinants is the same (n = 26).

szhan · 2024-05-21T18:34:10Z

On a related note, I don't think that we are filtering by submission delay with this new Viridian dataset in the same way as we did before with the GISAID dataset, because the submission dates are not equivalent. I was just reading Martin's email again, and it seems that what we have as submission dates are the dates when submission entries were created on the ENA, not GISAID. I think it is quite probable that many groups submitted lots of their FastQs to the ENA (much) later than they submitted their genome sequences to GISAID. This may explain why we filter out so many samples using a max submission delay threshold of 30 days (as shown above).

I wanted to match the samples in the metadata files that Kat and Martin provided in order to see how different the submission dates are. But Kat's file contains GISAID ids and strain names, whereas Martin's file contains GenBank and ENA accessions, so there is no shared identifier to match on.

szhan · 2024-05-23T12:02:26Z

Given the amount of data being excluded using a max submission delay of 30 days, and that the GISAID submission dates and ENA submission dates don't seem equivalent, it makes sense not to rely on a submission-delay-based filter to exclude probable time travellers. Instead, let's see how much HMM cost will help.

szhan · 2024-07-03T06:32:46Z

I think we have decided to pursue using HMM cost to filter out probable time travellers. There are some signs that the HMM cost strategy helps (e.g., reducing the number of mutations in some 2020 ARGs; see #188). Unless we are going to build ARGs out of GISAID data again, we probably don't need to deal with negative submission delays, so closing this for now.

szhan closed this as completed Jul 3, 2024
