Negative submission delay #185
Comments
It's not entirely clear to me what the submission dates mean. The submission dates are aggregated from three sources: INSDC, GISAID, and COG-UK. I'm not sure whether the submission date is the date on which the submitter created a submission entry or the date on which the submitter actually hit the submit button.
Just checking some emails, the entries in ENA have
Probably best to filter out samples with a negative submission delay for now.
The 2020 trees with and without filtering samples with negative submission delay (n = 633) look very similar overall. I don't see a dramatic decrease in the number of reversions or immediate reversions. Also, the number of recombinants is the same (n = 26).
On a related note, I don't think that filtering by submission delay on this new Viridian dataset is equivalent to what we did before with the GISAID dataset, because the submission dates are not equivalent. I was just reading Martin's email again, and it seems that what we have as submission dates are the dates when submission entries were created on the ENA, not on GISAID. I think it is quite probable that many groups submitted their FASTQ files to the ENA (much) later than they submitted their genome sequences to GISAID. This may explain why we filter out so many samples using a maximum submission delay of 30 days (as shown above). I wanted to match the samples in the metadata files that Kat and Martin provided to see how different the submission dates are, but Kat's file contains GISAID IDs and strain names, whereas Martin's file contains GenBank and ENA accessions, so the two files cannot be joined directly.
Given the amount of data being excluded using a maximum submission delay of 30 days, and that the GISAID and ENA submission dates don't seem equivalent, it makes sense not to rely on a submission-delay-based filter to exclude probable time travellers. Instead, let's see how much HMM cost will help.
I think we have decided to pursue using HMM cost to filter out probable time travellers. There are some signs that the HMM cost strategy helps (e.g., reducing the number of mutations in some 2020 ARGs; see #188). Unless we are going to build ARGs out of GISAID data again, we probably don't need to deal with negative submission delays, so closing this for now.
When looking at the metadata of the samples (reportedly collected in 2020) from the data (version 0.4) from Hunt et al. (2024), I found some instances of negative submission delays.
I then noticed that such cases are not being filtered out before inference (see the code here).
Perhaps we should change the filtering condition to:
0 <= sample.submission_delay < max_submission_delay
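To illustrate how the proposed condition would behave, here is a minimal sketch. It assumes a hypothetical `Sample` record with `collection_date` and `submission_date` fields and a delay measured in whole days; the class, the field names, and the `filter_by_submission_delay` helper are illustrative only and are not taken from the project's code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Sample:
    # Hypothetical record; the field names are assumptions for illustration.
    strain: str
    collection_date: date
    submission_date: date

    @property
    def submission_delay(self) -> int:
        # Delay in days; negative when the submission date precedes collection.
        return (self.submission_date - self.collection_date).days


def filter_by_submission_delay(samples, max_submission_delay=30):
    """Keep samples whose delay lies in the half-open range [0, max_submission_delay)."""
    return [s for s in samples if 0 <= s.submission_delay < max_submission_delay]


samples = [
    Sample("A", date(2020, 3, 1), date(2020, 3, 10)),  # delay of 9 days: kept
    Sample("B", date(2020, 3, 1), date(2020, 2, 20)),  # negative delay: dropped
    Sample("C", date(2020, 3, 1), date(2020, 6, 1)),   # delay over 30 days: dropped
]
kept = filter_by_submission_delay(samples)
print([s.strain for s in kept])
```

The chained comparison excludes negative delays explicitly, whereas a condition that only checks `sample.submission_delay < max_submission_delay` would let them through.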