Negative submission delay #185

Closed
szhan opened this issue May 21, 2024 · 7 comments

Comments

@szhan
Contributor

szhan commented May 21, 2024

When looking at the metadata of the samples (reportedly collected in 2020) in version 0.4 of the data from Hunt et al. (2024), I found some instances of negative submission delays.

[Image: Screenshot 2024-05-21 at 09 09 16]

I then noticed that such cases are not being filtered out before inference (see the code here).

Perhaps we should modify the filtering condition with:

0 <= sample.submission_delay < max_submission_delay
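
To make the proposed condition concrete, here is a minimal sketch. The `Sample` class, the `keep_sample` helper, and the example values are hypothetical stand-ins, not the project's actual code; the only part taken from this issue is the comparison itself.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical stand-in for the project's sample record.
    strain: str
    submission_delay: int  # days between collection and submission


def keep_sample(sample, max_submission_delay=30):
    # Reject negative delays as well as delays at or above the threshold.
    return 0 <= sample.submission_delay < max_submission_delay


# Made-up samples covering both edges of the accepted range.
samples = [Sample("A", -12), Sample("B", 0), Sample("C", 29), Sample("D", 30)]
kept = [s.strain for s in samples if keep_sample(s)]
print(kept)
```

With the chained comparison, a delay of 0 is kept and a delay equal to the threshold is dropped, so the upper bound behaves as an exclusive limit.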

@szhan
Contributor Author

szhan commented May 21, 2024

It's not entirely clear to me what the submission dates mean. The submission dates are aggregated from three sources: INSDC, GISAID, and COG UK. I'm not sure if the submission date is the date on which the submitter created a submission entry or the date on which the submitter actually hit the submit button.

@szhan
Contributor Author

szhan commented May 21, 2024

Having just checked some emails: the entries in ENA have first_created and first_public dates. Here we are taking the first_created dates as the submission dates.
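
For reference, a rough sketch of how a negative delay can arise, assuming the delay is defined as the submission date (here, the ENA first_created date) minus the collection date, in days. The function name and the dates are made up for illustration and are not taken from the dataset.

```python
from datetime import date


def submission_delay(collection_date, submission_date):
    # Delay in days; negative when the recorded submission date
    # precedes the recorded collection date.
    collected = date.fromisoformat(collection_date)
    submitted = date.fromisoformat(submission_date)
    return (submitted - collected).days


print(submission_delay("2020-03-10", "2020-03-25"))  # 15
print(submission_delay("2020-06-01", "2020-05-20"))  # -12
```

Under this definition, a negative value means the entry was created before the stated collection date, which points to an error in one of the two dates rather than a real delay.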

@jeromekelleher
Owner

Probably best to filter negative for now

@szhan
Contributor Author

szhan commented May 21, 2024

The 2020 trees built with and without filtering out samples with a negative submission delay (n = 633) look very similar overall. I don't see a dramatic decrease in the number of reversions or immediate reversions. Also, the number of recombinants is the same (n = 26).

@szhan
Contributor Author

szhan commented May 21, 2024

On a related note, I don't think that filtering by submission delay does the same thing with this new Viridian dataset as it did before with the GISAID dataset, because the submission dates are not equivalent. I was just reading Martin's email again, and it seems that what we have as submission dates are the dates when the submission entries were created on the ENA, not on GISAID. I think it is quite probable that many groups submitted their FASTQ files to the ENA (much) later than they submitted their genome sequences to GISAID. This may explain why we filter out so many samples when using a maximum submission delay of 30 days (as shown above).

I wanted to match the samples in the metadata files that Kat and Martin provided in order to see how different the submission dates are. But Kat's file contains GISAID IDs and strain names, whereas Martin's file contains GenBank and ENA accessions, so there is no shared identifier to match on.

@szhan
Contributor Author

szhan commented May 23, 2024

Given the amount of data being excluded using a maximum submission delay of 30 days, and that the GISAID and ENA submission dates don't seem equivalent, it makes sense not to rely on a submission-delay-based filter to exclude probable time travellers. Instead, let's see how much the HMM cost helps.

@szhan
Contributor Author

szhan commented Jul 3, 2024

I think we have decided to pursue using HMM cost to filter out probable time travellers. There are some signs that the HMM cost strategy helps (e.g., reducing the number of mutations in some 2020 ARGs; see #188). Unless we are going to build ARGs out of GISAID data again, we probably don't need to deal with negative submission delays, so closing this for now.

@szhan szhan closed this as completed Jul 3, 2024