Filtering out samples in Viridian v0.4 dataset #204

Open
szhan opened this issue Jul 29, 2024 · 9 comments


szhan commented Jul 29, 2024

Before doing runs, I have been filtering out samples in the Viridian dataset based on two criteria: (1) having full-precision collection dates, and (2) having at most 800 Ns (excluding gaps) in the aligned consensus sequence (i.e. disregarding insertions). A better way is to exclude problematic sites before filtering by the maximum N criterion.
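To make the second criterion concrete, here is a minimal sketch of counting Ns in an aligned sequence while skipping a set of masked sites. The function names, the 0-based `masked_sites` set, and the toy sequence are assumptions for illustration only; this is not how sc2ts does it.

```python
def count_missing(aligned_seq, masked_sites=frozenset()):
    """Count 'N' characters in an aligned sequence, skipping masked sites.

    aligned_seq: string in reference coordinates (insertions already dropped).
    masked_sites: 0-based positions to ignore (hypothetical problematic sites).
    """
    return sum(
        1
        for pos, base in enumerate(aligned_seq)
        if base == "N" and pos not in masked_sites
    )


def passes_max_n(aligned_seq, max_n=800, masked_sites=frozenset()):
    """Return True if the sequence has at most max_n Ns outside masked sites."""
    return count_missing(aligned_seq, masked_sites) <= max_n


seq = "ACGTNN-NACGT"  # toy alignment; '-' is a gap and is not counted
print(count_missing(seq))                       # 3
print(count_missing(seq, masked_sites={4, 5}))  # 1
```

Masking first means that Ns at sites which would be excluded anyway no longer count against a sample.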


szhan commented Jul 30, 2024

I'll take a look at the breakdown of samples filtered out at varying values of the maximum N threshold. I went with 800 Ns initially, because it is roughly two amplicons (e.g. if the terminal amplicons drop out). Maybe it's tossing out too many samples.
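A rough sketch of such a breakdown, assuming the per-sample N counts have already been computed. The counts and thresholds below are made up for illustration.

```python
from bisect import bisect_right


def retained_by_threshold(n_counts, thresholds):
    """Map each threshold to the number of samples with at most that many Ns."""
    sorted_counts = sorted(n_counts)
    return {t: bisect_right(sorted_counts, t) for t in thresholds}


# Made-up per-sample N counts, not real Viridian numbers.
n_counts = [0, 12, 350, 400, 799, 800, 801, 1500, 29000]

for threshold, kept in retained_by_threshold(n_counts, [400, 800, 1200]).items():
    dropped = len(n_counts) - kept
    print(f"max_n={threshold}: kept {kept}, dropped {dropped}")
```

Sorting once and using `bisect_right` keeps the sweep cheap even when many thresholds are tried.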


szhan commented Jul 31, 2024

I've been filtering out the samples before importing the alignments. It is probably better to implement a simple filter, based on the number of Ns, that filters out samples during inference.

jeromekelleher commented

Agreed - let's keep as much of the filtering and data pre-processing logic within sc2ts as we can


szhan commented Jul 31, 2024

Filtering samples by the imported sequence alignments would involve grabbing the alignment from the alignment store and then processing it in preprocess_and_match_alignments. This would require an additional pass over the Sample objects, I think, because the genotype matrix which goes into HMM matching is preset.


szhan commented Jul 31, 2024

Or maybe keep a boolean array to keep track of which samples pass filters, and then use it to subset the genotype matrix before input to HMM matching.
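Something along these lines, as a minimal NumPy sketch. The matrix layout (rows are samples, columns are sites), the `-1` missing-data code, and the variable names are assumptions here, not the actual sc2ts representation.

```python
import numpy as np

MISSING = -1      # assumed code for missing data
max_missing = 2   # toy threshold for this tiny example

# Toy genotype matrix: one row per sample, one column per site (assumed layout).
genotypes = np.array(
    [
        [0, 1, MISSING, 0],
        [MISSING, MISSING, MISSING, 0],
        [1, 0, 0, 0],
    ],
    dtype=np.int8,
)
sample_ids = np.array(["s1", "s2", "s3"])

# Boolean array recording which samples pass the filter.
num_missing = np.sum(genotypes == MISSING, axis=1)
passes = num_missing <= max_missing

# Subset both the matrix and the sample list with the same mask,
# so rows and sample IDs stay aligned going into matching.
filtered_genotypes = genotypes[passes]
filtered_ids = sample_ids[passes]
print(filtered_ids.tolist())  # ['s1', 's3']
```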

jeromekelleher commented

> Filtering samples by the imported sequence alignments would involve grabbing the alignment from the alignment store and then processing it in preprocess_and_match_alignments. This would require an additional pass over the Sample objects, I think, because the genotype matrix which goes into HMM matching is preset.

That's OK I think - we can easily break preprocess_and_match_alignments into steps, or add some complexity where we only pass alignments that meet QC requirements on to the matching step.


szhan commented Aug 1, 2024

Also, there are a number of entries in the metadata file that do not have full-precision dates. I have been filtering them out before importing the metadata. It would be better if this, too, were done within sc2ts.
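For reference, a minimal sketch of a full-precision date check, assuming dates are stored as strings in YYYY-MM-DD form. The `strain` and `date` field names and the records are hypothetical.

```python
from datetime import datetime


def has_full_precision_date(date_str):
    """Return True only for a valid calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(date_str, "%Y-%m-%d")
    except (TypeError, ValueError):
        # Covers partial dates ("2020-03", "2020"), invalid dates, and None.
        return False
    # strptime also accepts unpadded fields such as "2020-3-5";
    # the round trip rejects anything that is not zero-padded.
    return parsed.date().isoformat() == date_str


# Hypothetical metadata records.
records = [
    {"strain": "A", "date": "2020-03-15"},
    {"strain": "B", "date": "2020-03"},
    {"strain": "C", "date": "2020"},
    {"strain": "D", "date": "2021-02-30"},
]
kept = [r["strain"] for r in records if has_full_precision_date(r["date"])]
print(kept)  # ['A']
```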


szhan commented Aug 1, 2024

It seems that both of these filters (and any other filter on the metadata and alignments) can be done in preprocess_and_match_alignments. Alternatively, we could refactor it into preprocess_samples and match_alignments, and implement the filters in preprocess_samples.
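A very rough sketch of what that split could look like. The `Sample` dataclass here is a stand-in rather than the sc2ts class, the function bodies are placeholders, and only the two function names come from the proposal above.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for a sample record; not the sc2ts Sample class."""
    strain: str
    date: str
    alignment: str


def preprocess_samples(samples, max_missing=800):
    """Yield only the samples whose alignment passes the missing-data filter."""
    for sample in samples:
        if sample.alignment.count("N") <= max_missing:
            yield sample


def match_alignments(samples):
    """Placeholder for the HMM matching step; just records which samples arrive."""
    return [sample.strain for sample in samples]


samples = [
    Sample("s1", "2020-03-15", "ACGTN"),
    Sample("s2", "2020-03-15", "NNNNN"),
]
# Only samples that pass QC reach the matching step.
matched = match_alignments(preprocess_samples(samples, max_missing=2))
print(matched)  # ['s1']
```

With the filters isolated in one step, the matching step only ever sees samples that passed QC, so no extra pass over the Sample objects is needed afterwards.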


szhan commented Aug 1, 2024

Hmm, actually, about the full-precision dates from metadata, I don't think get retrieves entries by comparing parsed dates. It just retrieves entries by comparing the dates as strings. So, I don't think it needs to be modified.
