-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion on Masking the regions with coverage less than 20X #35
Comments
I came across a recent collaborative preprint by NCBI and several other institutes on best protocols for generation of Consensus assembly (see here). They recommend the following protocol |
I also observed that you assign the Variants just based on VAF and not Allele Depth. For Example a variant might be supported by just 10X reads out of which 9 support it's presence. In that case VAF will be 0.8 but the depth at that region is not enough to say with confidence. Another example from real data
Here we have |
Just added in #37 the FASTA sequences to the output. Will follow up on the rest, the filtering by depth requires a thorough analysis from our side. The whole dataset of mutations in tabular format is publicly available here https://covigator.tron-mainz.de/download/variant_observation.csv.gz, that contains the VAF, DP and AC annotations for every variant call so the impact of such filtering could be evaluated there. |
Hi @priesgo. Were you able to check the impact of such filtering. I realized that using criteria |
Apologies for ultra-late response @Rohit-Satyam , we have not looked into this yet. We will eventually get here. FYI @johausmann |
Stumbling upon this here. As a consensus sequence consumer - I'm very aware of some pipelines having a strong tendency to erroneously call reference when we know that it's unlikely to be the case due to phylogenetic placement. What @Rohit-Satyam explains appears to be a possible mechanism to explain that. @priesgo do you have a rough indication of how many sequences in GISAID were produced with this pipeline? Are there some labs that you know are using it? Unfortunately, submitters don't usually point out the pipeline they used. |
Dear @priesgo
While going through the pipeline, I realized that the VCF2FASTA does not masks the regions of low coverage. In amplicon sequencing data, there might be regions deprived of reads. I think it make sense to mask such regions with Ns that we know are covered by less than 20 reads as we can't say about variant status in that region. Using the reference sequence in such regions might indicate that there is no variant in that region.
Besides, could you enable the pipeline to copy these fasta assemblies as well to the result directory. I tried using
--keep_intermediate
but it didn't help. I can see there is nopublishDir "${params.output}", mode: "copy"
forVCF2FASTA
process.EDIT 1: Also, I think the code take the normalized VCFs and without filtering the Low Frequency and Subclonal variants, it includes all of them in the assembly. Is this assembly good for GISAID submission? Because as you say in your documentation
Other low quality mutations are removed from the output.
The text was updated successfully, but these errors were encountered: