-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
analysing within-host diversity #44
Comments
Greetings, @aspitaleri! Thanks for the question and for using SNPGenie. The approach you describe does correspond to the "within-group" analysis. However, which approach to use depends on several factors, and it seems to me the "pooled within-host approach" i.e. the original snpgenie.pl is ideal here. If your samples are high-coverage (enough to confidently call within-sample single nucleotide variants), and the reads can be aligned to a common reference genome, then you can use SNP calling software to generate a SNP report (e.g., VCF file) for each sample. It might then be necessary to filter to control the false discovery rate (e.g., some SNVs might be due to sequencing error). On the other hand, generating a multi-FASTA file seems like overkill unless you need it for other purposes — much more efficient to have a single reference FASTA, and then a VCF file for each sample. Let me know if this helps! |
Hi @cwnelson88 thanks for your quick reply! First of all, since I am not sure about the vcf format I used both 1 and 3 (2 and 4 are not recognized). WARNING: Please be aware that the product nsp10 does not begin with a START (ATG) codon at site 13025, but rather GCT.If this was unexpected, please check your annotations.and so on. I wondering whether you have a demo data to test whether my installation (just git clone) is fine and how to interpret my results. As far as you know the vcf from VarScan is 1 or 3 version? Thanks! |
Unfortunately, the VCF format depends on settings and I am not sure which format is output. There is example data in the example data directory, but it may or may not have the exact VCF type you've produced. If you would like me to troubleshoot, I will need example input files (fasta / gtf / vcf) and the exact command you used so I can replicate the error. The WARNING you have received indicates that nsp10 begins with GCT. If that was unexpected, you should investigate it (most but not all genes start with ATG). I suggest checking the reference sequence to make sure it begins with GCT. As an nsp, it may indeed be correct (e.g., if the nsp product is cleaved from a larger gene that itself begins with ATG). |
Sure I can share with you my vcf and gtf of another project, SARS-CoV-2, where the error is from. Could I send you privately by email? Best |
Absolutely! My email is described in the Contact section: https://github.com/chasewnelson/SNPGenie/#contact |
Hi @cwnelson88 did you get my email yesterday? I was getting back email error from you. Best |
Received, @aspitaleri ! I will endeavor to look at this over the weekend and get back to you ASAP. |
Dear @aspitaleri : Thanks again for your patience with me. I have run SNPGenie on your data and it completed running without any problems. However, I do not understand why --vcfformat=1 was used. As described in the section on VCF formats, --vcfformat=1 means that SNPGenie will use AN and AF from the INFO column; but your INFO column does not contain these two (instead, it contains ADP, WT, HET, HOM, and NC). This will result in 0 variation being detected. The SNPGenie VCF format that is most similar to your file is --vcfformat=4, because your data have AD and DP in the FORMAT column. However, note that SNPGenie's VCF format 4 requires , in the AD entry. Unfortunately, your format has only (in AD), and (in RD) immediately before it (i.e., :, represented by RD:AD). Thus, for example, the very first record in your VCF file says: 0/1:90:1631:1585:1551:34:2.15%:8.8511E-10:48:41:952:599:26:8 but SNPGenie --vcfformat=4 needs: 0/1:90:1631:1585:1551:1551,34:2.15%:8.8511E-10:48:41:952:599:26:8 This is an unfortunate byproduct of the millions of different VCF formats out there. One quick solution would be to write a script that simply duplicates the RD value into the next datum and introduces a comma afterward, converting the "RD:AD" chunk of the sample column into "RD:RD,AD". This could be done using the command line (e.g., sed); a scripting language like Perl or Python; or R. (I apologize that I do not have time to implement variations on the available input formats.) In terms of warnings, the only warnings I get are that the nsp's don't start with START codons — and indeed, they shouldn't (for example, see their sequences at NCBI). So that part is not a problem. Let me know if that helps! |
Hi |
Hi @cwnelson88 |
Greeting @aspitaleri, and sorry I missed this. Indeed, product_results.txt contains the product (whole CDS)-level results for each uniquely named CDS record in the GTF file. Let me know if that's your question. |
yes indeed! |
Hi
I have read your exhaustive introduction and paper and I think it can be useful for my purpose. My interest is to study Ecoli within-host diversity for each sample. Basically, each sample is a fastq and after proper analysis, I found that there are major and minor variants for the Ecoli, indicating a population of same Ecoli but different strains. My idea is to define the diversity per sample, i.e. nucleotide diversity, and the compare the diversity between samples.
Said so, my idea is to generate a multi fasta file containing the sequences per each Ecoli strains coming from the same host, in order to get the nucleotide diversity per host. Then I will create a matrix nxn hosts and run some other analysis.
In order to use SNPGenie, as far as I understood I should use for this within-group analysis, right? Does my approach make sense to you?
The text was updated successfully, but these errors were encountered: