analysing within-host diversity #44

aspitaleri · 2021-03-12T00:50:55Z

Hi
I have read your exhaustive introduction and paper and I think it can be useful for my purpose. My interest is to study Ecoli within-host diversity for each sample. Basically, each sample is a fastq and after proper analysis, I found that there are major and minor variants for the Ecoli, indicating a population of same Ecoli but different strains. My idea is to define the diversity per sample, i.e. nucleotide diversity, and the compare the diversity between samples.
Said so, my idea is to generate a multi fasta file containing the sequences per each Ecoli strains coming from the same host, in order to get the nucleotide diversity per host. Then I will create a matrix nxn hosts and run some other analysis.
In order to use SNPGenie, as far as I understood I should use for this within-group analysis, right? Does my approach make sense to you?

singing-scientist · 2021-03-12T06:32:11Z

Greetings, @aspitaleri! Thanks for the question and for using SNPGenie. The approach you describe does correspond to the "within-group" analysis. However, which approach to use depends on several factors, and it seems to me the "pooled within-host approach" i.e. the original snpgenie.pl is ideal here. If your samples are high-coverage (enough to confidently call within-sample single nucleotide variants), and the reads can be aligned to a common reference genome, then you can use SNP calling software to generate a SNP report (e.g., VCF file) for each sample. It might then be necessary to filter to control the false discovery rate (e.g., some SNVs might be due to sequencing error). On the other hand, generating a multi-FASTA file seems like overkill unless you need it for other purposes — much more efficient to have a single reference FASTA, and then a VCF file for each sample. Let me know if this helps!

aspitaleri · 2021-03-12T19:07:00Z

Hi @cwnelson88 thanks for your quick reply!
I have tried as you suggested the following:
perl $SOFTW/SNPGenie/snpgenie.pl --vcfformat=3 --snpreport=file.vcf --fastafile=ref.fasta
where file.vcf is from
samtools mpileup -f ../MN908947.3.fasta 31_S52.sort.nodup.bam > file.mpileup
java -jar ~/bin/VarScan.v2.4.4.jar mpileup2snp file.mpileup --output-vcf 1 --min-var-freq 0.01 > file.vcf

First of all, since I am not sure about the vcf format I used both 1 and 3 (2 and 4 are not recognized).
Both options gave similar output, I mean 0 in population_summary.txt, product_results.txt and codon_results.txt, except the column N_sites and S_sites. I think something wrong in my inputs. I got warnings about the start codon like:

WARNING: Please be aware that the product nsp10 does not begin with a START (ATG) codon at site 13025, but rather GCT.

If this was unexpected, please check your annotations.

and so on.

I wondering whether you have a demo data to test whether my installation (just git clone) is fine and how to interpret my results. As far as you know the vcf from VarScan is 1 or 3 version?

Thanks!

singing-scientist · 2021-03-15T06:42:10Z

Unfortunately, the VCF format depends on settings and I am not sure which format is output. There is example data in the example data directory, but it may or may not have the exact VCF type you've produced. If you would like me to troubleshoot, I will need example input files (fasta / gtf / vcf) and the exact command you used so I can replicate the error.

The WARNING you have received indicates that nsp10 begins with GCT. If that was unexpected, you should investigate it (most but not all genes start with ATG). I suggest checking the reference sequence to make sure it begins with GCT. As an nsp, it may indeed be correct (e.g., if the nsp product is cleaved from a larger gene that itself begins with ATG).

aspitaleri · 2021-03-15T08:02:42Z

Sure I can share with you my vcf and gtf of another project, SARS-CoV-2, where the error is from. Could I send you privately by email? Best

singing-scientist · 2021-03-16T06:40:22Z

Absolutely! My email is described in the Contact section: https://github.com/chasewnelson/SNPGenie/#contact

aspitaleri · 2021-03-18T13:18:10Z

Hi @cwnelson88 did you get my email yesterday? I was getting back email error from you. Best

singing-scientist · 2021-03-19T08:36:09Z

Received, @aspitaleri ! I will endeavor to look at this over the weekend and get back to you ASAP.

singing-scientist · 2021-03-24T05:41:04Z

Dear @aspitaleri :

Thanks again for your patience with me. I have run SNPGenie on your data and it completed running without any problems. However, I do not understand why --vcfformat=1 was used. As described in the section on VCF formats, --vcfformat=1 means that SNPGenie will use AN and AF from the INFO column; but your INFO column does not contain these two (instead, it contains ADP, WT, HET, HOM, and NC). This will result in 0 variation being detected.

The SNPGenie VCF format that is most similar to your file is --vcfformat=4, because your data have AD and DP in the FORMAT column. However, note that SNPGenie's VCF format 4 requires , in the AD entry. Unfortunately, your format has only (in AD), and (in RD) immediately before it (i.e., :, represented by RD:AD). Thus, for example, the very first record in your VCF file says:

0/1:90:1631:1585:1551:34:2.15%:8.8511E-10:48:41:952:599:26:8

but SNPGenie --vcfformat=4 needs:

0/1:90:1631:1585:1551:1551,34:2.15%:8.8511E-10:48:41:952:599:26:8

This is an unfortunate byproduct of the millions of different VCF formats out there. One quick solution would be to write a script that simply duplicates the RD value into the next datum and introduces a comma afterward, converting the "RD:AD" chunk of the sample column into "RD:RD,AD". This could be done using the command line (e.g., sed); a scripting language like Perl or Python; or R. (I apologize that I do not have time to implement variations on the available input formats.)

In terms of warnings, the only warnings I get are that the nsp's don't start with START codons — and indeed, they shouldn't (for example, see their sequences at NCBI). So that part is not a problem.

Let me know if that helps!
Chase

aspitaleri · 2021-03-24T06:47:46Z

Hi
yes I did with vim and it worked! thanks! I will go through it and I will let you know what I get if you wish!
Best

aspitaleri · 2021-03-29T14:03:31Z

Hi @cwnelson88
things are going fine and I am generating data. One question: product_results.txt contains the output from the different loci as reported in the gtf file? I mean, the various pi, dN, dS, and son on.

singing-scientist · 2021-04-21T06:40:42Z

Greeting @aspitaleri, and sorry I missed this. Indeed, product_results.txt contains the product (whole CDS)-level results for each uniquely named CDS record in the GTF file. Let me know if that's your question.

aspitaleri · 2021-04-21T10:47:25Z

yes indeed!

qianxuans mentioned this issue Jan 30, 2023

Need help to determine method for inference of convergent evolution #66

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

analysing within-host diversity #44

analysing within-host diversity #44

aspitaleri commented Mar 12, 2021

singing-scientist commented Mar 12, 2021

aspitaleri commented Mar 12, 2021

singing-scientist commented Mar 15, 2021

aspitaleri commented Mar 15, 2021

singing-scientist commented Mar 16, 2021

aspitaleri commented Mar 18, 2021

singing-scientist commented Mar 19, 2021

singing-scientist commented Mar 24, 2021

aspitaleri commented Mar 24, 2021

aspitaleri commented Mar 29, 2021

singing-scientist commented Apr 21, 2021

aspitaleri commented Apr 21, 2021

analysing within-host diversity #44

analysing within-host diversity #44

Comments

aspitaleri commented Mar 12, 2021

singing-scientist commented Mar 12, 2021

aspitaleri commented Mar 12, 2021

WARNING: Please be aware that the product nsp10 does not begin with a START (ATG) codon at site 13025, but rather GCT.

If this was unexpected, please check your annotations.

singing-scientist commented Mar 15, 2021

aspitaleri commented Mar 15, 2021

singing-scientist commented Mar 16, 2021

aspitaleri commented Mar 18, 2021

singing-scientist commented Mar 19, 2021

singing-scientist commented Mar 24, 2021

aspitaleri commented Mar 24, 2021

aspitaleri commented Mar 29, 2021

singing-scientist commented Apr 21, 2021

aspitaleri commented Apr 21, 2021