Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analysing within-host diversity #44

Open
aspitaleri opened this issue Mar 12, 2021 · 12 comments
Open

analysing within-host diversity #44

aspitaleri opened this issue Mar 12, 2021 · 12 comments

Comments

@aspitaleri
Copy link

Hi
I have read your exhaustive introduction and paper and I think it can be useful for my purpose. My interest is to study Ecoli within-host diversity for each sample. Basically, each sample is a fastq and after proper analysis, I found that there are major and minor variants for the Ecoli, indicating a population of same Ecoli but different strains. My idea is to define the diversity per sample, i.e. nucleotide diversity, and the compare the diversity between samples.
Said so, my idea is to generate a multi fasta file containing the sequences per each Ecoli strains coming from the same host, in order to get the nucleotide diversity per host. Then I will create a matrix nxn hosts and run some other analysis.
In order to use SNPGenie, as far as I understood I should use for this within-group analysis, right? Does my approach make sense to you?

@singing-scientist
Copy link
Contributor

Greetings, @aspitaleri! Thanks for the question and for using SNPGenie. The approach you describe does correspond to the "within-group" analysis. However, which approach to use depends on several factors, and it seems to me the "pooled within-host approach" i.e. the original snpgenie.pl is ideal here. If your samples are high-coverage (enough to confidently call within-sample single nucleotide variants), and the reads can be aligned to a common reference genome, then you can use SNP calling software to generate a SNP report (e.g., VCF file) for each sample. It might then be necessary to filter to control the false discovery rate (e.g., some SNVs might be due to sequencing error). On the other hand, generating a multi-FASTA file seems like overkill unless you need it for other purposes — much more efficient to have a single reference FASTA, and then a VCF file for each sample. Let me know if this helps!

@aspitaleri
Copy link
Author

Hi @cwnelson88 thanks for your quick reply!
I have tried as you suggested the following:
perl $SOFTW/SNPGenie/snpgenie.pl --vcfformat=3 --snpreport=file.vcf --fastafile=ref.fasta
where file.vcf is from
samtools mpileup -f ../MN908947.3.fasta 31_S52.sort.nodup.bam > file.mpileup
java -jar ~/bin/VarScan.v2.4.4.jar mpileup2snp file.mpileup --output-vcf 1 --min-var-freq 0.01 > file.vcf

First of all, since I am not sure about the vcf format I used both 1 and 3 (2 and 4 are not recognized).
Both options gave similar output, I mean 0 in population_summary.txt, product_results.txt and codon_results.txt, except the column N_sites and S_sites. I think something wrong in my inputs. I got warnings about the start codon like:

WARNING: Please be aware that the product nsp10 does not begin with a START (ATG) codon at site 13025, but rather GCT.

If this was unexpected, please check your annotations.

and so on.

I wondering whether you have a demo data to test whether my installation (just git clone) is fine and how to interpret my results. As far as you know the vcf from VarScan is 1 or 3 version?

Thanks!

@singing-scientist
Copy link
Contributor

Unfortunately, the VCF format depends on settings and I am not sure which format is output. There is example data in the example data directory, but it may or may not have the exact VCF type you've produced. If you would like me to troubleshoot, I will need example input files (fasta / gtf / vcf) and the exact command you used so I can replicate the error.

The WARNING you have received indicates that nsp10 begins with GCT. If that was unexpected, you should investigate it (most but not all genes start with ATG). I suggest checking the reference sequence to make sure it begins with GCT. As an nsp, it may indeed be correct (e.g., if the nsp product is cleaved from a larger gene that itself begins with ATG).

@aspitaleri
Copy link
Author

Sure I can share with you my vcf and gtf of another project, SARS-CoV-2, where the error is from. Could I send you privately by email? Best

@singing-scientist
Copy link
Contributor

Absolutely! My email is described in the Contact section: https://github.com/chasewnelson/SNPGenie/#contact

@aspitaleri
Copy link
Author

Hi @cwnelson88 did you get my email yesterday? I was getting back email error from you. Best

@singing-scientist
Copy link
Contributor

Received, @aspitaleri ! I will endeavor to look at this over the weekend and get back to you ASAP.

@singing-scientist
Copy link
Contributor

Dear @aspitaleri :

Thanks again for your patience with me. I have run SNPGenie on your data and it completed running without any problems. However, I do not understand why --vcfformat=1 was used. As described in the section on VCF formats, --vcfformat=1 means that SNPGenie will use AN and AF from the INFO column; but your INFO column does not contain these two (instead, it contains ADP, WT, HET, HOM, and NC). This will result in 0 variation being detected.

The SNPGenie VCF format that is most similar to your file is --vcfformat=4, because your data have AD and DP in the FORMAT column. However, note that SNPGenie's VCF format 4 requires , in the AD entry. Unfortunately, your format has only (in AD), and (in RD) immediately before it (i.e., :, represented by RD:AD). Thus, for example, the very first record in your VCF file says:

0/1:90:1631:1585:1551:34:2.15%:8.8511E-10:48:41:952:599:26:8

but SNPGenie --vcfformat=4 needs:

0/1:90:1631:1585:1551:1551,34:2.15%:8.8511E-10:48:41:952:599:26:8

This is an unfortunate byproduct of the millions of different VCF formats out there. One quick solution would be to write a script that simply duplicates the RD value into the next datum and introduces a comma afterward, converting the "RD:AD" chunk of the sample column into "RD:RD,AD". This could be done using the command line (e.g., sed); a scripting language like Perl or Python; or R. (I apologize that I do not have time to implement variations on the available input formats.)

In terms of warnings, the only warnings I get are that the nsp's don't start with START codons — and indeed, they shouldn't (for example, see their sequences at NCBI). So that part is not a problem.

Let me know if that helps!
Chase

@aspitaleri
Copy link
Author

Hi
yes I did with vim and it worked! thanks! I will go through it and I will let you know what I get if you wish!
Best

@aspitaleri
Copy link
Author

Hi @cwnelson88
things are going fine and I am generating data. One question: product_results.txt contains the output from the different loci as reported in the gtf file? I mean, the various pi, dN, dS, and son on.

@singing-scientist
Copy link
Contributor

Greeting @aspitaleri, and sorry I missed this. Indeed, product_results.txt contains the product (whole CDS)-level results for each uniquely named CDS record in the GTF file. Let me know if that's your question.

@aspitaleri
Copy link
Author

yes indeed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants