Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conflicting SNP Report formats detected. #75

Open
bturne48 opened this issue Sep 11, 2024 · 11 comments
Open

Conflicting SNP Report formats detected. #75

bturne48 opened this issue Sep 11, 2024 · 11 comments

Comments

@bturne48
Copy link

Hi i am trying to run snpgenie.pl with this command:

 snpgenie.pl
--vcfformat=4
--minfreq=0.01
--outdir results/7_Snps/SnpGenie_Dyak_Tai18E2/NW_025048829.1
--snpreport=results/7_Snps/Dyak_Tai18E2.test.NW_025048829.1.vcf
--fastafile=results/7_Snps/Dyak_Tai18E2.test.NW_025048829.1.fa 
--gtffile=/scratch/bturne48/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf 

and am confused why i am receiving this error:

## WARNING: Conflicting SNP Report formats detected. Does the file extension match expectation? If not, please contact author. ## SNPGenie TERMINATED.

I have made sure to include the AD and DP tags in both the format and sample columns, and am unsure what to do.

Here are 4 lines from my vcf as an example. I can provide more information! Any help would be amazing thank you

#CHROM	        POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	(all my sample names are the rest of the columns)
NW_025048829.1	335	.	C	.	20.051	.	DP=151;AD=149;VDB=0.42;SGB=-1.25312;RPBZ=-2.0839;MQBZ=0.977156;MQSBZ=3.49015;BQBZ=-2.21411;SCBZ=1.99133;MQ0F=0.211921;AN=104;DP4=65,84,0,2;MQ=24	GT:DP:AD	./.:0:0	0/0:2:2	0/0:1:1	0/0:1:1	0/0:2:2	0/0:2:2	0/0:3:3	./.:0:0	0/0:4:4	0/0:4:4	0/0:7:7	0/0:1:1	0/0:1:1	0/0:2:2	0/0:2:1	0/0:1:1	0/0:6:6	0/0:1:1	./.:0:0	./.:0:0	./.:0:0	0/0:7:7	0/0:5:5	0/0:13:13	0/0:6:6	./.:0:0	0/0:4:4	0/0:7:7	0/0:5:5	0/0:3:3	0/0:10:10	0/0:1:1	0/0:5:5	0/0:2:2	0/0:1:1	0/0:2:2	0/0:3:3	0/0:3:3	./.:0:0	./.:0:0	0/0:1:0	0/0:2:2	0/0:4:4	./.:0:0	0/0:2:2	./.:0:0	0/0:1:1	0/0:1:1	0/0:2:2	./.:0:0	./.:0:0	0/0:1:1	./.:0:0	0/0:1:1	0/0:4:4	0/0:1:1	./.:0:0	0/0:2:2	./.:0:0	./.:0:0	./.:0:0	./.:0:0	./.:0:0	0/0:1:1	0/0:1:1	0/0:1:1	./.:0:0	./.:0:0	0/0:2:2	0/0:1:1	./.:0:0	./.:0:0	0/0:2:2	0/0:2:2	./.:0:0	./.:0:0	0/0:2:2	./.:0:0
NW_025048829.1	336	.	G	A	76.1591	.	DP=149;AD=143,6;VDB=0.101965;SGB=7.02578;RPBZ=-0.912743;MQBZ=1.82338;MQSBZ=3.63272;BQBZ=1.44056;SCBZ=0.656512;MQ0F=0.194631;AC=3;AN=104;DP4=60,83,4,2;MQ=24	GT:PL:DP:AD:GQ	./.:0,0,0:0:0,0:0	0/0:0,6,38:2:2,0:16	0/0:0,3,27:1:1,0:13	0/0:0,3,8:1:1,0:13	0/0:0,6,40:2:2,0:16	0/0:0,6,66:2:2,0:16	0/0:0,9,79:3:3,0:19	./.:0,0,0:0:0,0:0	0/0:0,12,43:4:4,0:22	0/0:0,12,95:4:4,0:22	0/0:0,21,85:7:7,0:31	0/0:0,3,29:1:1,0:13	0/0:0,3,4:1:1,0:13	0/0:0,6,53:2:2,0:16	0/0:0,6,41:2:2,0:16	0/0:0,3,12:1:1,0:13	0/0:0,18,109:6:6,0:28	0/0:0,3,4:1:1,0:13	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,18,149:6:6,0:28	0/0:0,15,126:5:5,0:25	0/0:0,36,198:12:12,0:46	0/0:0,18,123:6:6,0:28	./.:0,0,0:0:0,0:0	0/1:104,12,0:4:0,4:5	0/0:0,21,100:7:7,0:31	0/0:0,15,114:5:5,0:25	0/0:0,9,83:3:3,0:19	0/0:0,30,157:10:10,0:40	0/1:37,3,0:1:0,1:12	0/0:0,15,58:5:5,0:25	0/0:0,6,57:2:2,0:16	0/0:0,3,37:1:1,0:13	0/0:0,3,4:1:1,0:13	0/0:0,9,90:3:3,0:19	0/0:0,9,84:3:3,0:19	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,3,25:1:1,0:13	0/1:16,0,23:2:1,1:6	0/0:0,12,117:4:4,0:22	./.:0,0,0:0:0,0:0	0/0:0,6,58:2:2,0:16	./.:0,0,0:0:0,0:0	0/0:0,3,37:1:1,0:13	0/0:0,3,27:1:1,0:13	0/0:0,9,67:3:3,0:19	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,3,4:1:1,0:13	./.:0,0,0:0:0,0:0	0/0:0,3,29:1:1,0:13	0/0:0,12,68:4:4,0:22	0/0:0,3,29:1:1,0:13	./.:0,0,0:0:0,0:0	0/0:0,6,66:2:2,0:16	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,3,37:1:1,0:13	0/0:0,3,29:1:1,0:13	0/0:0,3,15:1:1,0:13	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,6,53:2:2,0:16	0/0:0,3,15:1:1,0:13	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,6,47:2:2,0:16	0/0:0,6,56:2:2,0:16	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,6,56:2:2,0:16	./.:0,0,0:0:0,0:0
NW_025048829.1	337	.	G	T	474.423	.	DP=146;AD=117,28;VDB=0.00484544;SGB=14.0114;RPBZ=-2.03359;MQBZ=1.67476;MQSBZ=3.34009;BQBZ=1.10604;SCBZ=0.0465449;MQ0F=0.184932;AC=17;AN=102;DP4=49,68,13,16;MQ=24	GT:PL:DP:AD:GQ	./.:0,0,0:0:0,0:0	0/1:31,6,0:2:0,2:4	0/1:29,3,0:1:0,1:6	0/0:0,3,29:1:1,0:6	0/0:0,6,40:2:2,0:8	0/0:0,6,66:2:2,0:8	0/1:30,0,28:3:1,2:27	./.:0,0,0:0:0,0:0	0/0:0,12,58:4:4,0:14	0/0:0,12,84:4:4,0:14	0/0:0,18,62:6:6,0:19	0/0:0,3,29:1:1,0:6	./.:0,0,0:0:0,0:0	0/0:0,6,47:2:2,0:8	0/0:0,6,41:2:2,0:8	0/1:12,3,0:1:0,1:4	0/0:0,18,124:6:6,0:19	0/0:0,3,4:1:1,0:5	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,18,149:6:6,0:19	0/0:0,15,126:5:5,0:16	0/0:0,36,198:12:12,0:37	0/0:0,18,123:6:6,0:19	./.:0,0,0:0:0,0:0	1/1:104,12,0:4:0,4:5	0/0:0,21,100:7:7,0:22	1/1:114,15,0:5:0,5:7	1/1:83,9,0:3:0,3:3	0/0:0,30,157:10:10,0:31	0/1:37,3,0:1:0,1:6	0/0:0,12,58:4:4,0:14	0/1:57,6,0:2:0,2:4	0/0:0,3,37:1:1,0:6	0/0:0,3,4:1:1,0:5	0/0:0,9,90:3:3,0:11	1/1:84,9,0:3:0,3:3	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,3,27:1:1,0:6	0/0:0,6,47:2:2,0:8	0/0:0,12,109:4:4,0:14	./.:0,0,0:0:0,0:0	0/0:0,6,58:2:2,0:8	./.:0,0,0:0:0,0:0	0/0:0,3,37:1:1,0:6	0/0:0,3,29:1:1,0:6	0/0:3,9,52:3:2,0:8	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,3,4:1:1,0:5	./.:0,0,0:0:0,0:0	0/0:0,3,29:1:1,0:6	0/0:0,12,65:4:4,0:14	0/0:0,3,29:1:1,0:6	./.:0,0,0:0:0,0:0	0/0:0,6,63:2:2,0:8	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,3,37:1:1,0:6	0/0:0,3,29:1:1,0:6	0/1:15,3,0:1:0,1:5	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/1:53,6,0:2:0,2:4	0/1:15,3,0:1:0,1:5	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,6,47:2:2,0:8	0/0:0,6,58:2:2,0:8	./.:0,0,0:0:0,0:0	./.:0,0,0:0:0,0:0	0/0:0,6,51:2:2,0:8	./.:0,0,0:0:0,0:0
NW_025048829.1	338	.	G	.	20.7671	.	DP=143;AD=141;VDB=0.32;SGB=-1.25312;RPBZ=-1.97735;MQBZ=0.0718816;MQSBZ=3.35261;BQBZ=-2.3969;SCBZ=1.89094;MQ0F=0.188811;AN=102;DP4=60,81,0,2;MQ=24	GT:DP:AD	./.:0:0	0/0:2:2	0/0:1:1	0/0:1:1	0/0:2:2	0/0:2:2	0/0:3:3	./.:0:0	0/0:4:4	0/0:4:4	0/0:6:6	0/0:1:1	./.:0:0	0/0:2:2	0/0:2:1	0/0:1:1	0/0:6:6	0/0:1:1	./.:0:0	./.:0:0	./.:0:0	0/0:6:6	0/0:5:5	0/0:12:12	0/0:6:6	./.:0:0	0/0:4:4	0/0:7:7	0/0:5:5	0/0:3:3	0/0:10:10	0/0:1:1	0/0:4:4	0/0:2:2	0/0:1:1	0/0:1:1	0/0:3:3	0/0:3:3	./.:0:0	./.:0:0	0/0:1:1	0/0:2:2	0/0:3:3	./.:0:0	0/0:2:2	./.:0:0	0/0:1:1	0/0:1:1	0/0:3:3	./.:0:0	./.:0:0	0/0:1:1	./.:0:0	0/0:1:1	0/0:4:4	0/0:1:1	./.:0:0	0/0:1:1	./.:0:0	./.:0:0	./.:0:0	./.:0:0	./.:0:0	0/0:1:1	0/0:1:1	0/0:1:1	./.:0:0	./.:0:0	0/0:2:2	0/0:1:1	./.:0:0	./.:0:0	0/0:2:1	0/0:1:1	./.:0:0	./.:0:0	0/0:2:2	./.:0:0

@singing-scientist
Copy link
Contributor

Greetings @bturne48 ! Thanks very much for the question.

The first issue I see is that the FORMAT/sample data differs from line to line. For example, the first record here has GT:DP:AD, whereas the second has GT:PL:DP:AD:GQ. SNPGenie expects all lines to be formatted identically, and I suspect that is the first reason (but possibly not the only reason) for an error.

After ensuring that all lines have identically formatted data, perhaps give it another try and let me know if it works as expected? Happy to troubleshoot further.

Yours,
Chase

@bturne48
Copy link
Author

Hey Chase,

Thanks so much for the help. I parsed the VCF to exclude invariant sites, so every line should have identical tags now.

While this did not fix my og error, I did notice that my sample names have backslashes "/" to their corresponding bam paths. and I believe that this was causing the issue. Example:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT results/sample1.vcf results/sample2.vcf

When snpgenie goes to create temp vcf files for each sample from my multi-sample vcf, it was turning them into temp directories due to the backslashes in the sample names.

Creating temp_vcf4_results/sample1.vcf...
Creating temp_vcf4_results/sample2.vcf...

Sorry for the trouble. Might have more questions later, but I can open a new ticket for those if that is preferable!

@singing-scientist
Copy link
Contributor

Oh wow! Great catch, I haven't run into that before. Agreed, just do something like remove the base path (i.e., 'results/' here) manually (FIND/REPLACE) if you can, or something similar. Let me know how it goes and don't hesitate to reach out if there are other questions!

Chase

@bturne48
Copy link
Author

Hi Chase,

Just throwing this question on this thread since it isn't closed yet. Let me know if you'd prefer to close this and I can communicate with you somewhere else.

If I have 50 pooled samples from one population, is it appropriate to run SNPgenie on each of these pooled samples individually, and merge results into one population summary report at the end?

I have separated out input by chromosome (as required), however runtime is still pretty long, so I have separated the VCF input by sample as well, and just wanted to make sure that works with the assumptions of SNPgenie.

Thanks again!

@singing-scientist
Copy link
Contributor

Hello! Totally fine to continue here as the conversation may benefit others in the future; contrarily if the questions deal with a confidential study, please feel free to email.

If I have 50 pooled samples from one population, is it appropriate to run SNPgenie on each of these pooled samples individually, and merge results into one population summary report at the end?

Well, there is nothing inappropriate about it — it all depends on the question or hypothesis you're addressing. A summary of all populations would result in a metric describing, if you will, the meta-population, and may or may not be what you're interested in characterizing.

I have separated out input by chromosome (as required), however runtime is still pretty long, so I have separated the VCF input by sample as well, and just wanted to make sure that works with the assumptions of SNPgenie.

Great question! Yes, that is totally appropriate, i.e., results for each population (sample) are independent. Thus, the problem is an 'embarrassingly parallel' one, and running SNPGenie separately for each sample and chromosome is both an effective and appropriate way to speed up the runtime.

Let me know if that helps!
Chase

@bturne48
Copy link
Author

bturne48 commented Sep 20, 2024

Great. Thanks for all the help! I think I should only have one more question.

I have combined my results files for all my samples into one combined file (population_summary.txt), however I now see that I need to run snpgenie twice, if I want to use the '-" strand from the gtf.

If I wanted to combine the results of these two runs from the + and - , do you have any suggestions? Or is that not really possible? Or if certain fields like PiN/PiS could be combined, while others could not, that info would be useful.

Thanks!

@singing-scientist
Copy link
Contributor

You bet! This is where a deeper understanding of pi becomes necessary, and some downstream data science (e.g. with R) may be handy. If you have results from both strands of the same sample, they can indeed be combined into a single summary statistic. For example, the overall value of piN for the sample would be (N_diffs_strand1 + N_diffs_strand2) / (N_sites_strand1 + N_sites_strand2). Likewise for piS. This simple calculation works for individual codons, whole ORFs, or even the full protein-coding genome. There are virtually limitless analyses or comparisons one can do, so it's important to form precise hypotheses ahead of time and focus on what comparisons/metrics are appropriate for addressing them. Let me know if that makes sense!

Chase

@bturne48
Copy link
Author

bturne48 commented Oct 7, 2024

Hey Chase, thanks for all your help so far. Was wondering if I could run this past you, totally cool if not.

I had a question regarding my results. I merged my results (+ and - strand) and was working through them. Below I have both strands for 2 of my samples.

Do you also find it suspect that there are far more nonsyn mutations than there are syn mutations. I am worried that I did something wrong with the input because of the magnitude of difference between the two. From literature I thought that syn mutations should be far more common, but I am unsure.

Thank you so much for your help

file	sites	sites_coding	sites_noncoding	pi	pi_coding	pi_noncoding	N_sites	S_sites	piN	piS	mean_dN_vs_ref	mean_dS_vs_ref	mean_gdiv_polymorphic	mean_N_gdiv	mean_S_gdiv	mean_gdiv	sites_polymorphic	mean_gdiv_coding_poly	sites_coding_poly	mean_gdiv_noncoding_poly	sites_noncoding_poly

temp_vcf4_Bata6.vcf	24674056	3828975	20845081	0.000509543234000816	0.000242324893686526	0.00055862782832813	1484236.91090886	468064.411646331	0.000156770247136192	0.000565384430333997	0.00576503943164924	0.038129645360846	0.00105072377206131	0.372450067620597	0.355449746228464	0.000461104533970691	433605	0.016210227842936	52542	0.0276216827708886	381063

temp_vcf4_Bata6.vcf.rev	24674056	3828975	20845081	0.00050954323400082	0.000242324893686526	0.000558627828328135	1426704.39721135	449851.959931503	0.000120107506698998	0.000576139612300811	0.00538052207093723	0.0388692743082501	0.000913547376388415	0.378523020565642	0.357159370777168	0.000461104533970693	433605	0.016210227842936	52542	0.0276216827708884	381063

temp_vcf4_Cameroon_115.vcf	24674056	3828975	20845081	0.000373630307395286	0.00022380637324866	0.000401151049495019	1484262.19714368	468075.001419281	0.000106276274383141	0.000550039710172113	0.00178287420560013	0.017125707390797	0.0020162201341968	0.389704341931958	0.388225560362	0.000339883886651677	191050	0.0365187197546601	21761	0.0448442024712771	169289

temp_vcf4_Cameroon_115.vcf.rev	24674056	3828975	20845081	0.000373630307395287	0.00022380637324866	0.000401151049495023	1426727.8508098	449846.32828684	0.000104598622285654	0.000650252158137782	0.00162771477762549	0.0175616054801894	0.00214333946057503	0.398018828353776	0.398470411902559	0.000339883886651676	191050	0.0365187197546602	21761	0.0448442024712769	169289

@singing-scientist
Copy link
Contributor

Greetings @bturne48 — no worries! I'm not exactly sure how you're concluding there are more nonsynonymous mutations, are you counting them somehow? Regardless, 3nonsyn:1syn is actually approximately the neutral expectation, i.e., ~75% of random mutations will be nonsynonymous given the nature of the genetic code. One place this is displayed is Graur & Li, Fundamentals of Molecular Evolution, 2nd Edition, Table 1.5 "Relative frequencies of different types of mutational substitutions in random protein-coding genes". I think of 75% nonsynonymous as being approximately dN/dS = 1. Let me know if that helps :->

@bturne48
Copy link
Author

bturne48 commented Oct 7, 2024

Thanks for the speedy reply Chase. I was using the fields "N_sites" and "S_sites" for the example above, and noticing that there was (as you said) about 3x as many non-syn sites. Is this a misinterpretation of what those fields are reporting?

sample: ['temp_vcf4_Bata6.vcf', 'temp_vcf4_Cameroon_115.vcf', 'temp_vcf4_CY01A.vcf']
nonsyn: [1484236, 1484262, 1484244]
syn: [468064, 468075, 468087]

sample: ['temp_vcf4_Bata6.vcf.rev', 'temp_vcf4_Cameroon_115.vcf.rev', 'temp_vcf4_CY01A.vcf.rev']
nonsyn: [1426704, 1426727, 1426701]
syn: [449851, 449846, 449862]

Also thank you for the literature, incredibly helpful!!

@singing-scientist
Copy link
Contributor

You're most welcome! Yes, that is a misinterpretation. N_sites is the number of nonsynonymous sites in the sequence, not the number of nonsynonymous changes. Sites can be thought of as possible changes. Strongly recommend to familiarize with background material on molecular evolution in general and dN/dS in particular before interpreting analysis results, or enlist a molecular evolution collaborator. Great resources are the text I mentioned, or anything by Wen-Hsiung Li, Austin Hughes, Masatoshi Nei, Zvelebil & Baum, Asher D. Cutter, etc. Hope that helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants