Conflicting SNP Report formats detected. #75
Comments
Greetings @bturne48 ! Thanks very much for the question. The first issue I see is that the FORMAT/sample data differs from line to line. For example, the first record here has GT:DP:AD, whereas the second has GT:PL:DP:AD:GQ. SNPGenie expects all lines to be formatted identically, and I suspect that is the first reason (but possibly not the only reason) for an error. After ensuring that all lines have identically formatted data, perhaps give it another try and let me know if it works as expected? Happy to troubleshoot further. Yours, |
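One quick way to check for the inconsistency described above is to tally the distinct FORMAT strings in the file. The following is a minimal sketch, not part of SNPGenie; it assumes a tab-delimited VCF whose ninth column is FORMAT, and the function name and the example records are hypothetical.

```python
from collections import Counter


def format_field_counts(vcf_lines):
    """Count how often each FORMAT string (9th VCF column) occurs."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header/meta lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 8:
            counts[fields[8]] += 1
    return counts


# Hypothetical records mirroring the two FORMAT layouts mentioned above.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP:AD\t0/1:20:12,8\n",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT:PL:DP:AD:GQ\t0/1:60,0,80:25:15,10:60\n",
]
counts = format_field_counts(example)
for fmt, n in counts.items():
    print(fmt, n)
```

If more than one FORMAT string is reported, the records are not identically formatted. For a real file, pass an open file handle instead of the example list.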
Hey Chase, Thanks so much for the help. I parsed the VCF to exclude invariant sites, so every line should have identical tags now. While this did not fix my original error, I did notice that my sample names contain forward slashes ("/") because they are the paths to the corresponding BAM files, and I believe that this was causing the issue. Example:
When SNPGenie goes to create temporary VCF files for each sample from my multi-sample VCF, it was turning them into temporary directories because of the slashes in the sample names.
Sorry for the trouble. Might have more questions later, but I can open a new ticket for those if that is preferable! |
Oh wow! Great catch, I haven't run into that before. Agreed, just do something like remove the base path (i.e., 'results/' here) manually (FIND/REPLACE) if you can, or something similar. Let me know how it goes and don't hesitate to reach out if there are other questions! Chase |
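As an alternative to a manual find/replace, the path prefix can be stripped from the sample names in the `#CHROM` header line with a short script. This is a sketch under assumptions: the function name and the example sample names are hypothetical, and it assumes the sample columns start at the tenth field, as in a standard VCF header.

```python
import posixpath


def simplify_sample_names(header_line):
    """Drop directory components from the sample names in a #CHROM line."""
    fields = header_line.rstrip("\n").split("\t")
    # The first nine columns are fixed VCF columns; the rest are samples.
    samples = [posixpath.basename(name) for name in fields[9:]]
    return "\t".join(fields[:9] + samples) + "\n"


# Hypothetical header with path-like sample names.
header = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tresults/sample1.bam\tresults/sample2.bam\n"
)
fixed = simplify_sample_names(header)
print(fixed, end="")
```

Note that two samples stored in different directories under the same file name would collide after this step, so it is worth checking that the simplified names are still unique.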
Hi Chase, Just throwing this question on this thread since it isn't closed yet. Let me know if you'd prefer to close this and I can communicate with you somewhere else. If I have 50 pooled samples from one population, is it appropriate to run SNPgenie on each of these pooled samples individually, and merge results into one population summary report at the end? I have separated out input by chromosome (as required), however runtime is still pretty long, so I have separated the VCF input by sample as well, and just wanted to make sure that works with the assumptions of SNPgenie. Thanks again! |
Hello! Totally fine to continue here, as the conversation may benefit others in the future; conversely, if the questions deal with a confidential study, please feel free to email.
Well, there is nothing inappropriate about it — it all depends on the question or hypothesis you're addressing. A summary of all populations would result in a metric describing, if you will, the meta-population, and may or may not be what you're interested in characterizing.
Great question! Yes, that is totally appropriate, i.e., results for each population (sample) are independent. Thus, the problem is an 'embarrassingly parallel' one, and running SNPGenie separately for each sample and chromosome is both an effective and appropriate way to speed up the runtime. Let me know if that helps! |
Great. Thanks for all the help! I think I should only have one more question. I have combined the results files for all my samples into one file (population_summary.txt); however, I now see that I need to run SNPGenie twice if I want to use the '-' strand from the GTF. If I wanted to combine the results of these two runs from the + and - strands, do you have any suggestions? Or is that not really possible? Or, if certain fields like PiN/PiS could be combined while others could not, that info would be useful. Thanks!
You bet! This is where a deeper understanding of pi becomes necessary, and some downstream data science (e.g. with R) may be handy. If you have results from both strands of the same sample, they can indeed be combined into a single summary statistic. For example, the overall value of piN for the sample would be (N_diffs_strand1 + N_diffs_strand2) / (N_sites_strand1 + N_sites_strand2). Likewise for piS. This simple calculation works for individual codons, whole ORFs, or even the full protein-coding genome. There are virtually limitless analyses or comparisons one can do, so it's important to form precise hypotheses ahead of time and focus on what comparisons/metrics are appropriate for addressing them. Let me know if that makes sense! Chase |
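The pooled calculation described above can be sketched as follows. The function name and the numbers are hypothetical; only the formula (summed differences divided by summed sites) comes from the comment.

```python
def combined_pi(diffs_by_strand, sites_by_strand):
    """Pool per-strand results: total differences over total sites."""
    total_sites = sum(sites_by_strand)
    if total_sites == 0:
        # No sites of this type on either strand: pi is undefined.
        return float("nan")
    return sum(diffs_by_strand) / total_sites


# Hypothetical N_diffs and N_sites values for the + and - strand runs.
pi_n = combined_pi(diffs_by_strand=[1.5, 0.5], sites_by_strand=[300.0, 100.0])
print(pi_n)
```

The same function applies to piS by passing the S_diffs and S_sites values. Summing numerators and denominators before dividing weights each strand by its number of sites, which is different from averaging the two per-strand pi values.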
Hey Chase, thanks for all your help so far. I was wondering if I could run this past you; totally cool if not. I had a question regarding my results. I merged my results (+ and - strand) and was working through them. Below I have both strands for 2 of my samples. Do you also find it suspect that there are far more nonsynonymous mutations than synonymous mutations? I am worried that I did something wrong with the input because of the magnitude of the difference between the two. From the literature I thought that synonymous mutations should be far more common, but I am unsure. Thank you so much for your help.
Greetings @bturne48 — no worries! I'm not exactly sure how you're concluding there are more nonsynonymous mutations, are you counting them somehow? Regardless, 3nonsyn:1syn is actually approximately the neutral expectation, i.e., ~75% of random mutations will be nonsynonymous given the nature of the genetic code. One place this is displayed is Graur & Li, Fundamentals of Molecular Evolution, 2nd Edition, Table 1.5 "Relative frequencies of different types of mutational substitutions in random protein-coding genes". I think of 75% nonsynonymous as being approximately dN/dS = 1. Let me know if that helps :-> |
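The roughly 3:1 expectation can be checked by enumerating every single-nucleotide change in the standard genetic code. This is a simplified sketch, not the calculation from Graur & Li: it assumes all sense codons and all base changes are equally likely, and it counts changes to a stop codon as nonsynonymous.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_single_base_changes():
    """Count synonymous and nonsynonymous single-base changes over sense codons."""
    syn = nonsyn = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # start only from sense codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == aa:
                    syn += 1
                else:
                    nonsyn += 1  # includes changes to a stop codon
    return syn, nonsyn


syn, nonsyn = count_single_base_changes()
print(f"nonsynonymous fraction: {nonsyn / (syn + nonsyn):.3f}")
```

Under these assumptions the nonsynonymous fraction comes out at about three quarters, consistent with the ~75% figure quoted above.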
Thanks for the speedy reply, Chase. I was using the fields "N_sites" and "S_sites" for the example above, and noticing that there were (as you said) about 3x as many nonsynonymous sites. Is this a misinterpretation of what those fields are reporting?
Also thank you for the literature, incredibly helpful!! |
You're most welcome! Yes, that is a misinterpretation. N_sites is the number of nonsynonymous sites in the sequence, not the number of nonsynonymous changes. Sites can be thought of as possible changes. Strongly recommend to familiarize with background material on molecular evolution in general and dN/dS in particular before interpreting analysis results, or enlist a molecular evolution collaborator. Great resources are the text I mentioned, or anything by Wen-Hsiung Li, Austin Hughes, Masatoshi Nei, Zvelebil & Baum, Asher D. Cutter, etc. Hope that helps! |
Hi, I am trying to run snpgenie.pl with this command:
and am confused about why I am receiving this error:
## WARNING: Conflicting SNP Report formats detected. Does the file extension match expectation? If not, please contact author. ## SNPGenie TERMINATED.
I have made sure to include the AD and DP tags in both the format and sample columns, and am unsure what to do.
Here are 4 lines from my VCF as an example. I can provide more information! Any help would be amazing, thank you.