Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions about the principle of computation #78

Open
YangJingqii opened this issue Nov 19, 2024 · 7 comments
Open

Questions about the principle of computation #78

YangJingqii opened this issue Nov 19, 2024 · 7 comments

Comments

@YangJingqii
Copy link

Hi,
I have some questions when using these scripts.

  1. What are the functions of bootstrap in snpgenie_within_group.pl and SNPGenie_sliding_windows.R respectively, why do they need to be performed twice? How are they calculated? And when I run./SNPGenie_sliding_windows.R G_codon.txt N S 10 1 1000 100 NONE 6 > sw_G.out, I get an error
Warning messages:
1: Unknown or uninitialised column: `codon_num`. 
2: Unknown or uninitialised column: `codon_num`. 
Error in `filter()`:
ℹ In argument: `num_defined_seqs >= MIN_DEFINED_CODONS`.
Caused by error:
! object 'num_defined_seqs' not found
Backtrace:
     ▆
  1. ├─dplyr::filter(codon_data, num_defined_seqs >= MIN_DEFINED_CODONS)
  2. ├─dplyr:::filter.data.frame(codon_data, num_defined_seqs >= MIN_DEFINED_CODONS)
  3. │ └─dplyr:::filter_rows(.data, dots, by)
  4. │   └─dplyr:::filter_eval(...)
  5. │     ├─base::withCallingHandlers(...)
  6. │     └─mask$eval_all_filter(dots, env_filter)
  7. │       └─dplyr (local) eval()
  8. └─base::.handleSimpleError(...)
  9.   └─dplyr (local) h(simpleError(msg, call))
 10.     └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
Execution halted
  1. Why was snpgenie.pl in the deep sequencing sample not bootstrap?
    3.Designed for use with sequence data that outsizes what can be handled by other software platforms, I was wondering when does it not apply? Does calculating dn/ds without using phylogenetic tree create inaccuracies?
  2. What is the difference between πN/ πS and dN/dS calculated in snpgenie. I have checked Within-host nucleotide diversity of virus populations: Insights from next-generation sequencing No answer can be found, what is the calculation formula
  3. What is the basis for the calculation of dn/ds in snpgenie_within_group.pl?

I realize I have raised quite a number of questions, and I truly appreciate your patience and expertise in helping me understand these important aspects of the analysis.
Thank you very much!

@singing-scientist
Copy link
Contributor

Greetings, @YangJingqii ! Thanks for the questions.

  1. snpgenie_within_group calculates the π values for each codon, whereas the bootstrap script performs the statistical test of πN=πS much more efficiently. If I did it all over again I'd probably use Python and combine it all into one, but unfortunately this two-step process is what I have. Regarding the error, please feel free to provide a minimum reproducible example so I can address it.
  2. snpgenie.pl doesn't itself bootstrap because it's a computationally intensive second step (see 1) but also because it's often not necessary with intrahost samples. The reason is that the sample rather than the codon is the unit of observation. Thus, because each sample has both a πN and πS, you can do standard statistics like a paired Wilcoxon test of πN = πS. I'm not sure what the second part of the question is asking, but using trees can be both more powerful when done well, but also assumes the tree is correct and therefore can also be more subject to error. Usually with intrahost samples there's so little variation as to make trees worthless.
  3. πN/πS and mean dN/dS are the same basic calculation, except sometimes a correction like Jukes-Cantor is applied for the former. Such correction is not necessary for the very low levels of divergence seen in intrahost samples. If you're instead asking about dN and dS vs. the reference sequence, this is no longer all pairwise comparisons, but instead all sequence reads vs. the reference sequence. It's a measure of divergence (from the REF) rather than diversity.
  4. I don't understand the question, what is the basis — what is meant by 'basis'?

Thanks!
Chase

@YangJingqii
Copy link
Author

snpgenie_within_group calculates the π values for each codon,I want to know if we can calculate piN/piS value for each codon. However, I'm uncertain whether the piN/piS value should be 0 or undefined when S_sites = 0, N_diffs = 0, and N_sites = 3 for a codon.
image


@singing-scientist
Copy link
Contributor

@YangJingqii thank you, it is possible to calculate πN/πS for individual codons, but there can be substantial uncertainty, and in the extreme many codons will have πS=0 or πS undefined. In the instance you show, πS is undefined, and πN/πS is undefined as well. Depending on whether you're just offering descriptive statistics or trying to test a specific hypothesis, other approaches with πN/πS (i.e. identifying a priori meaningful groups of codons which can be bootstrapped) or even dN/dS may be necessary.

@YangJingqii
Copy link
Author

Thank you for your response very much!
We know the formula for calculating diversity is
image.
If I only want to calculate the diversity of variants with a frequency lower than 50%, for example, at a certain site where the reference is C (count 470), one alt is A (count 15), and another alt is T (count 800), in the case of variants with a frequency below 50%, T would not be included. If I directly delete this row, will snpgenie mistakenly count these 800 reads as reference C, resulting in an incorrect calculation? If I want to calculate the diversity of only those variants with a frequency below 50%, can snpgenie be used? What form should I use for this(In the second table, the count of T is subtracted from the coverage. In the third table, the count of T and C is subtracted from the coverage. )? Or is it that snpgenie can only calculate the diversity of all variants at this site and cannot specify just a subset (since reference reads are always included, and the reference reads are often greater than 50%)? If there is only one low-frequency variant at a site, is the diversity for that site considered to be 1? Or does it have to be calculated together with the reference?However, in the paper "Within-host genetic diversity of SARS-CoV-2 lineages in unvaccinated and vaccinated individuals", snpgenie was used to calculate the diversity of variants below 50%, which confuses me. Could you please clarify? Thank you very much!

<style> </style>
Reference Position Type Reference Allele Count Coverage Frequency Overlapping annotations
2290 SNV C A 15 1285 1.167315175 CDS: N
<style> </style>
Reference Position Type Reference Allele Count Coverage Frequency Overlapping annotations    
2290 SNV C A 15 485 1.167315175 CDS: N    
<style> </style>
Reference Position Type Reference Allele Count Coverage Frequency Overlapping annotations
2290 SNV C A 15 15 1.167315175 CDS: N

@singing-scientist
Copy link
Contributor

Hello @YangJingqii !

Regarding the possibility of ignoring certain variant frequencies for calculating π, this is no longer π. In other words, π is a property of a site, not a property of particular variants. (The one exception is excluding certain variants or frequencies which you suspect to be sequencing errors, e.g., AF<1%.) What you are suggesting is to calculate the mean number of pairwise differences among only a subset of variants, which is not customary and would probably be difficult to interpret. It's always possible to invent a new metric, but I'm not sure of your goal or what the meaning of the resultant metric would be.

Regarding the paper you cite, I am not a coauthor so I do not know what the authors mean by only calculating variation among variants lower than 50%. Looking at the paper, I don't see a statement like that in the main text. Could you point out which part you are referring to?

Chase

@YangJingqii
Copy link
Author

Thank you for your response very much!This part of the article is mainly about these two sentences:

For determining reliable within-host mutations, quality filtering steps were developed and validated using technical control samples (Methods). We identified 28,353 iSNVs, with allele frequency between 2.5% and 50%, at 13,608 sites from 2560 (90.8%) samples.

Selection analysis
The nucleotide diversity can be separately calculated for adjusted synonymous and non-synonymous changes in coding regions. We calculated the
and
in this study using SNPGenie46 with self-curated input vcf files based on the above identified iSNVs.

@singing-scientist
Copy link
Contributor

Interesting! Not sure about that. If you want to know what they did you may have to ask the authors. But the correct thing is to include all iSNVs which pass QC, regardless of frequency, in the π calculation. For example, there is technically no difference between a C>T iSNV that has AF=5% and a T>C iSNV that has AF=95%; the only difference is which allele is considered REF, and which is considered ALT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants