Set log10bf to max if inf #201
Thanks for the suggestion @jerome-f. The logBFs should always be finite. Do you have an example that produces an infinite logBF? |
I am having a similar problem to this. I have summary stats from a meta-analysis and 1000G EUR as the LD reference. I have used bigsnpr::snp_match to ensure there are no mismatched alleles. I was originally getting an error about non-convergence after 100 iterations. I then removed any SNPs that had low sample size (per-SNP N is quite variable) and this stopped the convergence error. However, I now get cs_log10bf = Inf:
Here are the susie plots, and the diagnostic plot (lambda is 0.52). Any advice on this would be much appreciated. Is it likely that the reference panel is too small, or the meta-analysis too heterogeneous, for SuSiE to work in this case? |
@jmmax If there is any way you could share your data + code, it would help us reproduce the problem of infinite BFs and pinpoint the bug. Regarding your specific fine-mapping results, removing the SNPs furthest away from the diagonal of the diagnostic plot might improve your results.
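One possible way to do that filtering is sketched below (an illustration only, not a recommendation of specific thresholds; z_scores, R_ld and n_gwas are placeholder names for your z-scores, LD matrix and GWAS sample size, and it assumes the kriging_rss() output contains a conditional_dist data frame with a logLR column; check names() on your own output):

library(susieR)
# Compare each observed z-score with its expected value given all the other
# z-scores and the LD matrix (this is what the diagnostic/kriging plot shows).
condz <- kriging_rss(z = z_scores, R = R_ld, n = n_gwas)
condz$plot

# Flag variants whose observed z-score deviates strongly from its expectation
# and refit without them; the thresholds here are only a starting point.
flagged <- which(condz$conditional_dist$logLR > 2 & abs(z_scores) > 2)
if (length(flagged) > 0)
  fit <- susie_rss(z = z_scores[-flagged], R = R_ld[-flagged, -flagged], n = n_gwas)
|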
If you are using in-sample LD then I would expect the observed and expected
z-scores to match exactly, not just approximately.
(Am I correct @pcarbo?)
Matthew
On Mon, Jul 8, 2024 at 7:52 AM oleg2153 wrote:
Hi Peter,
I am asking here since my question is related to the infinite log10bf.
I am running susieR in a region with several very significant variants
(p<1e-5000). As a result, I am obtaining 10 credible sets with an infinite
"cs_log10bf" in 9 out of 10 sets (and each set consisted of a single
variable). I used a completely matched in-sample LD matrix (same
individuals, same variants). The genotypes are well-imputed and there are
no missing values. I checked the sanity of my input data according to the
SuSiE recommendations; the plot is attached - it looks good, I assume?
Screenshot.from.2024-07-08.14-30-28.png:
<https://github.com/stephenslab/susieR/assets/45593402/9190455b-9976-4b96-80a8-dd08055f891f>
I ran susie_rss with verbose=TRUE. I saw that the "objective" in the
output usually "converges" at some value quickly, e.g.:
[1] "objective:-56012.4505928347"
[1] "objective:-56009.8247200001"
[1] "objective:-56009.8188294932"
[1] "objective:-56009.818824447"
In my analysis, it was:
[1] "objective:-48181.7247065402"
...
[1] "objective:-17785.0571885817"
[1] "objective:-17564.5765219636"
[1] "objective:-17345.7019319673"
IBSS algorithm did not converge in 100 iterations!
I did some follow-up tests: I multiplied the standard errors of the GWAS
effect sizes by 10 and re-calculated the z-scores. In this case, susieR quickly
converged and produced a reasonable summary (1 credible set with
cs_log10bf=50 and 30 variables). When trying "SE*3", I got 2 credible sets:
one had infinite cs_log10bf but still 11 variables in it. With "SE*2", I
got four credible sets, two of them with a single variable and infinite
cs_log10bf.
Could it be that the infinite "cs_log10bf" is related to the very strong
significance (i.e., very high z-scores)? If so, what could be a solution to
make susieR produce finite "cs_log10bf" (tweak some options)? If not, maybe
there are other things I could try?
Thanks,
Oleg
|
Yes, if everything is the same. See here (https://stephenslab.github.io/susieR/articles/susierss_diagnostic.html) for an example. In any case the kriging plot looks pretty good (if not exactly on the diagonal) — I don't see major cause for concern. Since the associations are very significant, it is possible that the effects are quite large, and perhaps fixing the prior variance (see the "prior_variance" parameter) would improve the behaviour of IBSS. It is worth a try, IMO.
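For example, one could try fixing the prior variance along these lines (a rough sketch, not code from this thread; z, R and n are placeholders for your z-scores, LD matrix and sample size, the value 50 is only a starting point to experiment with, and estimate_prior_variance is assumed to be passed through to the underlying fitting routine):

# Fix the prior variance of the effects instead of estimating it, which may
# stabilize IBSS when the effects are very strong.
fitted <- susie_rss(z = z, R = R, n = n,
                    prior_variance = 50,
                    estimate_prior_variance = FALSE)
|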
I was concerned that if z-scores are very large then even small deviations
in the kriging plot could be a problem.
|
Yes, that is possible. I don't think we have checked this. |
Thanks a lot, Matthew and Peter!
Will keep you posted! Cheers, |
Hi @oborisov thanks for sharing. Because your z-scores are so large, this seems very much like an "ill-conditioned problem" where small numerical errors can lead to large differences. (There could be a simpler explanation that hasn't occurred to me.) In any case this is a situation we haven't studied carefully, so this is interesting. If you could (somehow) share a small data set reproducing these errors/warnings, I'd be happy to take a look and perhaps I can pinpoint the cause. |
Thanks Peter! I simulated some data that I hope can help to reproduce the case above. I used the following shell code ("plink" corresponds to plink 1.9 (https://www.cog-genomics.org/plink/1.9/) and plink2 corresponds to plink 2.0 (https://www.cog-genomics.org/plink/2.0/))
It simulated 1 variant in 100,000 individuals with a very low p-value (p=4.55636e-6247). The simulated matrix of genotypes and phenotypes is attached (sim2.raw.gz). Then I used the following R code
Correlation matrix (first 4 variables)
Finally, I ran fine-mapping
Warning message:
Variables in credible sets: (columns: variable, variable_prob, cs)
Credible sets summary: (columns: cs, cs_log10bf, cs_avg_r2, cs_min_r2, variable)
Thanks again,
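The plink commands and R code referenced above were not captured here; purely as an illustrative stand-in (not the original simulation), an association of comparable extremity can be generated directly in R:

# Simulate one variant with an extremely strong effect in 100,000 individuals;
# the resulting z-score is in the hundreds, so the association p-value
# underflows far below double precision.
set.seed(1)
n_ind <- 1e5
qtl   <- rbinom(n_ind, size = 2, prob = 0.3)   # genotype dosages 0/1/2
pheno <- qtl + rnorm(n_ind, sd = 0.5)          # huge effect relative to the noise
summary(lm(pheno ~ qtl))$coefficients          # |t| is on the order of several hundred
|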
Wow @oborisov thanks for sharing this. I will try to reproduce this issue soon once I have access to my computer again. |
Thanks Peter! By the way, in the toy example above I additionally tried the following. Thanks again,
Variables in credible sets: (columns: variable, variable_prob, cs)
Credible sets summary: (columns: cs, cs_log10bf, cs_avg_r2, cs_min_r2, variable) |
That's interesting, thank you for sharing. (It isn't unexpected that the log-Bayes factor is Inf.) |
Thanks for the reply! I initially thought that the cs_log10bf should always be finite. |
@oborisov coloc.susie uses the lbf to actually compare two credible sets. So with the lbf being Inf, how did that work out? Did you change Inf to some finite value before running coloc.susie? Check the |
Thanks @jerome-f for your reply. Next, if I see correctly, |
Ah! That makes sense. I got confused with lbf of CS vs lbf of variable. My bad. |
@oborisov Apologies for the delay — I believe I was able to reproduce your results in your toy example. This was the script I ran (with a few small changes to your code):

library(susieR)

# function to create SNPs in strong LD
add_snp <- function(x, n_replace = 1000) {
  ind_to_change <- sample(length(x), n_replace)
  x[ind_to_change] <- sample(x[ind_to_change])
  return(x)
}

# Read genotype + phenotype data.
raw_gt <- read.table("sim2.raw.gz", header = TRUE)

# Add 10 more variants with strong LD.
variant_name <- grep("qtl", colnames(raw_gt), value = TRUE)
for (i in 1:10) {
  set.seed(i + 1234)
  raw_gt[[paste0("qtl_", i)]] <- add_snp(raw_gt[[variant_name]], n_replace = 1000)
}

# Remove the first variant. It can also be kept, but then only 1
# credible set is identified. If removed, there are 2-3 CSs that
# better illustrate the case above.
raw_gt[[variant_name]] <- NULL

# Create phenotype matrix (one variable).
pheno_mat <- raw_gt[,"PHENOTYPE"]

# Create genotype matrix.
n <- nrow(raw_gt)
p <- ncol(raw_gt)
raw_gt_mat <- as.matrix(raw_gt[,7:p])
storage.mode(raw_gt_mat) <- "numeric"

# Run susie.
fitted <- susie(raw_gt_mat, pheno_mat, verbose = TRUE)
all(is.finite(fitted$lbf_variable))
summary(fitted)

It turns out that the log10 Bayes factors in the summary function are computed in a very silly way, and with the fix (see the latest version on GitHub) the log10BFs are now finite:
I also noticed that indeed the IBSS algorithm was making very slow progress on this data set, and I suspect that the IBSS algorithm has difficulty due to the very strong LD and/or the very strong effects with PIPs very close to 1. In other words, this sort of slow convergence is expected here. So at least this one problem has been solved (but it doesn't sound like this is the main problem you are struggling with). |
Next, running susie_rss:

# Calculate z-scores.
sumstats <- univariate_regression(raw_gt_mat, pheno_mat)
z_scores <- sumstats$betahat / sumstats$sebetahat

# LD matrix.
Rin <- cor(raw_gt_mat)
attr(Rin, "eigen") <- eigen(Rin, symmetric = TRUE)

# Run susie_rss.
fitted_rss <- susie_rss(z = z_scores, n = n, R = Rin, verbose = TRUE)
summary(fitted_rss)

I get this:
The susie and susie_rss results look quite consistent here (except for the BFs). In this case I suspect that the slow convergence is due to the LD structure, and not due to inconsistency between the LD and z-scores. (There are several reasons why IBSS might be slow to converge, and inconsistencies among the summary statistics are only one possible reason.) |
This thread is quite long and I lost track of the conversation — if there are other important issues that remain unresolved, please remind me! |
Thanks Peter! My question above is now resolved. To follow up, here are some things that might be useful for the package:
Also,
Both of these points are not critical to me anymore; I am just mentioning them out of curiosity. Thanks again for your work and best wishes, |
Thanks @oborisov for the feedback. susie_rss is expected to be faster than susie when n, the number of individuals in the data set, is larger than p, the number of SNPs. As for the differences in the Bayes factors between susie and susie_rss, note that susie_rss applied to the z-scores implicitly makes certain assumptions about the analysis, and these assumptions are not necessarily made by susie (e.g., that X and y are standardized); see our PLoS Genetics paper for the details. By contrast, if you run susie_rss with bhat and shat instead of the z-scores, the results should be exactly the same (up to numerical differences). I haven't looked at this very carefully, but overall my sense is that the differences in the numbers between the X and Y axes in your latest kriging plot are relatively small. It is also possible that the tests do not work as well for larger z-scores. We haven't studied the behaviour of these tests very carefully in this setting, so I can't say for sure.
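For reference, a quick sketch of the bhat/shat variant using the objects from the toy example above (this is an illustration, not code from this thread; var(pheno_mat) is used for var_y, the total variance of the phenotype):

# susie_rss with effect estimates and their standard errors rather than z-scores.
fitted_rss2 <- susie_rss(bhat = sumstats$betahat,
                         shat = sumstats$sebetahat,
                         var_y = var(pheno_mat),
                         n = n, R = Rin, verbose = TRUE)
summary(fitted_rss2)
|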
Also please see this vignette (https://stephenslab.github.io/susieR/articles/susie_rss.html), which illustrates the different variants of susie_rss. |
I'm not sure I would expect IBSS to be slow just because of strong LD
and/or PIPs close to 1.
So maybe it is worth looking at this further (I hope I might get time to do
this eventually!)
|
@stephens999 This is odd:
For some reason, susie identified two CSs for SNP 4, and the single effect estimates are of opposite sign. I think this is the source of the convergence difficulty.
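One way to see this directly is to inspect the per-effect posterior quantities stored in the fit (a sketch using the fit from the script above; alpha holds the posterior inclusion probabilities and mu the posterior mean effect sizes, with one row per single effect):

# Rows are the single effects l = 1, ..., L; columns are the variables.
round(fitted$alpha, 3)   # posterior inclusion probabilities for each effect
round(fitted$mu, 3)      # posterior mean effect sizes for each effect
|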
Hmm... that's potentially a bug, right? It violates the "single" assumption of the single effects? |
It could be a bug — I will look into it. |
It is not necessarily a bug - it doesn't violate the single-effect
assumption because we don't impose a constraint that the different single
effects are not the same SNP.
But this *is* unusual behavior, probably representing convergence to a poor
local optimum, and could be worth examining more closely.
|
I believe this is an optimization issue, rather than a "bug" — I suspect that if you run the algorithm for "long enough", it might eventually converge to a better solution (but that might take days!). The current algorithm has difficulty dealing with the combination of many strong effects that are also very strongly correlated with each other. This doesn't happen often, which is probably why we haven't noticed it before, but it makes sense now that I see it. (And to be clear, the same issue arises with susie_rss.) I tried with
More concretely, what is happening, I believe, is that the algorithm overestimates the size of the first effect, and then compensates later by introducing a negative effect. And once it does this, it gets "stuck" in this local optimum. All the SNPs are very strongly correlated, so it doesn't make sense that the first single effect (the first row here) is very different from the others:
So this suggests that there is room for improvement in the optimization algorithm.
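To test the "run it for longer" idea concretely, one simple option is to raise the iteration limit (a sketch on the toy data above; the limit of 1000 is arbitrary, and this can take a long time to run):

# Allow many more IBSS iterations than the default of 100.
fitted_long <- susie(raw_gt_mat, pheno_mat, L = 10, max_iter = 1000, verbose = TRUE)
susie_get_objective(fitted_long)   # check whether the objective has improved
|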
I agree it's an optimization issue. The following sequential procedure seems to help (at least it achieves a better optimum).
I don't know if it will help in general - it may also increase the computation...
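The code for the procedure was not captured above, so the following is only a guess at what a sequential strategy could look like, not necessarily the procedure meant here; it assumes susie's s_init argument accepts a fit with fewer effects than the requested L:

# Fit one effect first, then repeatedly add one more effect,
# warm-starting each fit from the previous solution.
fit <- susie(raw_gt_mat, pheno_mat, L = 1)
for (l in 2:10)
  fit <- susie(raw_gt_mat, pheno_mat, L = l, s_init = fit)
summary(fit)
|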
This example illustrates the limitations of coordinate ascent. |
Hi Peter,
I noticed that cs_log10bf can sometimes have Inf values. It would be good to reset it to the data type maximum.
Best
Jerome