Set log10bf to max if inf #201

Open
jerome-f opened this issue Oct 10, 2023 · 33 comments
Labels
bug Something isn't working


@jerome-f

Hi Peter,

I noticed that cs_log10bf can sometimes contain Inf values. It would be good to reset these to the data type's maximum.

Best
Jerome
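
(A minimal sketch of the suggested capping, assuming a hypothetical numeric vector cs_log10bf of per-CS log10 Bayes factors; .Machine$double.xmax is R's largest finite double:)

# Sketch: cap infinite log10 Bayes factors at the largest finite double.
# `cs_log10bf` is a hypothetical numeric vector of per-CS log10 BFs.
cs_log10bf[is.infinite(cs_log10bf)] <- .Machine$double.xmax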

@pcarbo
Member

pcarbo commented Oct 11, 2023

Thanks for the suggestion @jerome-f. The logBFs should always be finite. Do you have an example that produces an infinite logBF?

@pcarbo added the bug label Oct 11, 2023
@jmmax

jmmax commented Feb 4, 2024

I am having a similar problem to this. I have summary stats from a meta-analysis and 1000G EUR as the LD reference. I used bigsnpr::snp_match to ensure there are no mismatched alleles. I was originally getting an error about non-convergence after 100 iterations. I then removed any SNPs that had a low sample size (per-SNP N is quite variable) and this stopped the convergence error. However, I now get cs_log10bf = Inf:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1        Inf         1         1      645
  2        Inf         1         1      674
  3        Inf         1         1      675
  4        Inf         1         1      671

Here are the susie plots:

[susie plot images attached]

And the diagnostic plot (lambda is 0.52):

[diagnostic plot image attached]

Any advice on this would be much appreciated. Is it likely that the reference panel is too small, or the meta-analysis too heterogeneous, for SuSiE to work in this case?

@pcarbo
Member

pcarbo commented Feb 5, 2024

@jmmax If there is any way you could share your data + code, it would help us to reproduce the problem of infinite BFs, and pinpoint the bug.

Regarding your specific fine-mapping results, removing the SNPs furthest away from the diagonal of the diagnostic plot might improve your results.
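
(A minimal sketch of one way to do that, assuming the conditional_dist data frame returned by kriging_rss; the z_std_diff column name follows the susieR RSS diagnostic vignette and should be checked against your version, and the threshold of 3 is an arbitrary choice:)

# Sketch: drop SNPs whose observed z-score deviates strongly from its
# expected value given the other z-scores and the LD matrix R.
condz <- kriging_rss(z_scores, R, n = n)
outliers <- which(abs(condz$conditional_dist$z_std_diff) > 3)
if (length(outliers) > 0) {
  z_clean <- z_scores[-outliers]
  R_clean <- R[-outliers, -outliers]
}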

@oborisov

oborisov commented Jul 8, 2024

Hi Peter,

I will ask here since my question is related to the infinite log10bf.

I am running susieR in a region with several very significant variants (p < 1e-5000). As a result, I am obtaining 10 credible sets, with an infinite "cs_log10bf" in 9 out of 10 sets (and each set consists of a single variable). I used a completely matched in-sample LD matrix (same individuals, same variants). The genotypes are well imputed and there are no missing values. I checked the sanity of my input data according to the SuSiE recommendations; the plot is attached and it looks good, I assume?
[kriging plot screenshot attached]

I ran susie_rss with verbose=TRUE. I saw that the "objective" in the output usually "converges" at some value quickly, e.g.:

[1] "objective:-56012.4505928347"
[1] "objective:-56009.8247200001"
[1] "objective:-56009.8188294932"
[1] "objective:-56009.818824447"

In my analysis, it was:

[1] "objective:-48181.7247065402"
...
[1] "objective:-17785.0571885817"
[1] "objective:-17564.5765219636"
[1] "objective:-17345.7019319673"
IBSS algorithm did not converge in 100 iterations!
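
(Side note: the per-iteration objective printed above is also stored on the returned fit object, so the trace can be inspected programmatically. A minimal sketch, assuming a susie_rss fit called fitted_rss:)

# Sketch: inspect the stored ELBO trace and the convergence flag.
plot(fitted_rss$elbo, type = "l",
     xlab = "IBSS iteration", ylab = "objective (ELBO)")
fitted_rss$converged  # FALSE when IBSS hit the iteration limit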

I did some follow-up tests: I multiplied the standard errors of the GWAS effect sizes by 10 and re-calculated the z-scores. In this case, susieR quickly converged and produced a reasonable summary (1 credible set with cs_log10bf=50 and 30 variables). When trying "SE*3", I got 2 credible sets: one had an infinite cs_log10bf but still 11 variables in it. With "SE*2", I got four credible sets, two of them with a single variable and an infinite cs_log10bf.

Could it be that the infinite "cs_log10bf" is related to the very strong significance (i.e., very high z-scores)? If so, what could be a solution to make susieR produce finite "cs_log10bf" (tweak some options)? If not, maybe there are other things I could try?
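
(For reference, the SE-scaling test described above amounts to something like this sketch, where sumstats holds the output of univariate_regression, and Rin and n are the LD matrix and sample size used elsewhere in this thread:)

# Sketch: inflate standard errors by a factor of 10, shrinking the z-scores.
z_rescaled <- sumstats$betahat / (10 * sumstats$sebetahat)
fitted_test <- susie_rss(z = z_rescaled, R = Rin, n = n)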

Thanks,
Oleg

@stephens999
Contributor

stephens999 commented Jul 8, 2024 via email

@pcarbo
Member

pcarbo commented Jul 8, 2024

Yes, if everything is the same. See here for example. In any case the kriging plot looks pretty good (if not exactly on the diagonal) — I don't see major cause for concern.

Since the associations are very significant, it is possible that the effects are quite large, and perhaps fixing the prior variance (see the "prior_variance" parameter) would improve the behaviour of IBSS. It is worth a try, IMO.
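
(A minimal sketch of what fixing the prior variance could look like; prior_variance and estimate_prior_variance are documented susie_rss arguments, and the value 0.2 here is an arbitrary illustration, not a recommendation:)

# Sketch: fix the prior effect-size variance rather than estimating it.
fitted_rss <- susie_rss(z = z_scores, R = Rin, n = n,
                        prior_variance = 0.2,
                        estimate_prior_variance = FALSE)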

@stephens999
Contributor

stephens999 commented Jul 8, 2024 via email

@pcarbo
Member

pcarbo commented Jul 8, 2024

Yes, that is possible. I don't think we have checked this.

@oborisov

oborisov commented Jul 8, 2024

Thanks a lot, Matthew and Peter!

  • I will then re-check my LD matrix (allele matching, etc.) to ensure that the z-scores on the kriging plot lie on the identity line. I had a similar idea that with very large effect sizes, even the smallest mismatch between the LD matrix and the GWAS sumstats becomes substantial and prevents the algorithm from converging.
  • Also, thanks for the hint about the "prior_variance" parameter; I will look into it.

Will keep you posted!

Cheers,
Oleg

@oborisov

oborisov commented Jul 15, 2024

Hi again,

Here are the results of a few tests I did:

  1. I took 4 SNPs with a very significant p-value (< 1e-5000), and extracted the individual-level data (approx. N = 40,000) and the respective phenotypes. I ensured that the genotype and phenotype matrices are matched (as in the test example for N3finemapping$X and N3finemapping$Y). Then I used the following code:
sumstats <- univariate_regression(raw_gt_mat, pheno_mat) # where raw_gt_mat is formatted as N3finemapping$X and pheno_mat is formatted as N3finemapping$Y
z_scores <- sumstats$betahat / sumstats$sebetahat
Rin = cor(raw_gt_mat)
attr(Rin, "eigen") = eigen(Rin, symmetric = TRUE)
condz_in = kriging_rss(z_scores, Rin, n=n)
condz_in$plot

I am attaching the resulting kriging plot.
[kriging plot attached]

Can you think of any reasons why the z-scores are still not on the identity line?

  2. I tried to tweak the "prior_variance" parameter but it didn't help with the convergence.

Thanks for your help!

Best,
Oleg

@pcarbo
Member

pcarbo commented Jul 16, 2024

Hi @oborisov thanks for sharing. Because your z-scores are so large, this seems very much like an "ill-conditioned problem" where small numerical errors can lead to large differences. (There could be a simpler explanation that hasn't occurred to me.)

In any case this is a situation we haven't studied carefully, so this is interesting. If you could (somehow) share a small data set reproducing these errors/warnings, I'd be happy to take a look and perhaps I can pinpoint the cause.

@oborisov

Thanks Peter!

I simulated some data that I hope can help to reproduce the case above.

I used the following shell code ("plink" is PLINK 1.9, https://www.cog-genomics.org/plink/1.9/, and "plink2" is PLINK 2.0, https://www.cog-genomics.org/plink/2.0/):

echo "1 qtl 0.05 0.95  0.25 0" > wgas.sim
plink --simulate-qt wgas.sim --make-bed --out sim1 --simulate-n 100000
plink2 --bfile sim1 --export A --glm allow-no-covars --out sim2
cat sim2.PHENO1.glm.linear | column -t

It simulated 1 variant in 100,000 individuals with a very low p-value (p=4.55636e-6247). The simulated matrix of genotypes and phenotypes is attached (sim2.raw.gz).

Then I used the following R code:


### R code ###
library(susieR)
# function to create a near-copy of a SNP in strong LD, by shuffling
# the genotypes of n_replace randomly chosen individuals
add_snp <- function(x, n_replace=1000) {
  ind_to_change <- sample(length(x), n_replace)
  x[ind_to_change] <- sample(x[ind_to_change])
  x
}
# read genotype+phenotype data
raw_gt <- read.table("sim2.raw.gz", header=T)
# add 10 more variants with strong LD
variant_name <- grep("qtl", colnames(raw_gt), value=T)
for (i in 1:10) {
  set.seed(i+1234)
  raw_gt[[paste0("qtl_", i)]] <- add_snp(raw_gt[[variant_name]], n_replace=1000)
}
# remove the first variant. It can also be kept, but then only 1 credible set is identified. If removed, there are 2-3 CSs that better illustrate the case above.
raw_gt[[variant_name]] <- NULL
# create phenotype matrix (one variable)
pheno_mat <- as.matrix(raw_gt[,c("PHENOTYPE")])
# create genotype matrix
raw_gt_mat <- raw_gt[,7:ncol(raw_gt)]
n <- nrow(raw_gt_mat)
# calculate z-scores
sumstats <- univariate_regression(raw_gt_mat, pheno_mat)
z_scores <- sumstats$betahat / sumstats$sebetahat
# correlation matrix
Rin = cor(raw_gt_mat)
attr(Rin, "eigen") = eigen(Rin, symmetric = TRUE)

Correlation matrix (first 4 variables)
          qtl_1     qtl_2     qtl_3     qtl_4
qtl_1 1.0000000 0.9807084 0.9807084 0.9814978
qtl_2 0.9807084 1.0000000 0.9797956 0.9805111
qtl_3 0.9807084 0.9797956 1.0000000 0.9803630
qtl_4 0.9814978 0.9805111 0.9803630 1.0000000

# kriging plot
condz_in = kriging_rss(z_scores, Rin, n=n)
condz_in$plot
dev.off()

The plot is attached.
[kriging plot attached]

Finally, I ran the fine-mapping:

# susie_rss
fitted_rss <- susie_rss(z = z_scores,
                        n = n,
                        R = Rin)
summary(fitted_rss)

Warning message:
In susie_suff_stat(XtX = XtX, Xty = Xty, n = n, yty = (n - 1) * :
IBSS algorithm did not converge in 100 iterations!
Please check consistency between summary statistics and LD matrix.
See https://stephenslab.github.io/susieR/articles/susierss_diagnostic.html

Variables in credible sets:

 variable variable_prob cs
        4     0.9999999  1
        8     0.9999955  2
        3     0.5724716  3
        1     0.3996082  3

Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1        Inf  1.000000  1.000000        4
  2        Inf  1.000000  1.000000        8
  3   214.1423  0.961789  0.961789      1,3

Thanks again,
Oleg

@pcarbo
Member

pcarbo commented Jul 18, 2024

Wow @oborisov thanks for sharing this. I will try to reproduce this issue soon once I have access to my computer again.

@oborisov

Thanks Peter!

By the way, in the toy example above I additionally tried the susie() method with the simulated individual-level data, which gave me cs_log10bf=Inf. Using the real data, susie seems to converge much faster than susie_rss and gives different output (more "reasonable"?): e.g., susie_rss gives 10 CSs, each with one variable and all with cs_log10bf=Inf, while susie gives six CSs with different numbers of variables (from 1 to 15), and only two have cs_log10bf=Inf while the other four have finite values. (Maybe it would be easier to reproduce the behavior with susie rather than susie_rss.)

Thanks again,
Oleg

# susie
raw_gt_mat <- sapply(raw_gt_mat, as.numeric)
fitted <- susie(raw_gt_mat, as.matrix(pheno_mat))
summary(fitted)

Variables in credible sets:

 variable variable_prob cs
        4     1.0000000  1
        8     0.9999963  2
        1     0.5906970  4
        3     0.4084430  4

Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1        Inf  1.000000  1.000000        4
  2        Inf  1.000000  1.000000        8
  4        Inf  0.961789  0.961789      1,3

@pcarbo
Member

pcarbo commented Jul 22, 2024

That's interesting, thank you for sharing. (It isn't unexpected that the log-Bayes factor is Inf.)

@oborisov

Thanks for the reply! I initially thought that the cs_log10bf should always be finite.
#201 (comment)
So even if we get cs_log10bf = Inf, we can still consider the identified CSs as reliable, right? (Given that the input is correct, i.e., the individual-level data was used, so there can be no inconsistency between the LD matrix and sumstats.)
Actually, in my analyses, I tried to use these CSs (with Inf cs_log10bf) in the downstream colocalization (coloc.susie), which produced reasonable and expected results.

@jerome-f
Author

jerome-f commented Jul 23, 2024

@oborisov coloc.susie uses the lbf to actually compare two credible sets, so with the lbf being Inf, how did that work out? Did you change Inf to some finite value before running coloc.susie? Check the coloc.bf_bf function.

@oborisov

Thanks @jerome-f for your reply.
I was observing Inf values when applying the summary method to the fitted object (the output of susie()), as in the code in my message above. I now looked into the source code of the summary.susie() function. These "Inf" values are likely produced in line 35, namely by log10(exp(object$lbf[cs$cs[i]]))
https://github.com/stephenslab/susieR/blob/master/R/summary.susie.R#L35C12-L35C22
I looked into the actual lbf values and they were finite (fitted$lbf).
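
(For intuition: exp(x) overflows double precision once x exceeds roughly 709, so log10(exp(lbf)) returns Inf for perfectly finite lbf values, whereas dividing by log(10) converts bases directly. A minimal sketch:)

lbf <- 1500          # a large but finite natural-log Bayes factor
log10(exp(lbf))      # Inf: exp(1500) overflows double precision
lbf / log(10)        # ~651.4, the numerically stable base conversion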

Next, if I see correctly, coloc.bf_bf takes lbf per variable, i.e., object$lbf_variable
https://github.com/chr1swallace/coloc/blob/main/R/susie.R#L91-L95
I checked the fitted$lbf_variable matrix and there were no Inf values, so coloc.susie should work fine, I assume.

@jerome-f
Author

Ah! That makes sense. I got confused between the lbf of a CS and the lbf of a variable. My bad.

@pcarbo
Member

pcarbo commented Aug 1, 2024

@oborisov Apologies for the delay — I believe I was able to reproduce your results in your toy example. This was the script I ran (with a few small changes to your code):

library(susieR)

# function to create SNPs in strong LD
add_snp <- function(x, n_replace = 1000) {
  ind_to_change    <- sample(length(x), n_replace)
  x[ind_to_change] <- sample(x[ind_to_change])
  return(x)
}

# read genotype+phenotype data.
raw_gt <- read.table("sim2.raw.gz", header = TRUE)

# Add 10 more variants with strong LD.
variant_name <- grep("qtl", colnames(raw_gt),value = TRUE)
for (i in 1:10) {
  set.seed(i + 1234)
  raw_gt[[paste0("qtl_",i)]] <- add_snp(raw_gt[[variant_name]],
                                        n_replace = 1000)
}

# Remove the first variant. It can also be kept, but then only 1
# credible set is identified. If removed, there are 2-3 CSs that
# better illustrate the case above.
raw_gt[[variant_name]] <- NULL

# create phenotype matrix (one variable)
pheno_mat <- raw_gt[,"PHENOTYPE"]

# create genotype matrix
n <- nrow(raw_gt)
p <- ncol(raw_gt)
raw_gt_mat <- as.matrix(raw_gt[,7:p])
storage.mode(raw_gt_mat) <- "numeric"

# Run susie.
fitted <- susie(raw_gt_mat,pheno_mat,verbose = TRUE)
all(is.finite(fitted$lbf_variable))
summary(fitted)

It turns out that the log10 Bayes factors in the summary function are computed in a very silly way, and with the fix (see the latest version on GitHub) the log10BFs are now finite:

> summary(fitted)
Variables in credible sets:
 variable variable_prob cs
        4     1.0000000  1
        8     0.9999963  2
        1     0.5906970  4
        3     0.4084430  4
Credible sets summary:
 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1  3098.6552  1.000000  1.000000        4
  2   779.3524  1.000000  1.000000        8
  4   661.2822  0.961789  0.961789      1,3

I also noticed that indeed the IBSS algorithm was making very slow progress on this data set, and I suspect the IBSS algorithm has difficulty due to the very strong LD and/or the very strong effects with PIPs very close to 1. In other words, this sort of slow convergence is expected here.

So at least this one problem has been solved (but it doesn't sound like this is the main problem you are struggling with).

@pcarbo
Member

pcarbo commented Aug 1, 2024

Next, running susie_rss:

# Calculate z-scores.
sumstats <- univariate_regression(raw_gt_mat, pheno_mat)
z_scores <- sumstats$betahat / sumstats$sebetahat

# LD matrix.
Rin <- cor(raw_gt_mat)
attr(Rin,"eigen") <- eigen(Rin,symmetric = TRUE)

# Run susie_rss.
fitted_rss <- susie_rss(z = z_scores,n = n,R = Rin,verbose = TRUE)
summary(fitted_rss)

I get this:

> summary(fitted_rss)
Variables in credible sets:
 variable variable_prob cs
        4     1.0000000  1
        8     0.9999963  2
        1     0.5906970  4
        3     0.4084430  4
Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1  3098.6552  1.000000  1.000000        4
  2   779.3524  1.000000  1.000000        8
  4   661.2822  0.961789  0.961789      1,3

The susie and susie_rss results look quite consistent here (except for the BFs).

In this case I suspect that the slow convergence is due to the LD structure, and not due to inconsistency between the LD and z-scores. (There are several reasons why IBSS might be slow to converge, and inconsistency among the summary statistics is only one possible reason.)

@pcarbo
Member

pcarbo commented Aug 1, 2024

This thread is quite long and I lost track of the conversation — if there are other important issues that remain unresolved, please remind me!

@oborisov

oborisov commented Aug 4, 2024

Thanks Peter!

My question above is now resolved.

To follow up on things that might be useful for the package:

  1. I tried my code above with the updated susieR version. I am getting exactly the same results as you when running susie(). When running susie_rss(), I am getting different numeric output but with the same variables selected (I guess that's what you meant by "except for the BFs"). I guess these are numerical differences and the results are essentially the same.

Also, susie_rss() was faster than susie().

Variables in credible sets:

 variable variable_prob cs
        4     0.9999999  1
        8     0.9999955  2
        3     0.5724716  3
        1     0.3996082  3

Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1   813.6982  1.000000  1.000000        4
  2   948.8904  1.000000  1.000000        8
  3   214.1423  0.961789  0.961789      1,3
  2. I am still unsure why the kriging_rss plot (above) did not show identical expected and observed z-scores.

Neither of these points is critical to me anymore; I am just mentioning them out of curiosity.

Thanks again for your work and best wishes,
Oleg

@pcarbo
Member

pcarbo commented Aug 6, 2024

Thanks @oborisov for the feedback.

susie_rss is expected to be faster than susie when n, the number of individuals in the data set, is larger than p, the number of SNPs.

As for the differences in the Bayes factors between susie and susie_rss, note that susie_rss applied to the z-scores implicitly makes certain assumptions about the analysis, and these assumptions are not necessarily made by susie (e.g., that X and y are standardized); see our PLoS Genetics paper for the details. By contrast, if you run susie_rss with bhat and shat instead of the z-scores, the results should be exactly the same (up to numerical differences).
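
(A minimal sketch of the bhat/shat call, reusing the objects from the toy example above and assuming pheno_mat from that example; var_y, the phenotype variance, is a documented susie_rss argument:)

# Sketch: run susie_rss from effect estimates and standard errors
# instead of z-scores.
fitted_rss2 <- susie_rss(bhat = sumstats$betahat,
                         shat = sumstats$sebetahat,
                         n = n, R = Rin,
                         var_y = var(as.numeric(pheno_mat)))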

I haven't looked at this very carefully, but overall my sense is that the differences in the numbers between the X and Y axes in your latest kriging plot are small relatively speaking. It is also possible that the tests do not work as well for larger z-scores. We haven't studied the behaviour of these tests very carefully in this setting, so I can't say for sure.

@pcarbo
Member

pcarbo commented Aug 6, 2024

Also, please see this vignette, which illustrates the different variants of susie_rss.

@stephens999
Contributor

stephens999 commented Aug 14, 2024 via email

@pcarbo
Member

pcarbo commented Aug 14, 2024

@stephens999 This is odd:

> fitted <- susie(raw_gt_mat,pheno_mat,L = 4,verbose = TRUE)
> round(fitted$alpha,digits=3)
     qtl_1 qtl_2 qtl_3 qtl_4 qtl_5 qtl_6 qtl_7 qtl_8 qtl_9 qtl_10
[1,] 0.000     0 0.000     1     0     0     0     0 0.000      0
[2,] 0.000     0 0.000     0     0     0     0     1 0.000      0
[3,] 0.000     0 0.000     1     0     0     0     0 0.000      0
[4,] 0.591     0 0.408     0     0     0     0     0 0.001      0
> round(fitted$mu,digits=3)
      qtl_1  qtl_2  qtl_3  qtl_4  qtl_5  qtl_6  qtl_7  qtl_8  qtl_9 qtl_10
[1,]  0.324  0.325  0.324  0.328  0.324  0.325  0.324  0.322  0.325  0.324
[2,]  0.163  0.164  0.163  0.162  0.164  0.164  0.164  0.165  0.164  0.164
[3,] -0.138 -0.137 -0.137 -0.142 -0.137 -0.137 -0.137 -0.139 -0.136 -0.137
[4,]  0.152  0.151  0.152  0.149  0.151  0.151  0.151  0.149  0.152  0.151

For some reason, susie identified two CSs for SNP 4, and the single effect estimates are of opposite sign. I think this is the source of the convergence difficulty.

@jerome-f
Author

Hmm... that's potentially a bug, right? It violates the "single" assumption of the single effect?

@pcarbo
Member

pcarbo commented Aug 15, 2024

It could be a bug — I will look into it.

@stephens999
Contributor

stephens999 commented Aug 17, 2024 via email

@pcarbo
Member

pcarbo commented Aug 19, 2024

I believe this is an optimization issue, rather than a "bug" — I suspect that if you run the algorithm for "long enough", it might eventually converge to a better solution (but that might take days!). The current algorithm has difficulty dealing with the combination of many strong effects that are also very strongly correlated with each other. This doesn't happen often, which is probably why we haven't noticed it before, but it makes sense now that I see it.

(And to be clear, the same issue arises with susie_rss.)

I tried with refine = TRUE, and it didn't fix the problem.

More concretely, what is happening, I believe, is that it overestimates the size of the first effect, and then compensates later by introducing a negative effect. And once it does this, it gets "stuck" in this local optimum. All the SNPs are very strongly correlated, so it doesn't make sense that the first single effect (the first row here) is very different from the others:

> round(fitted$mu,digits = 3)
      qtl_1  qtl_2  qtl_3  qtl_4  qtl_5  qtl_6  qtl_7  qtl_8  qtl_9 qtl_10
[1,]  0.338  0.339  0.338  0.343  0.339  0.339  0.338  0.336  0.339  0.338
[2,]  0.161  0.162  0.161  0.160  0.162  0.162  0.162  0.163  0.162  0.162
[3,] -0.154 -0.152 -0.153 -0.158 -0.152 -0.153 -0.153 -0.155 -0.152 -0.153
[4,]  0.155  0.155  0.155  0.152  0.155  0.155  0.154  0.152  0.155  0.154

So this suggests that there is room for improvement in the optimization algorithm.

@stephens999
Contributor

I agree it's an optimization issue. (I don't think it is weird for the first single effect to look very different from the subsequent single effects, though; it would be weird if they looked the same, because the subsequent single effects are supposed to capture signals not captured in the first single effect.) It's a little surprising to me that refine=T did not help here.

The following sequential procedure seems to help (at least it achieves a better optimum and avoids including the same variable in two different CSs):

# Fit a single-effect model first, then warm-start each larger model
# from the previous fit via s_init.
fitted.1 <- susie(raw_gt_mat, pheno_mat, verbose = TRUE, L = 1)
fitted.2 <- susie(raw_gt_mat, pheno_mat, verbose = TRUE, s_init = fitted.1, L = 2)
fitted.3 <- susie(raw_gt_mat, pheno_mat, verbose = TRUE, s_init = fitted.2, L = 3)
fitted.4 <- susie(raw_gt_mat, pheno_mat, verbose = TRUE, s_init = fitted.3, L = 4)
summary(fitted.4)

I don't know if it will help in general, and it may increase the computation...

@pcarbo
Member

pcarbo commented Aug 26, 2024

This example illustrates the limitations of coordinate ascent.
