Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Also encountered "The estimated prior variance is unreasonably large" using ieugwasr ld_matrix_local function to calculate the ld matrix #223

Open
laleoarrow opened this issue Apr 11, 2024 · 20 comments

Comments

@laleoarrow
Copy link

Also encountered "The estimated prior variance is unreasonably large" using ieugwasr ld_matrix_local function to calculate the ld matrix. As I have checked the original code of ld_matrix_local function in ieugwasr package in GitHub, when calculating the ld matrix ld_matrix_local does have "--keep-allele-order" argument in the plink command. But such error still exists. Please advice.
image

@laleoarrow
Copy link
Author

FYI my session info:
`> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] LDlinkR_1.3.0 plinkbinr_0.0.0.9000 ieugwasr_0.2.2-9000 TwoSampleMR_0.5.10 coloc_5.2.3
[6] vroom_1.6.5 data.table_1.15.4 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[11] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[16] ggplot2_3.5.0 tidyverse_2.0.0

loaded via a namespace (and not attached):
[1] gtable_0.3.4 htmlwidgets_1.6.4 ggrepel_0.9.5 lattice_0.22-5 tzdb_0.4.0
[6] vctrs_0.6.5 tools_4.3.3 generics_0.1.3 parallel_4.3.3 curl_5.2.1
[11] fansi_1.0.6 pkgconfig_2.0.3 Matrix_1.6-5 ggnewscale_0.4.10 RcppParallel_5.1.7
[16] uuid_1.2-0 lifecycle_1.0.4 compiler_4.3.3 munsell_0.5.0 htmltools_0.5.7
[21] pillar_1.9.0 crayon_1.5.2 viridis_0.6.5 tidyselect_1.2.0 digest_0.6.34
[26] stringi_1.8.3 fastmap_1.1.1 grid_4.3.3 colorspace_2.1-0 cli_3.6.2
[31] magrittr_2.0.3 patchwork_1.2.0 Rfast_2.1.0 utf8_1.2.4 withr_3.0.0
[36] scales_1.3.0 rsthemes_0.4.0 RcppZiggurat_0.1.6 bit64_4.0.5 timechange_0.3.0
[41] httr_1.4.7 matrixStats_1.2.0 bit_4.0.5 gridExtra_2.3 hms_1.1.3
[46] plink2R_1.1 irlba_2.3.5.1 viridisLite_0.4.2 geni.plots_0.0.9 rlang_1.1.3
[51] ggiraph_0.8.9 Rcpp_1.0.12 mixsqp_0.3-54 glue_1.7.0 susieR_0.12.40
[56] rstudioapi_0.15.0 reshape_0.8.9 jsonlite_1.8.8 R6_2.5.1 plyr_1.8.9
[61] systemfonts_1.0.6 `

@pcarbo
Copy link
Member

pcarbo commented Apr 11, 2024

@leoarrow1 If you haven't already done so, I would recommend checking for inconsistencies between the LD matrix and the matrix z-scores following this vignette.

@laleoarrow
Copy link
Author

laleoarrow commented Apr 11, 2024

@leoarrow1 If you haven't already done so, I would recommend checking for inconsistencies between the LD matrix and the matrix z-scores following this vignette.

Thanks for your prompt response. Yes I have check this vignette as well as the former similar post on this issue. The solution was to add the "Adding the flag --keep-allele-order" in plink. But I have checked the ieugwasr code, it does have this flag in it. So this problem might be cause due to something else? Could you perhaps spot anything unusual in my code? dataset is a standard format that is suitable for coloc analysis. (i.e., Only missing the LD matrix for color-susie)

Coloc_to_SuSiE <- function( # my self-made function to change coloc data into coloc_susie data
    dataset, 
    # ld_method = "plink", 
    bfile = "/Users/xxx/project/ref/1kg.v3/EUR"
    ) {
dataset$LD <- ieugwasr::ld_matrix_local(
                           dataset$snp,
                           bfile = bfile,
                           plink_bin = plinkbinr::get_plink_exe(),
                           with_alleles = FALSE
                           )

ld_rows <- rownames(dataset[["LD"]])
snps <- dataset[["snp"]]
index <- match(ld_rows, snps)

for (i in names(dataset)) {
    if (i %in% c("LD", "type", "s", "N")) next
    if (length(dataset[[i]]) > ld_rows_number & length(dataset[[i]]) > 1) {
      dataset[[i]] <- dataset[[i]][index]
      if (length(dataset[[i]]) == ld_rows_number) {leo_message("91", paste0("The SNP in element <", i, "> is now matched with the LD matrix"))}
    }
dataset$N <- as.integer(dataset$N)[1]
  }
  return(dataset)
}

# the error occurs here
dat = Coloc_to_SuSiE(dataset)
S3 = runsusie(dat) # here.

Any thought is highly appreciated.

@pcarbo
Copy link
Member

pcarbo commented Apr 12, 2024

@leoarrow1 To better pinpoint the cause of the error, I would try to run susie_rss directly instead of through coloc.

@laleoarrow
Copy link
Author

@leoarrow1 To better pinpoint the cause of the error, I would try to run susie_rss directly instead of through coloc.

Thanks! I will try so as soon as possible and let you know the result.

@laleoarrow
Copy link
Author

laleoarrow commented Apr 15, 2024

Hi! I have tried running susie_rss directly which did yielded a result/ But the it did not converge as well, and this outcome is evidently incorrect (there is obviously too many cs?).
image
Upon reviewing the vignette, I identified inconsistencies in my susie_rss input data but, regrettably, I cannot locate their source.
image
I have checked the order of z and ld (rowname of ld) provided for susie_rss is in the same order though. Here is the code I used for susie_rss. Can you spot anything incorrectly? Any help is appreciated!

variants = dataset$snp 
# > head(variants)
# [1] "rs112438356" "rs112327275" "rs9270475"   "rs77122291"  "rs9270487"   "rs9270493" 
bfile = "/Users/leoarrow/project/ref/1kg.v3/EUR" # downloaded from https://mrcieu.github.io/ieugwasr/articles/local_ld.html
plink_bin = plinkbinr::get_plink_exe()

dataset$LD  <- ieugwasr::ld_matrix_local(
                           dataset$snp,
                           bfile = bfile,
                           plink_bin = plinkbinr::get_plink_exe(),
                           with_alleles = FALSE
                           )

ld_rows <- rownames(dataset[["LD"]]); ld_rows_number <- length(ld_rows)
snps <- dataset[["snp"]]; input_snps_number <- length(snps)
index <- match(ld_rows, snps)

susie_fit = susie_rss(dataset$z, dataset$LD, n=n)
susie_plot(susie_fit, y='PIP')
# lambda = estimate_s_rss(dataset$z, dataset$LD, n=n)
# lambda
condz = kriging_rss(dataset$z, dataset$LD, n=n)
condz$plot

@laleoarrow
Copy link
Author

laleoarrow commented Apr 15, 2024

FYI, the source code of ieugwasr::ld_matrix_local is as below for ld matrix calculation. It use plink for calculation but it does have the --keep-allele-order tag in it: (Ref:https://github.com/MRCIEU/ieugwasr/blob/master/R/ld_matrix.R)

ld_matrix_local <- function(variants, bfile, plink_bin, with_alleles=TRUE) {
	# Make textfile
	shell <- ifelse(Sys.info()['sysname'] == "Windows", "cmd", "sh")
	fn <- tempfile()
	write.table(data.frame(variants), file=fn, row.names=FALSE, col.names=FALSE, quote=FALSE)

	
	fun1 <- paste0(
		shQuote(plink_bin, type=shell),
		" --bfile ", shQuote(bfile, type=shell),
		" --extract ", shQuote(fn, type=shell), 
		" --make-just-bim ", 
		" --keep-allele-order ",
		" --out ", shQuote(fn, type=shell)
	)
	system(fun1)

	bim <- read.table(paste0(fn, ".bim"), stringsAsFactors=FALSE)

	fun2 <- paste0(
		shQuote(plink_bin, type=shell),
		" --bfile ", shQuote(bfile, type=shell),
		" --extract ", shQuote(fn, type=shell), 
		" --r square ", 
		" --keep-allele-order ",
		" --out ", shQuote(fn, type=shell)
	)
	system(fun2)
	res <- read.table(paste0(fn, ".ld"), header=FALSE) %>% as.matrix
	if(with_alleles)
	{
		rownames(res) <- colnames(res) <- paste(bim$V2, bim$V5, bim$V6, sep="_")
	} else {
		rownames(res) <- colnames(res) <- bim$V2
	}
	return(res)
}

It looks all right to me. Or is there any other option to do LD matrix calculation?
PS. I'm using gwas summary data and I do not have the genotype data. Also coloc recommend to use 1000 or above snps in the coloc loci, so LD matrix can only be calculated using local method?

@pcarbo
Copy link
Member

pcarbo commented Apr 15, 2024

@leoarrow1 I'm not familiar with the ieugwasr R package (I have not used it before), but if you compute the LD matrix from the same genotype data used in computing the association statistics (e.g., z-scores), there should be no inconsistencies in your kriging_rss plot. On the other hand, if the LD matrix is computed using different genotype data, then it is quite possible to obtain large inconsistencies. Removing the largest inconsistencies may involve removing some SNPs, but we have not (yet) developed a systematic procedure for removing large inconsistencies. Hope this helps (a bit).

@laleoarrow
Copy link
Author

laleoarrow commented Apr 16, 2024 via email

@pcarbo
Copy link
Member

pcarbo commented Apr 16, 2024

@leoarrow1 It would depend whether you have access to the same data that was used to generate the association statistics. If the answer is no, then we do not have precise recommendations except that you should work with genotype data that most closely resembles the data used to generate the association statistics. Perhaps @gaow can provide more detailed recommendations; I believe he is working on bioinformatics pipelines to help with this.

@laleoarrow
Copy link
Author

laleoarrow commented Apr 17, 2024 via email

@akhilpampana
Copy link

Hello, I am also facing a similar issue with ld matrix inconsistency. I have used PLINK to generate LD matrices and kept --keep-allele-order. I am using the meta effect of gwas analysis and using 1000g as a reference for LD. My correlation plot is off. Can you help me out with this issue? Here is the plot for the reference. I have used a 1000g complete dataset as the gwas results are from multi-ancestry datasets.
image

I have also noticed a thing like variant which is shown to have significant tophit for the loci is not available in the credible set. Is it the opposite that we should have that top variant in the credible set?

Regards
Akhil

@stephens999
Copy link
Contributor

stephens999 commented Apr 19, 2024 via email

@akhilpampana
Copy link

Thank you so much for the response. I understand its very difficult to focus on all pipelines given the tool usage and appreciate your kind response.

I just want to know if the significant variant in the loci shouldl be present in the credible set or not.

Regards
Akhil

@laleoarrow
Copy link
Author

Hi, quick update for all. I have solved my issue (partially at least). I used ieugwasr package to calculate my ld matrix for a GWAS summary data using 1kg individual data as a reference. ieugwasr incorporated the plink 1.9 when calculating ld matrix and it does have the -keep-allele-order flag in it, however, the allele order would still possible be inconsistent with allele order in your own summary data. So the solution is to flip the allele order when needed. This should at least get you to converge in 100 iteration and a better z-score plot.

Nonetheless, I identified 9 cs in a loci with only 200+ snp. Is this possible? Here is the cs plot and z-score plot. Can anyone spot anything unusual in it? Thanks!
69181713446954_ pic
69111713446843_ pic

@laleoarrow
Copy link
Author

Thank you so much for the response. I understand its very difficult to focus on all pipelines given the tool usage and appreciate your kind response.

I just want to know if the significant variant in the loci shouldl be present in the credible set or not.

Regards Akhil

The top snp should be in the strongest cs from my perspective.

@gaow
Copy link
Member

gaow commented Apr 21, 2024

@leoarrow1 @akhilpampana these problems typically has to do with problematic bioinformatics steps to unify the alleles between summary statistics and LD. The discrepancy between z-scores and LD reference is also a secondary concern.

We do have some pipelines that are still work in progress, to try cope with these issues. It contains two parts:

  1. we unify or harmonize the alleles between data-sets to the best of our knowledge.
  2. we apply some approaches to detect and remove variants where LD and z-scores have large discrepancies.

the 2nd item is still work in progress not ready for public use, although we have a separate branch of susieR to implement the prototype. The entire procedure is outlined here althought at this point it might be the most helpful to compare notes of your procedure with our basic QC for summary statistics.

@akhilpampana
Copy link

I think I have figured out my issue regarding the discrepancy of the plot but I will checkthrough the tutorial mentioned above just to be sure and test the analysis in the updated version of the ld matrix.

Thank you so much @gaow for the pipeline and @leoarrow1 for the suggestion

Regards
Akhil

@pcarbo
Copy link
Member

pcarbo commented Apr 22, 2024

Thank you for sharing @akhilpampana. If you could share in more detail what was the issue and how you solved it, this might be helpful for us (and others) to know.

@pcarbo
Copy link
Member

pcarbo commented Apr 22, 2024

The top snp should be in the strongest cs from my perspective.

This will often be true, but sometimes not. See Fig. 1 of the susie paper for an illustration of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants