This repository contains code and instructions for heritability and genetic correlation analyses that apply the MultiResponseVarianceComponentModels.jl Julia package to the PsychENCODE brain gene and isoform expression data. A large body of primary literature, review papers, and educational materials explains the intuition behind heritability and genetic correlation analyses in broad strokes. Briefly, heritability captures the proportion of trait variation attributable to genetic effects, while genetic correlation captures the extent of shared genetic influences, or genetic overlap and pleiotropy, between traits. For technical details with clear presentation and mathematical notation, see the Methods section of the associated paper, Kim et al. 2022. All analyses herein were conducted solely in Julia. If you have any questions, let me know via my email [email protected].
To install necessary packages and activate a separate environment, open Julia within the directory and type:
julia> ]
(@v1.10) pkg> activate .
(isoform-genetics) pkg> instantiate
Some data, such as GENCODE, are downloaded automatically when missing, whereas other data, such as those listed below, are not publicly accessible and must be made available before running any Julia code. See below for some notes on these required data.
data
├── expression
│ ├── PsychENCODE-EUR-gene.BED.gz # normalized PsychENCODE gene expression
│ ├── PsychENCODE-EUR-isoform.BED.gz # normalized PsychENCODE isoform expression
│ └── PsychENCODE-EUR-covariates.tsv # covariates for mean (or fixed) effects
├── genotype
│ └── Capstone4.HRC.European.unique.frontal.nochr.filter.unrelated.{bed,bim,fam} # PsychENCODE genotype data
├── 1kg
│ └── kgp.EUR.maf0.05.geno.{bed,bim,fam} # 1000 Genomes data subsetted to European individuals
└── gwas # GWAS summary statistics
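Because missing inputs would otherwise surface only once an analysis script runs, it can help to check the layout up front. The following is a minimal sketch: the helper `missing_inputs` is hypothetical and not part of the repository, the file names are copied from the tree above, and the gwas directory is left out because its contents are not listed.

```julia
# Hypothetical helper (not part of the repository): return the required input
# files from the tree above that are absent under `root`.
function missing_inputs(root::AbstractString = "data")
    # A PLINK binary fileset consists of three files sharing one prefix.
    plink(prefix) = [prefix * ext for ext in (".bed", ".bim", ".fam")]
    required = vcat(
        joinpath.("expression", [
            "PsychENCODE-EUR-gene.BED.gz",
            "PsychENCODE-EUR-isoform.BED.gz",
            "PsychENCODE-EUR-covariates.tsv",
        ]),
        joinpath.("genotype",
            plink("Capstone4.HRC.European.unique.frontal.nochr.filter.unrelated")),
        joinpath.("1kg", plink("kgp.EUR.maf0.05.geno")),
    )
    return [p for p in required if !isfile(joinpath(root, p))]
end

absent = missing_inputs("data")
isempty(absent) || @warn "Missing required inputs" absent
```

The function only reports what is missing; it does not attempt to download anything, since these files are not publicly accessible.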
A couple of notes:
- PsychENCODE-EUR-gene.BED.gz and PsychENCODE-EUR-isoform.BED.gz: RNA-seq reads were previously aligned to the hg19 reference genome with STAR 2.4.2a, and gene and isoform-level quantifications were calculated using RSEM v1.2.29. Genes and isoforms were filtered to include those with TPM > 0.1 in at least 25% of samples. Gene and isoform expression were separately normalized using TMM normalization in edgeR and log2-transformed. RNA-seq data were also restricted to frontal cortex samples from European individuals and to genes and isoforms on autosomal chromosomes, resulting in a total of 24,905 genes and 93,293 isoforms based on GENCODE v19 annotation.
- PsychENCODE-EUR-covariates.tsv: The same set of known biological and technical covariates was used for mean (or fixed) effects: age, age², study, sex, diagnosis, RNA integrity number (RIN), RIN², post-mortem interval (PMI), 24 sequencing principal components (PCs), and 5 genetic PCs.
- Capstone4.HRC.European.unique.frontal.nochr.filter.unrelated.{bed,bim,fam}: Genotype data were previously harmonized through phasing and imputation with the Haplotype Reference Consortium (HRC) reference panel. We focused on 860 unique European individuals with matching genotype and frontal cortex RNA-seq data. We started with 5,312,508 HRC imputed SNPs and filtered for SNPs with minor allele frequency (MAF) > 0.01, genotype and individual missingness rate < 0.05, and Hardy-Weinberg equilibrium (HWE) P values > 10⁻⁶. Five pairs of individuals had classic genetic relationship matrix (GRM) values > 0.05 when using all filtered SNPs, while 647 pairs of individuals had GRM values > 0.025. We kept one individual from each of the five pairs and only SNPs on autosomal chromosomes, resulting in a total of 855 unrelated European individuals and 4,685,674 SNPs for downstream analyses.
- kgp.EUR.maf0.05.geno.{bed,bim,fam}: A linkage disequilibrium (LD) reference panel based on individuals of European ancestry in the 1000 Genomes Project was generated using https://github.com/mmkim1210/1kg. These data were used only for visualization, not for the actual variance components analysis.
- gwas: Multiple GWAS summary statistics were downloaded and harmonized using the mungesumstats! function of the GeneticsMakie.jl Julia package. These data were not required for variance components analysis.
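The relatedness filter described for the genotype data can be sketched as follows. This is a minimal illustration that assumes the standard "classic" GRM definition (dosages standardized by allele frequency, then averaged over SNPs) and a complete dosage matrix with no missing values. The function names are hypothetical, and the repository may compute the GRM differently.

```julia
using Statistics

# G: n × m matrix of allele dosages (0, 1, 2), individuals in rows.
# Assumes no missing values and no SNP with allele frequency 0 or 1.
function classic_grm(G::AbstractMatrix{<:Real})
    m = size(G, 2)
    p = vec(mean(G, dims = 1)) ./ 2                      # allele frequency per SNP
    Z = (G .- 2 .* p') ./ sqrt.(2 .* p' .* (1 .- p'))    # standardized dosages
    return Z * Z' ./ m
end

# Pairs (i, j) with i < j whose GRM value exceeds `threshold`.
function related_pairs(K::AbstractMatrix{<:Real}, threshold::Real)
    n = size(K, 1)
    return [(i, j) for i in 1:n for j in (i + 1):n if K[i, j] > threshold]
end

G = [0 1 2; 0 1 2; 2 1 0; 2 1 0]    # toy dosages: 4 individuals × 3 SNPs
K = classic_grm(G)
flagged = related_pairs(K, 0.05)
# Drop the second member of every flagged pair (an arbitrary choice here).
keep = setdiff(1:size(G, 1), last.(flagged))
```

Which member of a related pair is retained is not stated above, so dropping the second index is purely illustrative.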
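MultiResponseVarianceComponentModels.jl is a Julia package for fitting and testing multivariate response variance components linear mixed models of the form

$$\text{vec}(Y) \sim \mathcal{N}\left(\text{vec}(XB), \sum_{i=1}^m \Gamma_i \otimes V_i\right),$$

where $Y$ and $X$ are $n \times d$ response and $n \times p$ predictor matrices, respectively, and $V_1, \ldots, V_m$ are $m$ known $n \times n$ positive semidefinite matrices. The parameters of the model are the $p \times d$ mean effects $B$ and the $d \times d$ variance components $\Gamma_1, \ldots, \Gamma_m$.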
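Herein, we initially modelled human brain gene and isoform-level expression using univariate variance components linear mixed models, with the covariates in PsychENCODE-EUR-covariates.tsv as mean effects and three variance components, two of which capture cis- and trans-SNP genetic effects. In other words,

$$y \sim \mathcal{N}\left(X\beta, \sigma^2_{\text{cis}} V_{\text{cis}} + \sigma^2_{\text{trans}} V_{\text{trans}} + \sigma^2_e I_n\right),$$

where $y$ is the expression of a gene or isoform across $n$ individuals, $V_{\text{cis}}$ and $V_{\text{trans}}$ are genetic relationship matrices computed from cis- and trans-SNPs, respectively, and $\sigma^2_e$ is the residual variance.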
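Given estimates of the three variance components, the proportion of expression variance attributed to each genetic component follows directly. The sketch below is illustrative only: the function name is hypothetical, and it assumes the cis and trans covariance matrices are scaled so that their average diagonal is about one, which is not stated above.

```julia
# Hypothetical helper: convert estimated variance components into
# cis, trans, and total SNP-based heritability.
# Assumes the genetic covariance matrices have an average diagonal of about 1.
function heritability(σ2_cis::Real, σ2_trans::Real, σ2_e::Real)
    total = σ2_cis + σ2_trans + σ2_e
    return (cis = σ2_cis / total,
            trans = σ2_trans / total,
            total = (σ2_cis + σ2_trans) / total)
end

h2 = heritability(0.2, 0.3, 0.5)   # made-up estimates for illustration
```

Returning a named tuple keeps the cis and trans contributions separate while still exposing their sum.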
We also fitted multivariate variance components linear mixed models for isoform-level expression. We similarly specified three variance components, two of which capture cis- and trans-SNP genetic effects, and used the same set of cis-SNPs as in the univariate models. To reduce the computational burden and the number of variance component parameters to be estimated, given the limited sample size of the PsychENCODE dataset, we ran the multivariate model only for isoforms with significant heritability estimates in the univariate model (P value < 0.05). This meant modeling up to 23 isoforms jointly. The analyses can be run by submitting ./submit.sh as follows.
The analyses will require at least ./submit.sh
should have been more than sufficient.
qsub ./submit.sh
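To illustrate why restricting the multivariate model to a subset of isoforms matters: each variance component of a model with d responses is a symmetric d × d matrix with d(d + 1)/2 free parameters. The minimal sketch below, with hypothetical function names that belong to neither the repository nor the package, counts these parameters and converts a variance component matrix into the correlations it implies.

```julia
using LinearAlgebra

# Free parameters in m symmetric d × d variance component matrices.
n_vc_parameters(d::Integer, m::Integer = 3) = m * d * (d + 1) ÷ 2

# Correlation matrix implied by a d × d variance component matrix Γ
# (assumes strictly positive diagonal entries).
function genetic_correlation(Γ::AbstractMatrix{<:Real})
    s = sqrt.(diag(Γ))
    return Γ ./ (s * s')
end

n_vc_parameters(1)     # univariate model with three components: 3
n_vc_parameters(23)    # 23 isoforms with three components: 828
```

Applied to the cis or trans variance component matrix, the off-diagonal entries of `genetic_correlation` give pairwise genetic correlations between isoforms.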
Once the results are parsed with ./src/parse.jl, Figures 1-5 in Kim et al. 2022 can be reproduced by running ./src/fig{1,2,3,4,5}.jl, respectively. Minor edits can then be made in Illustrator to finalize the figures as follows.
Figure 1:
Figure 2:
Figure 3:
Figure 4:
Figure 5:
- M. Kim, D.D. Vo, C.T. Jops, C. Wen, A. Patowary, A. Bhattacharya, C.X. Yap, H. Zhou, and M.J. Gandal: Multivariate variance components analysis uncovers genetic architecture of brain isoform expression and novel psychiatric disease mechanisms (2022)