03_advanced_GWAS.Rmd

# 3. Advanced GWAS Models

This tutorial extends what was previously introduced in the `02_GWAS.Rmd` tutorial to more advanced models using the [GENESIS](https://bioconductor.org/packages/release/bioc/html/GENESIS.html) R/Bioconductor package.

## Sparse Kinship Matrix

Recall that fitting the null model uses a kinship matrix (KM) that captures the genetic correlation among samples. In the `02_GWAS.Rmd` tutorial, we used a dense KM that has non-zero values for all entries in the matrix. This works well with small sample sizes (like we have here), but can require a lot of memory and be very computationally demanding with large samples. 

In large population based samples, we can make an empirical KM sparse by zeroing out small values that are near 0. When creating a PC-Relate KM with `pcrelateToMatrix` this can be done by setting the `thresh` parameter equal to the smallest non-zero value to keep in the matrix. Alternatively, we can use the `makeSparseMatrix` function from the GENESIS package to sparsify an existing matrix into a sparse block-diagonal matrix. 

```{r, sparse}
repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
if (!dir.exists("data")) dir.create("data")

library(GENESIS)

# load the PC-Relate KM
kmfile <- "data/pcrelate_Matrix.RData"
if (!file.exists(kmfile)) download.file(file.path(repo_path, kmfile), kmfile)
km <- get(load(kmfile))
dim(km)
km[1:5,1:5]

# make the KM sparse at 4th degree relatedness
skm <- makeSparseMatrix(km, thresh = 2*2^(-11/2))
dim(skm)
skm[1:5,1:5]
```
This sparse KM can be used in the null model in place of the original dense KM that we used in the `02_GWAS.Rmd` tutorial -- exercise left to the reader.


## Two Stage Model

As discussed in the lecture, we recommend a fully adjusted two-stage inverse Normalization procedure for fitting the null model for quantitative traits, particularly when the outcome has a non-Normal distribution. See [Sofer et al. (2019)](https://onlinelibrary.wiley.com/doi/10.1002/gepi.22188) for more information on the fully adjusted two-stage model.

### Phenotype Data

First we load the `AnnotatedDataFrame` with the phenotype data that we prepared previously.

```{r, message = FALSE}
library(Biobase)
```

```{r}
# pheno data
annotfile <- "data/pheno_annotated_pcs.RData"
if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), annotfile)
annot <- get(load(annotfile))
head(pData(annot))
```

For this section of the tutorial, we will be analyzing `trait_2`. Make a histogram of the trait_2 values -- what do you notice about the distribution?

```{r}
library(ggplot2)
ggplot(pData(annot), aes(x = trait_2)) + geom_histogram()
```

### Standard Analysis

First, let's run the GWAS using the standard LMM. Recall that we first fit the null model and then perform single variant score tests.

#### Null Model

We use the `fitNullModel` function from GENESIS. We need to specify the `AnnotatedDataFrame` with the phenotype data (annot), the outcome variable (trait_2), and the fixed effect covariates (sex, age, and PCs 1-7). We also include the kinship matrix to specify the covariance structure of the polygenic random effect in the model.

```{r}
nullmod <- fitNullModel(annot, 
                        outcome = "trait_2", 
                        covars=c("sex", "age", paste0("PC", c(1:7))), 
                        cov.mat=km, 
                        verbose=TRUE)
```

#### Association Tests

After fitting the null model, we use single-variant score tests to test each variant across the genome separately for association with the outcome, accounting for genetic ancestry and genetic relatedness among the samples. Recall that we make a `SeqVarData` iterator object that links the genotype data in the GDS file with the phenotype data in our `AnnotatedDataFrame` and use the `assocTestSingle` function from GENESIS to perform the association tests.

```{r, message = FALSE}
library(SeqVarTools)

# open a connection to the GDS file
gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"
if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)

# make the seqVarData object
seqData <- SeqVarData(gds, sampleData=annot)

# make the iterator object
iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)

# run the single-variant association test
assoc <- assocTestSingle(iterator, 
                         null.model = nullmod,
                         test = "Score")
dim(assoc)
```

We make a QQ plot of all variants with MAC >= 5. What do you notice about this QQ plot?

```{r}
qqPlot <- function(pval) {
    pval <- pval[!is.na(pval)]
    n <- length(pval)
    x <- 1:n
    dat <- data.frame(obs=sort(pval),
                      exp=x/n,
                      upper=qbeta(0.025, x, rev(x)),
                      lower=qbeta(0.975, x, rev(x)))

    ggplot(dat, aes(-log10(exp), -log10(obs))) +
        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
        geom_point() +
        geom_abline(intercept=0, slope=1, color="red") +
        xlab(expression(paste(-log[10], "(expected P)"))) +
        ylab(expression(paste(-log[10], "(observed P)"))) +
        theme_bw()
}

# make a QQ plot
qqPlot(assoc$Score.pval[assoc$MAC >= 5])
```

### Fully-Adjusted Two-Stage Model Analysis

Now, let's run the GWAS using the fully-adjusted two-stage LMM. 

#### Null Model

To run the fully-adjusted two-stage null model, we simply set the `two.stage` option to `TRUE`. 

```{r}
# fit the two stage null model
nullmod.twostage <- fitNullModel(annot, 
                                 outcome = "trait_2", 
                                 covars=c("sex", "age", paste0("PC", c(1:7))), 
                                 cov.mat=km, 
                                 two.stage = TRUE,
                                 verbose=TRUE)
```

Notice that the messages from the function show that the model was fit twice -- first with the original outcome and second with the inverse-Normal transformed residuals.

```{r}
# description of the model we fit
nullmod.twostage$model
```

From the model description, we can see this is a two stage model because the formula element has `rankInvNorm(resid(trait_2))` as the outcome variable.

#### Association Tests

After fitting the two-stage null model, the association testing procedure is exactly the same. Since we've already created our `SeqVarData` iterator, we do not need to create it again; however, we do need to "reset" it to the first block of data.

```{r, message = FALSE}
# reset the filter to the first block
seqResetFilter(iterator, verbose = FALSE)

# run the single-variant association test
assoc.twostage <- assocTestSingle(iterator, 
                                  null.model = nullmod.twostage,
                                  test = "Score")
dim(assoc.twostage)
```

We make a QQ plot of all variants with MAC >= 5. What do you notice about this QQ plot compared to the one from the standard null model? 

```{r}
# make a QQ plot
qqPlot(assoc.twostage$Score.pval[assoc.twostage$MAC >= 5])
```

From the QQ plot, we see that using the fully-adjusted two-stage model substantially decreased the amount of inflation in the test statistics. We know that this is because of the non-Normality / skewness of trait_2, but we can compare the marginal residuals from both null models to see how their distributions look after covariate adjustment

```{r null_model_two_stage}
# merge the data for plotting
pdat <- merge(nullmod$fit, nullmod.twostage$fit,
              by = 'sample.id', suffixes = c('.orig', '.twostage'))
head(pdat, 2)

# distribution of residuals - original null model
ggplot(pdat, aes(x = resid.marginal.orig)) + geom_histogram()

# distribution of residuals - two stage null model
ggplot(pdat, aes(x = resid.marginal.twostage)) + geom_histogram()

# compare residuals
ggplot(pdat, aes(x = resid.marginal.orig, y = resid.marginal.twostage)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1)

```

As expected, the residuals from the original model are very skewed, while the residuals from the two-stage model are much closer to Normally distributed. The skewness of the residuals in the original model leads to inflation in variant test statistics (as seen in the QQ plots), particularly for rare or low frequency variants, where effect allele carriers with extreme residuals in the tails of the distribution can have high leverage on the score statistics. 


## Binary Traits

GENESIS also supports testing binary (e.g. case/control) outcomes. 

### Phenotype Data

The outcome `status` in the annotated phenotype data is a binary case/control variable. Look at the prevalence of this outcome

```{r}
table(pData(annot)$status)
```

### Logistic Mixed Model Analysis

GENESIS implements the GMMAT method of approximate logistic mixed models for association testing with binary outcomes. See [Chen et al. (2016)](https://www.sciencedirect.com/science/article/pii/S000292971600063X) for more information on the GMMAT method.

#### Null Model

We can fit a null model using a logistic mixed model by specifying the argument `family=binomial` in the `fitNullModel` function. As before, we need to specify the `AnnotatedDataFrame` with the phenotype data (annot), the outcome variable (status), and the fixed effect covariates (sex, age, and PCs 1-7). We still include the kinship matrix to specify the covariance structure of the polygenic random effect in the model.

```{r}
# fit the null model with logistic mixed model
nullmod.status <- fitNullModel(annot,
                               outcome="status",
                               covars=c("sex", "age", paste0("PC", 1:7)),
                               cov.mat = km,
                               family=binomial,
                               verbose=TRUE)
```

#### Association Tests

After fitting the logistic null model, the association testing procedure is still exactly the same. Since we've already created our `SeqVarData` iterator, we do not need to create it again; however, we do need to "reset" it to the first block of data.

```{r, message = FALSE}
# reset the filter to the first block
seqResetFilter(iterator, verbose = FALSE)

# run the single-variant association test
assoc.status <- assocTestSingle(iterator, 
                                null.model = nullmod.status,
                                test = "Score")
dim(assoc.status)
```

As usual, we make a QQ plot of all variants with MAC >= 5. What do you notice about the tail of this QQ plot? 

```{r}
# make a QQ plot
qqPlot(assoc.status$Score.pval[assoc.status$MAC >= 5])
```

#### SPA test

In samples with highly imbalanced case:control ratios, the Score test can be inflated for rare and low frequency variants. Saddlepoint approximation (SPA) can be used to improve $p$-value calculations, and is available in GENESIS by setting the argument `test=Score.SPA` in `assocTestSingle`. See [Dey et al. (2017)](https://www.cell.com/ajhg/fulltext/S0002-9297(17)30201-X) and [Zhou et al. (2018)](https://www.nature.com/articles/s41588-018-0184-y) for details on using SPA in GWAS. Re-run the analysis using the SPA $p$-value calculation.

```{r, message = FALSE}
# reset the filter to the first block
seqResetFilter(iterator, verbose = FALSE)

# run the single-variant association test
assoc.spa <- assocTestSingle(iterator, 
                             null.model = nullmod.status,
                             test = "Score.SPA")
dim(assoc.spa)
head(assoc.spa)
```
Notice that the results of this test include two new columns: `SPA.pval`, which are the $p$-values after the SPA adjustment, and `SPA.converged` which indicates if the SPA adjustment was successful (`TRUE/FALSE`); if the value is `NA` then the SPA adjustment was not applied and the original score test $p$-value is returned -- for computational efficiency, SPA is only applied when the original score test $p$-value $< 0.05$. Note that the `Score`, `Score.SE`, and `Score.Stat` are exactly the same for all variants as when using the standard score test, as the SPA adjustment only alters the $p$-value.

```{r}
table(assoc.spa$SPA.converged, exclude = NULL)
```

Make a new QQ plot of all variants with MAC >= 5. How does using the SPA $p$-value adjustment affect the results? 

```{r}
# make a QQ plot
qqPlot(assoc.spa$SPA.pval[assoc.spa$MAC >= 5])
```

Using the SPA $p$-value adjustment has resolved the test statistic inflation!


```{r}
# close the GDS file!
seqClose(seqData)
```