w10-genenetworks-mouseliver.Rmd

---
title: "Gene Networks-mouse liver"
author: "ks"
date: "7/22/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

#  Compilation of Tutorials for WGCNA package in R from the following website

http://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html

Some omissions were made, but mostly this is taken verbatim from the tutorials.

##1.  Data input and Cleaning

Libraries we will need to follow tutorials.
```{r install}
if(!require('impute')) {BiocManager::install("impute")}
if(!require('preprocessCore'))   {BiocManager::install("preprocessCore")}
#if(!require('GO.db')) {BiocManager::install("GO.db")}
if(!require('cluster')) {BiocManager::install("cluster")}
if(!require('Hmisc')) {BiocManager::install("Hmisc")}

if(!require(WGCNA)) {BiocManager::install("WGCNA")}
if(!require(flashClust)) {install.packages("flashClust")}
#install.packages(c("dynamicTreeCut", "reshape", "foreach", "doParallel") )
```

```{r loadlibs}
library(cluster)
library(Hmisc)	# install it for the C-index 
library(MASS)  # standard, no need to install
library(class)  # standard, no need to install
library(WGCNA)
library(flashClust)
```

If a *stdio.h error* is returned, try typing the following in a command window and then try above code again:
xcode-select --install

This is the first step of any network analysis. We show here how to load typical expression data, pre-process them
into a format suitable for network analysis, and clean the data by removing obvious outlier samples as well as genes
and samples with excessive numbers of missing entries.

### 1a. Loading the expression data:

The expression data is contained in the file LiverFemale3600.csv that comes with this tutorial. After starting an
R session and loading the requisite packages shown above, we read in the data:
```{r readData}
dataDir <- c("pm579data/WGCNA");
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
#Read in the female liver data set
femData <- read.csv(file.path(dataDir,"LiverFemale3600.csv"));
# Take a quick look at what is in the data set:
dim(femData);
names(femData);
```
In addition to expression data, the data files contain extra information about the surveyed probes we do not need.
One can inspect larger data frames such as femData by invoking R data editor via fix(femData). The expression data
set contains 135 samples. Note that each row corresponds to a gene and column to a sample or auxiliary information.
We now remove the auxiliary data and transpose the expression data for further analysis.
```{r pheno}
datExpr0 <- as.data.frame(t(femData[, -c(1:8)]));
names(datExpr0) <- femData$substanceBXH;
rownames(datExpr0) <- names(femData)[-c(1:8)];
```

### 1b. Check for excessive missing values and outlier samples

We first check for genes and samples with too many missing values.

```{r outliers}
gsg <- goodSamplesGenes(datExpr0, verbose = 3);
names(gsg)
gsg$allOK
```

If the last statement returns TRUE, all genes have passed the cuts. If not, we remove the offending genes and samples
from the data:
```{r filt}
if (!gsg$allOK)
{
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
# Remove the offending genes and samples from the data:
datExpr0 <- datExpr0[gsg$goodSamples, gsg$goodGenes]
}
```

Next we cluster the samples (in contrast to clustering genes that will come later) to see if there are any obvious
outliers.
```{r cluster}
sampleTree <- flashClust(dist(datExpr0), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
#sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)

# Plot a line to show the cut
abline(h = 15, col = "red");
# Determine cluster under the line
clust <- cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
```

It appears there is one outlier (sample F2_221, see Figure above). One can remove it by hand, or use an automatic approach.
Choose a height cut that will remove the offending sample, say 15 (the red line in the plot), and use a branch cut at
that height.

```{r tc}
table(clust)
# clust 1 contains the samples we want to keep.
keepSamples <- (clust==1)
datExpr <- datExpr0[keepSamples, ]
nGenes <- ncol(datExpr)
nSamples <- nrow(datExpr)
```

The variable datExpr now contains the expression data ready for network analysis.

###1c. Loading clinical trait data

We now read in the trait data and match the observations (mice in this data set) to the expression data.
```{r traits}
dataDir <- c("pm579data/WGCNA");
traitData <- read.csv(file.path(dataDir,"ClinicalTraits.csv"));
dim(traitData)
names(traitData)
# remove columns that hold information we do not need.
allTraits <- traitData[, -c(16, 31)];
allTraits <- allTraits[, c(2, 11:36) ];
dim(allTraits)
names(allTraits)
```

Form a data frame analogous to expression data that will hold the clinical traits.
```{r df}
femaleSamples <- rownames(datExpr);
traitRows <- match(femaleSamples, allTraits$Mice);
datTraits <- allTraits[traitRows, -1];
rownames(datTraits) <- allTraits[traitRows, 1];
collectGarbage();
```

We now have the expression data in the variable datExpr, and the corresponding clinical traits in the variable
datTraits. Before we continue with network construction and module detection, we visualize how the clinical traits
relate to the sample dendrogram.

```{r sampleT}
sampleTree2 <- flashClust(dist(datExpr), method = "average")
# Convert traits to a color representation: white means low, red means high, grey means missing entry
traitColors <- numbers2colors(datTraits, signed = FALSE);
# Plot the sample dendrogram and the colors underneath.
plotDendroAndColors(sampleTree2, traitColors,
              groupLabels = names(datTraits),
              main = "Sample dendrogram and trait heatmap")
```

In the plot above, white means a low value, red a high value, and grey a missing entry.
The last step is to save the relevant expression and trait data for use in the next steps of the tutorial
```{r savedatExpr}
save(datExpr, datTraits, file = "FemaleLiver-01-dataInput.RData")
```

# 2. Network Construction and Module Detection

##2a. Automatic construction of gene network and module identification in female mice 

First, we allow multi-threading within WGCNA if your computer has this ability. This helps speed up certain calculations. At present this call is not necessary for the code to work. Any error here may be ignored but you may want to update WGCNA if you see one.

```{r multithread}
allowWGCNAThreads();
```

```{r loadData}
# Load the data saved in the first part
lnames <- load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
```

### 2a.1 Choosing the soft-thresholding power: analysis of network topology
Constructing a weighted gene network entails the choice of the soft 
thresholding power β to which co-expression
similarity is raised to calculate adjacency [1]. The authors of [1] 
have proposed to choose the soft thresholding power
based on the criterion of approximate scale-free topology. We refer the 
reader to that work for more details; here
we illustrate the use of the function pickSoftThreshold that performs 
the analysis of network topology and aids the
user in choosing a proper soft-thresholding power. The user chooses a 
set of candidate powers (the function provides
suitable default values), and the function returns a set of network 
indices that should be inspected, for example as follows:
```{r connect}
# Choose a set of soft-thresholding powers
powers <- c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function
sft <- pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
#sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
    xlab="Soft Threshold (power)",
    ylab="Scale Free Topology Model Fit,signed R^2",type="n",
    main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
    labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
      xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
      main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
```

The result is shown in the Fig. above. We choose the power 6, which is the lowest power for which the scale-free topology fit
index curve flattens out upon reaching a high value (in this case, roughly 0.90).

### 2a.2 One-step network construction and module detection
Constructing the gene network and identifying modules is now a simple function call:
```{r blockM}
net <- blockwiseModules(datExpr, power = 6, minModuleSize = 30,
  reassignThreshold = 0, mergeCutHeight = 0.25,
  numericLabels = TRUE, pamRespectsDendro = FALSE,
  saveTOMs = TRUE,
  saveTOMFileBase = "femaleMouseTOM",
  verbose = 3)
```

We have chosen the soft thresholding power 6, a relatively large 
minimum module size of 30, and a medium sensitivity
(deepSplit=2) to cluster splitting. The parameter mergeCutHeight 
is the threshold for merging of modules. We have
also instructed the function to return numeric, rather than color, 
labels for modules, and to save the Topological
Overlap Matrix. The output of the function may seem somewhat cryptic, 
but it is easy to use. For example,
net\$colors contains the module assignment, and net\$MEs contains 
the module eigengenes of the modules.

Read the help files for how to modify defaults for different data sets. 
Read the .pdf on the website for words of caution with data sets 
of size >5000 features.

We now return to the network analysis. To see how many modules 
were identified and what the module sizes are, one can use 
table(net$colors). Its output is

```{r NumMod}
## Number of modules (18) and sizes (# genes)
table(net$colors)
```

and indicates that there are 18 modules, labeled 1 through 18 in order of descending size, with sizes ranging from
609 to 34 genes. The label 0 is reserved for genes outside of all modules.
The hierarchical clustering dendrogram (tree) used for the module identification is returned in net\$dendrograms[[1]].
The dendrogram can be displayed together with the color assignment using the following code:

```{r plotTree}
# open a graphics window
#sizeGrWindow(12, 9)
# Convert labels to colors for plotting
mergedColors <- labels2colors(net$colors)
# Plot the dendrogram and the module colors underneath
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
   "Module colors",
   dendroLabels = FALSE, hang = 0.03,
   addGuide = TRUE, guideHang = 0.05)
```

Save module assignment and module eigengene information necessary for subsequent analysis.
```{r dataSaving}
moduleLabels <- net$colors
moduleColors <- labels2colors(net$colors)
MEs <- net$MEs;
geneTree <- net$dendrograms[[1]];

save(MEs, moduleLabels, moduleColors, geneTree,
      file = "FemaleLiver-02-networkConstruction-auto.RData")
```

## 2b. Network construction (step-by-step) and module detection

 see: w10code2b-genenetworks-mouseliver.Rmd

## 2c. When data set too big to analyze at 1 time

 (skipped)

#  3. Relating Modules to external clinical traits and identifying important genes 

Re-load data saved from earlier network analysis.
```{r reloaddata}
# Load the expression and trait data saved in the first part
lnames <- load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames <- load(file = "FemaleLiver-02-networkConstruction-auto.RData");
lnames
```

##3a. Quantifying module–trait associations

In this analysis we would like to identify modules that are significantly associated with the measured clinical traits.
Since we already have a summary profile (eigengene) for each module, we simply correlate eigengenes with external
traits and look for the most significant associations:
```{r eigenvec}
# Define numbers of genes and samples
nGenes <- ncol(datExpr);
nSamples <- nrow(datExpr);
# Recalculate MEs with color labels
MEs0 <- moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs <- orderMEs(MEs0)
moduleTraitCor <- cor(MEs, datTraits, use = "p");
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nSamples);
```

Since we have a moderately large number of modules and traits, a suitable graphical representation will help in
reading the table. We color code each association by the correlation value:

```{r heatmap}
# sizeGrWindow(10,6)
# Will display correlations and their p-values
textMatrix <- paste(signif(moduleTraitCor, 2), "\n(",
                signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) <- dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
    xLabels = names(datTraits),
    yLabels = names(MEs),
    ySymbols = names(MEs),
    colorLabels = FALSE,
    colors = greenWhiteRed(50),
    textMatrix = textMatrix,
    setStdMargins = FALSE,
    cex.text = 0.5,
    zlim = c(-1,1),
    main = paste("Module-trait relationships"))
```

The analysis identifies the several significant module–trait associations. We will concentrate on weight as the trait
of interest.

##3b. Gene relationship to trait and important modules: Gene Significance and Module Membership

We quantify associations of individual genes with our trait of interest (weight) by defining Gene Significance GS as
(the absolute value of) the correlation between the gene and the trait. For each module, we also define a quantitative
measure of module membership MM as the correlation of the module eigengene and the gene expression profile. This
allows us to quantify the similarity of all genes on the array to every module.

```{r weight}
# Define variable weight containing the weight column of datTrait
weight <- as.data.frame(datTraits$weight_g);
names(weight) <- "weight"
# names (colors) of the modules
modNames <- substring(names(MEs), 3)
geneModuleMembership <- as.data.frame(cor(datExpr, MEs, use = "p"));
MMPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
names(geneModuleMembership) <- paste("MM", modNames, sep="");
names(MMPvalue) <- paste("p.MM", modNames, sep="");
geneTraitSignificance <- as.data.frame(cor(datExpr, weight, use = "p"));
GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) <- paste("GS.", names(weight), sep="");
names(GSPvalue) <- paste("p.GS.", names(weight), sep="");
```

##3c. Intramodular analysis: identifying genes with high GS and MM

Using the GS and MM measures, we can identify genes that have a 
high significance for weight as well as high module
membership in interesting modules. As an example, we look at 
the brown module that has the highest association
with weight. We plot a scatterplot of Gene Significance vs. 
Module Membership in the brown module:

```{r module}
module <- "brown"
column <- match(module, modNames);
moduleGenes <- moduleColors==module;
#sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
 abs(geneTraitSignificance[moduleGenes, 1]),
  xlab = paste("Module Membership in", module, "module"),
  ylab = "Gene significance for body weight",
  main = paste("Module membership vs. gene significance\n"),
  cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
```

Clearly, GS and MM are highly correlated, illustrating that genes highly
significantly associated with a trait are often also the most important 
(central) elements of modules associated with the trait.
The reader is encouraged to try this code with other significance 
trait/module correlation (for example, the magenta,
midnightblue, and red modules with weight).

##3d. Summary output of network analysis results

We have found modules with high association with our trait of interest, and have identified their central players by
the Module Membership measure. We can access the features names and use to merge with other annotation information (not shown)
```{r feat}
#length(names(datExpr))
#names(datExpr)[moduleColors=="brown"]
```

# 4. GO analysis

(skipped)

# 5. Visualizing the Network

##5a. Visualizing the gene network

One way to visualize a weighted network is to plot its heatmap. Each row and
column of the heatmap correspond to a single gene. The heatmap can depict
adjacencies or topological overlaps, with light colors denoting
low adjacency (overlap) and darker colors higher adjacency (overlap). 
In addition, the gene dendrograms and module
colors are plotted along the top and left side of the heatmap. 
The package provides a convenient function to create
such network plots. This TOM plot too big for quick viewing, try on 
smaller number of genes.

```{r TOMplot}
nSelect <- 400
# For reproducibility, we set the random seed
set.seed(10);
select <- sample(nGenes, size = nSelect);
# Calculate topological overlap anew: this could be done more efficiently by saving the TOM
# calculated during module detection, but let us do it again here.
dissTOM <- 1-TOMsimilarityFromExpr(datExpr, power = 6);
selectTOM <- dissTOM[select, select];
library(flashClust)
selectTree <- flashClust(as.dist(selectTOM), method = "average")
selectColors <- moduleColors[select];
# Open a graphical window
#sizeGrWindow(9,9)
# Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss <- selectTOM^7;
diag(plotDiss) <- NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
```

##5b. Visualizing the network of eigengenes

It is often interesting to study the relationships among the found modules. One can use the eigengenes as representative
profiles and quantify module similarity by eigengene correlation. The package contains a convenient function
plotEigengeneNetworks that generates a summary plot of the eigengene network. It is usually informative to add a
clinical trait (or multiple traits) to the eigengenes to see how the traits fit into the eigengene network:
```{r ploteigengene}
# plots of eigengenes
# Recalculate module eigengenes
MEs <- moduleEigengenes(datExpr, moduleColors)$eigengenes
# Isolate weight from the clinical traits
weight <- as.data.frame(datTraits$weight_g);
names(weight) = "weight"
# Add the weight to existing module eigengenes
MET <- orderMEs(cbind(MEs, weight))
# Plot the relationships among the eigengenes and the trait
#sizeGrWindow(5,7.5);
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2),      cex.lab = 0.8, xLabelsAngle = 90)
```

The function produces a dendrogram of the eigengenes and trait(s), and a heatmap of their relationships. 

```{r sI}
sessionInfo()
```