forked from ksiegmund/PM579
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathw10-genenetworks-mouseliver.Rmd
487 lines (414 loc) · 19.3 KB
/
w10-genenetworks-mouseliver.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
---
title: "Gene Networks-mouse liver"
author: "ks"
date: "7/22/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Compilation of Tutorials for WGCNA package in R from the following website
http://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html
Some omissions were made, but mostly this is taken verbatim from the tutorials.
##1. Data input and Cleaning
Libraries we will need to follow tutorials.
```{r install}
if(!require('impute')) {BiocManager::install("impute")}
if(!require('preprocessCore')) {BiocManager::install("preprocessCore")}
#if(!require('GO.db')) {BiocManager::install("GO.db")}
if(!require('cluster')) {BiocManager::install("cluster")}
if(!require('Hmisc')) {BiocManager::install("Hmisc")}
if(!require(WGCNA)) {BiocManager::install("WGCNA")}
if(!require(flashClust)) {install.packages("flashClust")}
#install.packages(c("dynamicTreeCut", "reshape", "foreach", "doParallel") )
```
```{r loadlibs}
library(cluster)
library(Hmisc) # install it for the C-index
library(MASS) # standard, no need to install
library(class) # standard, no need to install
library(WGCNA)
library(flashClust)
```
If a *stdio.h error* is returned, try typing the following in a command window and then try above code again:
xcode-select --install
This is the first step of any network analysis. We show here how to load typical expression data, pre-process them
into a format suitable for network analysis, and clean the data by removing obvious outlier samples as well as genes
and samples with excessive numbers of missing entries.
### 1a. Loading the expression data:
The expression data is contained in the file LiverFemale3600.csv that comes with this tutorial. After starting an
R session and loading the requisite packages shown above, we read in the data:
```{r readData}
dataDir <- c("pm579data/WGCNA");
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
#Read in the female liver data set
femData <- read.csv(file.path(dataDir,"LiverFemale3600.csv"));
# Take a quick look at what is in the data set:
dim(femData);
names(femData);
```
In addition to expression data, the data files contain extra information about the surveyed probes we do not need.
One can inspect larger data frames such as femData by invoking R data editor via fix(femData). The expression data
set contains 135 samples. Note that each row corresponds to a gene and column to a sample or auxiliary information.
We now remove the auxiliary data and transpose the expression data for further analysis.
```{r pheno}
datExpr0 <- as.data.frame(t(femData[, -c(1:8)]));
names(datExpr0) <- femData$substanceBXH;
rownames(datExpr0) <- names(femData)[-c(1:8)];
```
### 1b. Check for excessive missing values and outlier samples
We first check for genes and samples with too many missing values.
```{r outliers}
gsg <- goodSamplesGenes(datExpr0, verbose = 3);
names(gsg)
gsg$allOK
```
If the last statement returns TRUE, all genes have passed the cuts. If not, we remove the offending genes and samples
from the data:
```{r filt}
if (!gsg$allOK)
{
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
# Remove the offending genes and samples from the data:
datExpr0 <- datExpr0[gsg$goodSamples, gsg$goodGenes]
}
```
Next we cluster the samples (in contrast to clustering genes that will come later) to see if there are any obvious
outliers.
```{r cluster}
sampleTree <- flashClust(dist(datExpr0), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
#sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)
# Plot a line to show the cut
abline(h = 15, col = "red");
# Determine cluster under the line
clust <- cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
```
It appears there is one outlier (sample F2_221, see Figure above). One can remove it by hand, or use an automatic approach.
Choose a height cut that will remove the offending sample, say 15 (the red line in the plot), and use a branch cut at
that height.
```{r tc}
table(clust)
# clust 1 contains the samples we want to keep.
keepSamples <- (clust==1)
datExpr <- datExpr0[keepSamples, ]
nGenes <- ncol(datExpr)
nSamples <- nrow(datExpr)
```
The variable datExpr now contains the expression data ready for network analysis.
###1c. Loading clinical trait data
We now read in the trait data and match the observations (mice in this data set) to the expression data.
```{r traits}
dataDir <- c("pm579data/WGCNA");
traitData <- read.csv(file.path(dataDir,"ClinicalTraits.csv"));
dim(traitData)
names(traitData)
# remove columns that hold information we do not need.
allTraits <- traitData[, -c(16, 31)];
allTraits <- allTraits[, c(2, 11:36) ];
dim(allTraits)
names(allTraits)
```
Form a data frame analogous to expression data that will hold the clinical traits.
```{r df}
femaleSamples <- rownames(datExpr);
traitRows <- match(femaleSamples, allTraits$Mice);
datTraits <- allTraits[traitRows, -1];
rownames(datTraits) <- allTraits[traitRows, 1];
collectGarbage();
```
We now have the expression data in the variable datExpr, and the corresponding clinical traits in the variable
datTraits. Before we continue with network construction and module detection, we visualize how the clinical traits
relate to the sample dendrogram.
```{r sampleT}
sampleTree2 <- flashClust(dist(datExpr), method = "average")
# Convert traits to a color representation: white means low, red means high, grey means missing entry
traitColors <- numbers2colors(datTraits, signed = FALSE);
# Plot the sample dendrogram and the colors underneath.
plotDendroAndColors(sampleTree2, traitColors,
groupLabels = names(datTraits),
main = "Sample dendrogram and trait heatmap")
```
In the plot above, white means a low value, red a high value, and grey a missing entry.
The last step is to save the relevant expression and trait data for use in the next steps of the tutorial
```{r savedatExpr}
save(datExpr, datTraits, file = "FemaleLiver-01-dataInput.RData")
```
# 2. Network Construction and Module Detection
##2a. Automatic construction of gene network and module identification in female mice
First, we allow multi-threading within WGCNA if your computer has this ability. This helps speed up certain calculations. At present this call is not necessary for the code to work. Any error here may be ignored but you may want to update WGCNA if you see one.
```{r multithread}
allowWGCNAThreads();
```
```{r loadData}
# Load the data saved in the first part
lnames <- load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
```
### 2a.1 Choosing the soft-thresholding power: analysis of network topology
Constructing a weighted gene network entails the choice of the soft
thresholding power β to which co-expression
similarity is raised to calculate adjacency [1]. The authors of [1]
have proposed to choose the soft thresholding power
based on the criterion of approximate scale-free topology. We refer the
reader to that work for more details; here
we illustrate the use of the function pickSoftThreshold that performs
the analysis of network topology and aids the
user in choosing a proper soft-thresholding power. The user chooses a
set of candidate powers (the function provides
suitable default values), and the function returns a set of network
indices that should be inspected, for example as follows:
```{r connect}
# Choose a set of soft-thresholding powers
powers <- c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function
sft <- pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
#sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",
ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
```
The result is shown in the Fig. above. We choose the power 6, which is the lowest power for which the scale-free topology fit
index curve flattens out upon reaching a high value (in this case, roughly 0.90).
### 2a.2 One-step network construction and module detection
Constructing the gene network and identifying modules is now a simple function call:
```{r blockM}
net <- blockwiseModules(datExpr, power = 6, minModuleSize = 30,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = TRUE,
saveTOMFileBase = "femaleMouseTOM",
verbose = 3)
```
We have chosen the soft thresholding power 6, a relatively large
minimum module size of 30, and a medium sensitivity
(deepSplit=2) to cluster splitting. The parameter mergeCutHeight
is the threshold for merging of modules. We have
also instructed the function to return numeric, rather than color,
labels for modules, and to save the Topological
Overlap Matrix. The output of the function may seem somewhat cryptic,
but it is easy to use. For example,
net\$colors contains the module assignment, and net\$MEs contains
the module eigengenes of the modules.
Read the help files for how to modify defaults for different data sets.
Read the .pdf on the website for words of caution with data sets
of size >5000 features.
We now return to the network analysis. To see how many modules
were identified and what the module sizes are, one can use
table(net$colors). Its output is
```{r NumMod}
## Number of modules (18) and sizes (# genes)
table(net$colors)
```
and indicates that there are 18 modules, labeled 1 through 18 in order of descending size, with sizes ranging from
609 to 34 genes. The label 0 is reserved for genes outside of all modules.
The hierarchical clustering dendrogram (tree) used for the module identification is returned in net\$dendrograms[[1]].
The dendrogram can be displayed together with the color assignment using the following code:
```{r plotTree}
# open a graphics window
#sizeGrWindow(12, 9)
# Convert labels to colors for plotting
mergedColors <- labels2colors(net$colors)
# Plot the dendrogram and the module colors underneath
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
```
Save module assignment and module eigengene information necessary for subsequent analysis.
```{r dataSaving}
moduleLabels <- net$colors
moduleColors <- labels2colors(net$colors)
MEs <- net$MEs;
geneTree <- net$dendrograms[[1]];
save(MEs, moduleLabels, moduleColors, geneTree,
file = "FemaleLiver-02-networkConstruction-auto.RData")
```
## 2b. Network construction (step-by-step) and module detection
see: w10code2b-genenetworks-mouseliver.Rmd
## 2c. When data set too big to analyze at 1 time
(skipped)
# 3. Relating Modules to external clinical traits and identifying important genes
Re-load data saved from earlier network analysis.
```{r reloaddata}
# Load the expression and trait data saved in the first part
lnames <- load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames <- load(file = "FemaleLiver-02-networkConstruction-auto.RData");
lnames
```
##3a. Quantifying module–trait associations
In this analysis we would like to identify modules that are significantly associated with the measured clinical traits.
Since we already have a summary profile (eigengene) for each module, we simply correlate eigengenes with external
traits and look for the most significant associations:
```{r eigenvec}
# Define numbers of genes and samples
nGenes <- ncol(datExpr);
nSamples <- nrow(datExpr);
# Recalculate MEs with color labels
MEs0 <- moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs <- orderMEs(MEs0)
moduleTraitCor <- cor(MEs, datTraits, use = "p");
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nSamples);
```
Since we have a moderately large number of modules and traits, a suitable graphical representation will help in
reading the table. We color code each association by the correlation value:
```{r heatmap}
# sizeGrWindow(10,6)
# Will display correlations and their p-values
textMatrix <- paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) <- dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = greenWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
```
The analysis identifies the several significant module–trait associations. We will concentrate on weight as the trait
of interest.
##3b. Gene relationship to trait and important modules: Gene Significance and Module Membership
We quantify associations of individual genes with our trait of interest (weight) by defining Gene Significance GS as
(the absolute value of) the correlation between the gene and the trait. For each module, we also define a quantitative
measure of module membership MM as the correlation of the module eigengene and the gene expression profile. This
allows us to quantify the similarity of all genes on the array to every module.
```{r weight}
# Define variable weight containing the weight column of datTrait
weight <- as.data.frame(datTraits$weight_g);
names(weight) <- "weight"
# names (colors) of the modules
modNames <- substring(names(MEs), 3)
geneModuleMembership <- as.data.frame(cor(datExpr, MEs, use = "p"));
MMPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
names(geneModuleMembership) <- paste("MM", modNames, sep="");
names(MMPvalue) <- paste("p.MM", modNames, sep="");
geneTraitSignificance <- as.data.frame(cor(datExpr, weight, use = "p"));
GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) <- paste("GS.", names(weight), sep="");
names(GSPvalue) <- paste("p.GS.", names(weight), sep="");
```
##3c. Intramodular analysis: identifying genes with high GS and MM
Using the GS and MM measures, we can identify genes that have a
high significance for weight as well as high module
membership in interesting modules. As an example, we look at
the brown module that has the highest association
with weight. We plot a scatterplot of Gene Significance vs.
Module Membership in the brown module:
```{r module}
module <- "brown"
column <- match(module, modNames);
moduleGenes <- moduleColors==module;
#sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
abs(geneTraitSignificance[moduleGenes, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for body weight",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
```
Clearly, GS and MM are highly correlated, illustrating that genes highly
significantly associated with a trait are often also the most important
(central) elements of modules associated with the trait.
The reader is encouraged to try this code with other significance
trait/module correlation (for example, the magenta,
midnightblue, and red modules with weight).
##3d. Summary output of network analysis results
We have found modules with high association with our trait of interest, and have identified their central players by
the Module Membership measure. We can access the features names and use to merge with other annotation information (not shown)
```{r feat}
#length(names(datExpr))
#names(datExpr)[moduleColors=="brown"]
```
# 4. GO analysis
(skipped)
# 5. Visualizing the Network
##5a. Visualizing the gene network
One way to visualize a weighted network is to plot its heatmap. Each row and
column of the heatmap correspond to a single gene. The heatmap can depict
adjacencies or topological overlaps, with light colors denoting
low adjacency (overlap) and darker colors higher adjacency (overlap).
In addition, the gene dendrograms and module
colors are plotted along the top and left side of the heatmap.
The package provides a convenient function to create
such network plots. This TOM plot too big for quick viewing, try on
smaller number of genes.
```{r TOMplot}
nSelect <- 400
# For reproducibility, we set the random seed
set.seed(10);
select <- sample(nGenes, size = nSelect);
# Calculate topological overlap anew: this could be done more efficiently by saving the TOM
# calculated during module detection, but let us do it again here.
dissTOM <- 1-TOMsimilarityFromExpr(datExpr, power = 6);
selectTOM <- dissTOM[select, select];
library(flashClust)
selectTree <- flashClust(as.dist(selectTOM), method = "average")
selectColors <- moduleColors[select];
# Open a graphical window
#sizeGrWindow(9,9)
# Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss <- selectTOM^7;
diag(plotDiss) <- NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
```
##5b. Visualizing the network of eigengenes
It is often interesting to study the relationships among the found modules. One can use the eigengenes as representative
profiles and quantify module similarity by eigengene correlation. The package contains a convenient function
plotEigengeneNetworks that generates a summary plot of the eigengene network. It is usually informative to add a
clinical trait (or multiple traits) to the eigengenes to see how the traits fit into the eigengene network:
```{r ploteigengene}
# plots of eigengenes
# Recalculate module eigengenes
MEs <- moduleEigengenes(datExpr, moduleColors)$eigengenes
# Isolate weight from the clinical traits
weight <- as.data.frame(datTraits$weight_g);
names(weight) = "weight"
# Add the weight to existing module eigengenes
MET <- orderMEs(cbind(MEs, weight))
# Plot the relationships among the eigengenes and the trait
#sizeGrWindow(5,7.5);
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle = 90)
```
The function produces a dendrogram of the eigengenes and trait(s), and a heatmap of their relationships.
```{r sI}
sessionInfo()
```