forked from ChiLiubio/microeco_tutorial
-
Notifications
You must be signed in to change notification settings - Fork 0
/
06-Explainable_class.Rmd
502 lines (389 loc) · 20.8 KB
/
06-Explainable_class.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
# Explainable class
We group trans_env and trans_func classes into 'Explainable class',
as environmental factors and microbial functions can be generally applied to explain microbial community structure and assembly.
## trans_env class
There may be some NA (missing value) in the user's env data.
If so, please add `complete_na = TRUE` for interpolation when creating the trans_env object.
### Example
Creating trans_env object has at least two ways.
The following is using additional environmental data which is not in the microtable object.
```{r, echo = TRUE}
# add_data is used to add the environmental data
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
```
Maybe a more general way is to directly use the data from sample_table of your microtable object.
To show this operation, we first merge additional table into sample_table to generate a new microtable object.
```{r, echo = TRUE}
new_test <- clone(dataset)
new_test$sample_table <- data.frame(new_test$sample_table, env_data_16S[rownames(new_test$sample_table), ])
# now new_test$sample_table has the whole data
new_test
```
Now let's use env_cols to select the required columns from sample_table in the microtable object.
```{r, echo = TRUE}
t1 <- trans_env$new(dataset = new_test, env_cols = 8:15)
```
Generally, it is beneficial to the understanding on environmental variables in order to better use more methods.
So, we first show the cal_diff and cal_autocor functions.
The cal_diff function is used to test the significance of variables across groups like we have shown in trans_alpha and trans_diff class parts.
```{r, echo = TRUE, eval = FALSE}
# use Wilcoxon Rank Sum and Signed Rank Tests as an example
t1$cal_diff(group = "Group", method = "wilcox")
t1$res_diff[, c(1, 2, 4, 6, 7)]
```
```{r, echo = FALSE}
t1$cal_diff(group = "Group", method = "wilcox")
pander::pander(t1$res_diff[1:7, c(1, 2, 4, 6, 7)])
```
Let’s perform the anova and show the letters in the box plot. We use list to store all the plots for each factor and plot them together.
```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "anova", group = "Group")
# place all the plots into a list
tmp <- list()
for(i in colnames(t1$data_env)){
tmp[[i]] <- t1$plot_diff(measure = i, add_sig_text_size = 5, xtext_size = 12) + theme(plot.margin = unit(c(0.1, 0, 0, 1), "cm"))
}
plot(gridExtra::arrangeGrob(grobs = tmp, ncol = 3))
```
```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_env_diff_all.png")
```
From the v0.12.0, the trans_env class supports the differential test of groups within each group by using the by_group parameter in cal_diff function.
```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(group = "Type", by_group = "Group", method = "anova")
t1$plot_diff(measure = "pH", add_sig_text_size = 5)
```
```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_env_diff_bygroup.png")
```
Then we show the autocorrelations among variables.
```{r, echo = TRUE, eval = FALSE}
# require GGally package installed
t1$cal_autocor()
```
```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_env_autocor1.png")
```
For different groups, please use group parameter to show the distributions of variables and the autocorrelations across groups.
```{r, echo = TRUE, eval = FALSE}
t1$cal_autocor(group = "Group")
```
```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_env_autocor_group.png")
```
Then let's do the RDA analysis (db-RDA and RDA).
```{r, echo = TRUE, eval = FALSE}
# use bray-curtis distance to do dbrda
t1$cal_ordination(method = "dbRDA", use_measure = "bray")
# t1$res_rda is the result list stored in the object
t1$trans_ordination(adjust_arrow_length = TRUE, max_perc_env = 1.5)
# t1$res_rda_trans is the transformed result for plotting
t1$plot_ordination(plot_color = "Group")
```
```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_rda_dbrda.png")
```
From v0.14.0, the function `cal_ordination_anova` is implemented to check the significance of the ordination model instead of the encapsulation in `cal_ordination`.
Furthermore, the function `cal_ordination_envfit` can be used to get the contribution of each variables to the model.
```{r, echo = TRUE, eval = FALSE}
t1$cal_ordination_anova()
t1$cal_ordination_envfit()
```
Then, let's try to do RDA at the Genus level.
```{r, echo = TRUE, eval = FALSE}
# use Genus
t1$cal_ordination(method = "RDA", taxa_level = "Genus")
# As the main results of RDA are related with the projection and angles between different arrows,
# we adjust the length of the arrow to show them clearly using several parameters.
t1$trans_ordination(show_taxa = 10, adjust_arrow_length = TRUE, max_perc_env = 1.5, max_perc_tax = 1.5, min_perc_env = 0.2, min_perc_tax = 0.2)
# t1$res_rda_trans is the transformed result for plotting
t1$plot_ordination(plot_color = "Group")
```
```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_rda_genus.png")
```
For more plotting ways, run the following examples.
```{r, echo = TRUE, eval = FALSE}
t1$plot_ordination(plot_color = "Group", plot_shape = "Group")
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "ellipse"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "centroid"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "chull"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "ellipse", "centroid"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "chull", "centroid"), add_sample_label = "SampleID")
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = "centroid", centroid_segment_alpha = 0.9, centroid_segment_size = 1, centroid_segment_linetype = 1)
t1$plot_ordination(plot_color = "Type", plot_type = c("point", "centroid"), centroid_segment_linetype = 1)
```
Mantel test can be used to check whether there is significant correlations between environmental variables and distance matrix.
```{r, echo = TRUE, eval = FALSE}
t1$cal_mantel(use_measure = "bray")
# return t1$res_mantel
t1$res_mantel
```
```{r, echo = FALSE}
t1$cal_mantel(use_measure = "bray")
pander::pander(t1$res_mantel[, -c(2:3)])
```
For the combination of mantel test and correlation heatmap,
please see another example (https://chiliubio.github.io/microeco_tutorial/other-examples-1.html#mantel-test-correlation-heatmap).
The correlations between environmental variables and taxa are important in analyzing and inferring the factors affecting community structure.
Let's first perform a correlation heatmap using relative abundance data at Genus level with the `cal_cor` function.
The parameter `p_adjust_type` can control the p value adjustment type.
The default `p_adjust_type = "Env"` means p adjustment is performed for each environmental variable separately.
If the user needs to adjust p values for all the results together, please use `p_adjust_type = "Type"`.
```{r, echo = TRUE}
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
t1$cal_cor(use_data = "Genus", p_adjust_method = "fdr", p_adjust_type = "Env")
# return t1$res_cor
```
Then, we can plot the correlation results using plot_cor function.
```{r, echo = TRUE, eval = FALSE}
# default ggplot2 method with clustering
t1$plot_cor()
```
There are too many genera.
We can use the filter_feature parameter in plot_cor to filter some taxa that do not have any significance < 0.001.
```{r, echo = TRUE, eval = FALSE}
# filter genera that donot have at least one ***
t1$plot_cor(filter_feature = c("", "*", "**"))
```
Sometimes, if the user wants to do the correlation analysis between the environmental factors and some important taxa detected in the biomarker analysis,
please use **other_taxa** parameter in cal_cor function.
```{r, echo = TRUE, eval = FALSE}
# first create trans_diff object as a demonstration
t2 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "Genus")
# then create trans_env object
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
# use other_taxa to select taxa you need
t1$cal_cor(use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_diff$Taxa[1:40])
t1$plot_cor()
```
```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_corr_ggplot.png")
```
The pheatmap method is also available.
Note that, besides the **color_vector parameter**,
**color_palette** can also be used to control color palette with customized colors.
```{r, echo = TRUE, eval = FALSE}
# clustering heatmap; require pheatmap package
# Let's take another color pallete
t1$plot_cor(pheatmap = TRUE, color_palette = rev(RColorBrewer::brewer.pal(n = 9, name = "RdYlBu")))
```
```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_corr_pheatmap.png")
```
Sometimes, if it is necessary to study the correlations between environmental variables and taxa for different groups,
**by_group parameter** can be used for this goal.
```{r, echo = TRUE, eval = FALSE}
# calculate correlations for different groups using parameter by_group
t1$cal_cor(by_group = "Group", use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_diff$Taxa[1:40])
# return t1$res_cor
t1$plot_cor()
```
```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_corr_ggplot_groups.png")
```
If the user is concerned with the relationship between environmental factors and alpha diversity,
please use **add_abund_table parameter** in the cal_cor function.
```{r, echo = TRUE, eval = FALSE}
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
# use add_abund_table parameter to add the extra data table
t1$cal_cor(add_abund_table = dataset$alpha_diversity)
# try to use ggplot2 with clustering plot
# require ggtree and aplot packages to be installed (https://chiliubio.github.io/microeco_tutorial/intro.html#dependence)
t1$plot_cor(cluster_ggplot = "row")
```
```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_corr_alpha_diversity.png")
```
The function plot_scatterfit() in trans_env class is designed for the scatter plot, adding the fitted line and statistics of correlation or regression.
```{r, echo = TRUE, eval = FALSE}
# use pH and bray-curtis distance
# add correlation statistics
t1$plot_scatterfit(
x = "pH",
y = dataset$beta_diversity$bray[rownames(t1$data_env), rownames(t1$data_env)],
type = "cor",
point_alpha = 0.1, label.x.npc = "center", label.y.npc = "bottom",
x_axis_title = "Euclidean distance of pH",
y_axis_title = "Bray-Curtis distance"
)
```
```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_scatterfit_cor.png")
```
```{r, echo = TRUE, eval = FALSE}
# regression with type = "lm", use group parameter for different groups
t1$plot_scatterfit(
x = dataset$beta_diversity$bray[rownames(t1$data_env), rownames(t1$data_env)],
y = "pH",
type = "lm",
group = "Group",
group_order = c("CW", "TW", "IW"),
point_size = 3, point_alpha = 0.3, line_se = FALSE, line_size = 1.5, shape_values = c(16, 17, 7),
y_axis_title = "Euclidean distance of pH", x_axis_title = "Bray-Curtis distance"
)
```
```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_scatterfit_lmgroup.png")
```
### Key points
+ complete_na parameter in trans_env$new: used to fill the NA (missing value) of the environmental data based on the mice package.
+ env_cols parameter in trans_env$new: select the variables from sample_table of your microtable object.
+ add_abund_table parameter in cal_cor: other customized data can be also provided for the correlation analysis.
+ use_cor parameter in plot_scatterfit: both the correlation and regression are available in this function.
+ cal_mantel(): partial_mantel = TRUE can be used for partial mantel test.
+ plot_ordination(): use plot_type parameter to select point types and env_nudge_x and taxa_nudge_x (also _y) to adjust the text positions.
## trans_func class
Ecological researchers are usually interested in the the funtional profiles of microbial communities,
because functional or metabolic data is powerful to explain the structure and dynamics of microbial communities.
As metagenomic sequencing is complicated and expensive, using amplicon sequencing data to predict functional profiles is an alternative choice.
Several software are often used for this goal, such as PICRUSt [@Langille_Predictive_2013], Tax4Fun [@Aßhauer_Tax4Fun_2015] and FAPROTAX [@Louca_High_2016; @Louca_Decoupling_2016].
These tools are great to be used for the prediction of functional profiles based on the prokaryotic communities from sequencing results.
In addition, it is also important to obtain the traits or functions for each taxa, not just the whole profile of communities.
FAPROTAX database is a collection of the traits and functions of prokaryotes based on the known research results published in books and literatures.
We match the taxonomic information of prokaryotes against this database to predict the traits of prokaryotes on biogeochemical roles.
The NJC19 database [@Lim_Large_2020] is also available for animal-associated prokaryotic data, such as human gut microbiota.
We also implement the FUNGuild [@Nguyen_FUNGuild_2016] and FungalTraits [@Polme_FungalTraits_2020] databases to predict the fungal traits.
The idea identifying prokaryotic traits and functional redundancy was initially inspired by our another study [@Liu_Microbial_2022].
### Example
We first identify/predict traits of taxa with the prokaryotic example data.
```{r, echo = TRUE}
# create object of trans_func
t2 <- trans_func$new(dataset)
# mapping the taxonomy to the database
# this can recognize prokaryotes or fungi automatically if the names of taxonomic levels are standard.
# for fungi example, see https://chiliubio.github.io/microeco_tutorial/other-dataset.html#fungi-data
# default database for prokaryotes is FAPROTAX database
t2$cal_spe_func(prok_database = "FAPROTAX")
# return t2$res_spe_func, 1 represent trait exists, 0 represent no or cannot confirmed.
```
```{r, echo = TRUE, eval = FALSE}
t2$res_spe_func[1:5, 1:2]
```
```{r, echo = FALSE}
pander::pander(t2$res_spe_func[1:5, 1:2])
```
The percentages of the OTUs having the same trait can reflect the functional redundancy of this function in the community.
```{r, echo = TRUE}
# calculate the percentages for communities
# here do not consider the abundance
t2$cal_spe_func_perc(abundance_weighted = FALSE)
# t2$res_spe_func_perc[1:5, 1:2]
```
```{r, echo = FALSE}
pander::pander(t2$res_spe_func_perc[1:5, 1:2])
```
Then we also take an example to show the percentages of the OTUs for each trait in network modules.
```{r, echo = TRUE, eval = FALSE}
# construct a network for the example
network <- trans_network$new(dataset = dataset, cal_cor = "base", taxa_level = "OTU", filter_thres = 0.0001, cor_method = "spearman")
network$cal_network(p_thres = 0.01, COR_cut = 0.7)
network$cal_module()
# convert module info to microtable object
meco_module <- network$trans_comm(use_col = "module")
meco_module_func <- trans_func$new(meco_module)
meco_module_func$cal_spe_func(prok_database = "FAPROTAX")
meco_module_func$cal_spe_func_perc(abundance_weighted = FALSE)
meco_module_func$plot_spe_func_perc(order_x = paste0("M", 1:10))
```
```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_func_perc_module.png")
```
```{r, echo = TRUE, eval = FALSE}
# If you want to change the group list, reset the list t2$func_group_list
t2$func_group_list
# use show_prok_func to see the detailed information of prokaryotic traits
t2$show_prok_func("methanotrophy")
```
```{r, echo = TRUE, eval = FALSE}
# then we try to correlate the res_spe_func_perc of communities to environmental variables
t3 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
t3$cal_cor(add_abund_table = t2$res_spe_func_perc, cor_method = "spearman")
t3$plot_cor(pheatmap = TRUE)
```
```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_func_perc_corr.png")
```
Tax4Fun [@Aßhauer_Tax4Fun_2015] requires a strict input file format associated with the taxonomic information.
To analyze the trimmed or changed OTU data in R with Tax4Fun, we provide a link to the Tax4Fun functional prediction.
Please check out the dependence part https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun for installing Tax4Fun package and download SILVA123 ref data.
```{r, echo = TRUE, eval = TRUE, message=FALSE}
t1 <- trans_func$new(dataset)
# https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun for the installation description
# and provide the file path of SILVA123
t1$cal_tax4fun(folderReferenceData = "./SILVA123")
# return two files: t1$tax4fun_KO: KO file; t1$tax4fun_path: pathway file.
# t1$tax4fun_KO$Tax4FunProfile[1:5, 1:2]
```
```{r, echo = FALSE}
pander::pander(t1$tax4fun_KO$Tax4FunProfile[1:5, 1:2])
```
We further analyze the abundance of predicted metabolic pathways.
```{r, echo = TRUE, eval = TRUE}
# must transpose to taxa row, sample column
pathway_file <- t1$tax4fun_path$Tax4FunProfile %>% t %>% as.data.frame
# filter rownames, only keep ko+number
rownames(pathway_file) %<>% gsub("(^.*);\\s.*", "\\1", .)
# load the pathway hierarchical metadata
data(Tax4Fun2_KEGG)
# further create a microtable object, familiar?
func1 <- microtable$new(otu_table = pathway_file, tax_table = Tax4Fun2_KEGG$ptw_desc, sample_table = t1$sample_table)
print(func1)
```
Now, we need to trim data and calculate abundance.
```{r, echo = TRUE, eval = TRUE}
func1$tidy_dataset()
# calculate abundance automatically at three levels: Level.1, Level.2, Level.3
func1$cal_abund()
print(func1)
```
Then, we can plot the abundance.
```{r, echo = TRUE, eval = FALSE}
# bar plot at Level.1
func2 <- trans_abund$new(func1, taxrank = "Level.1", groupmean = "Group")
func2$plot_bar(legend_text_italic = FALSE)
```
```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar_tax4fun1.png")
```
We can also do something else. For example, we can use lefse to test the differences of the abundances and find the important enriched pathways across groups.
```{r, echo = TRUE, eval = FALSE}
func2 <- trans_diff$new(dataset = func1, method = "lefse", group = "Group", alpha = 0.05, lefse_subgroup = NULL)
func2$plot_diff_bar(threshold = 3, width = 0.8)
```
```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_lefse_bar_tax4fun.png")
```
Tax4Fun2 [@Wemheuer_Tax4Fun2_2020] is another R package for the prediction of functional profiles of prokaryotic communities from 16S rRNA gene sequences.
It also provides two indexes for the evaluation of functional gene redundancies.
If the user want to use Tax4Fun2 method, the representative fasta file is necessary to be added in the microtable object.
Please check out https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun2 to see
how to read fasta file with `read.fasta` of seqinr package or `readDNAStringSet` of Biostrings package.
Please also see https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun2 for downloading ncbi-blast and Ref99NR/Ref100NR.
For windows system, ncbi-blast-2.5.0+ is recommended since other versions can not operate well.
```{r, echo = TRUE, eval = FALSE}
# first delete the dataset created before
rm(dataset)
# load the example dataset from microeco package as there is the rep_fasta object in it
data(dataset)
dataset
t1 <- trans_func$new(dataset)
# create a directory for result and log files
dir.create("test_prediction")
# https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun2 for installation
# ignore blast_tool_path parameter if blast tools have been in path
# the function can search whether blast tool directory is in the path, if not, automatically use provided blast_tool_path parameter
t1$cal_tax4fun2(blast_tool_path = "ncbi-blast-2.5.0+/bin", path_to_reference_data = "Tax4Fun2_ReferenceData_v2",
database_mode = "Ref99NR", path_to_temp_folder = "test_prediction")
# prepare feature table and metadata
data(Tax4Fun2_KEGG)
# create a microtable object for pathways
func2 <- microtable$new(otu_table = t1$res_tax4fun2_pathway, tax_table = Tax4Fun2_KEGG$ptw_desc, sample_table = dataset$sample_table)
func2$tidy_dataset()
func2$cal_abund()
# calculate functional redundancies
t1$cal_tax4fun2_FRI()
```
### Key points
+ blast_tool_path parameter in cal_tax4fun2: if the blast tool has been in 'environment variable' of computer, it is ok to use blast_tool_path = NULL
+ blast version: tax4fun2 require NCBI blast tool. However, some errors often come from the latest versions (https://www.biostars.org/p/413294/). An easy solution is to use previous version (such as v2.5.0).