forked from ChiLiubio/microeco_tutorial
-
Notifications
You must be signed in to change notification settings - Fork 0
/
microeco-tutorial.Rmd
3912 lines (3014 loc) · 163 KB
/
microeco-tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Tutorial for R microeco package (v0.14.1)"
author: "Chi Liu, Felipe R. P. Mansoldo, Umer Zeeshan Ijaz, Chenhao Li, Yang Cao, Jarrod J. Scott, Yaoming Cui, Alane B. Vermelho, Minjie Yao, Xiangzhen Li"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output: bookdown::gitbook
documentclass: book
bibliography: [book.bib, packages.bib]
biblio-style: apalike
link-citations: yes
github-repo: rstudio/bookdown-demo
description: "The tutorial for R microeco, file2meco, meconetcomp and mecodev packages"
---
# Background
R language [@R-base] and its packages ecosystem are wonderful tools for data analysis.
In community ecology, a series of packages are available for statistical analysis,
such as vegan [@Jari_vegan_2019], ape [@Paradis_ape_2018] and picante [@Picante_Kembel_2010].
However, with the development of the high-throughput sequencing techniques,
the increasing data amount and complexity of studies make the data mining in microbiome a challenge.
There have been some R packages created specifically for the statistics and visualization of microbiome data,
such as phyloseq [@Mcmurdie_phyloseq_2013],
microbiome (https://github.com/microbiome/microbiome), microbiomeSeq (http://www.github.com/umerijaz/microbiomeSeq),
ampvis2 (https://github.com/KasperSkytte/ampvis2), MicrobiomeR(https://github.com/vallenderlab/MicrobiomeR),
theseus [@Price_theseus_2018], rANOMALY [@Theil_rANOMALY_2021],
tidyMicro [@Carpenter_tidyMicro_2021], microbial (https://github.com/guokai8/microbial),
amplicon (https://github.com/microbiota/amplicon),
MicrobiotaProcess (https://github.com/YuLab-SMU/MicrobiotaProcess)
and so on.
In addition, some web tools associated with R language are also useful for microbiome data analysis,
such as Shiny-phyloseq [@McMurdie_Shiny_2015], MicrobiomeExplorer [@Reeder_MicrobiomeExplorer_2021],
animalcules [@Zhao_animalcules_2021] and Namco [@Dietrich_Namco_2022].
Even so, researchers still lack a flexible, comprehensive and modularized R package to analyze and manage the data fast and easily.
Based on this background, we created the R microeco package [@Liu_microeco_2021] (https://github.com/ChiLiubio/microeco).
Besides, we also developed the file2meco package (https://github.com/ChiLiubio/file2meco) for the data input from some famous tools easily.
<!--chapter:end:index.Rmd-->
# Introduction {#intro}
The microeco package has several advantages compared to other packages in R.
The main goal of developing this package is to help users analyze microbiome data fast.
So a series of commonly-used and cutting-edge approaches are implemented.
To facilitate the data mining, the whole structure of microeco package are highly modularized to
make users conveniently remember, search and use the classes.
It is notable that, beside the demonstration in the tutorial, users can also save the intermediate files in each object and
apply those files to other tools according to the format requirement.
Main files stored in the object of each class are the frequently-used data.frame format.
So the intermediate and result files are easily saved, modified and used for other tools in microbial ecology.
Before starting the specific usage of each class, let's first introduce several key points.
## Framework
This is a rough framework for users to fast understand the design of microeco package.
```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/microeco_framework.png")
```
The stored 'Functions' and 'Files' represent that the user can access those functions or files in R6 object using $ operator
as shown in the figure. An example is the function `dataset$cal_alphadiv()` and its return result `dataset$alpha_diversity`.
The dataset is a microtable object.
Generally, the return files of functions are named with the prefix 'res_' to make users easily find them when using Rstudio and the keyboard shortcuts (Tab).
Except for microtable class, the transformed data in created object is generally named with the prefix 'data_'.
## R6 Class
All the main classes in microeco package depend on the R6 class [@R6_Winston].
R6 uses the encapsulated object-oriented (OO) programming paradigm,
which means that R6 is a profoundly different OO system from S3 and S4 because it is built on encapsulated objects, rather than generic functions.
If the user is interested in the class features, read more from 'Advanced R' book (https://adv-r.hadley.nz/).
+ A generic is a regular function, so it lives in the global namespace. An R6 method belongs to an object so it lives in a local namespace.
This influences how we think about naming. The methods belong to objects, not generics, and the user can call them like object$method().
+ R6’s reference semantics allow methods to simultaneously return a value and modify an object.
+ Every R6 object has an S3 class that reflects its hierarchy of R6 class.
## Help
The usage of help documents in the microeco package may be a little different from other packages we often used.
If the user wish to see the help document of a function, please search the name of the class it belongs to (not the name of the function)
and click the link of the function.
```{r, echo = TRUE}
# first install microeco, see https://github.com/ChiLiubio/microeco
# load package microeco
library(microeco)
```
```{r, echo = TRUE, eval = FALSE}
# this can show all the functions and the detailed descriptions of the microtable class
# same with: help(microtable)
?microtable
```
## RTools
For Windows system, RTools (https://cran.r-project.org/bin/windows/Rtools/) is necessary to install some R packages from source, such as R packages deposited in GitHub.
## Dependence
### Important packages
To keep the start and use of microeco package simplified,
the installation of microeco only depend on several packages, which are compulsory-installed from CRAN and frequently used in the data analysis.
So the question is that the user may encounter an error when using a class or function that invoke an additional package like this:
```{r, echo = TRUE, eval = FALSE}
library(microeco)
data(dataset)
t1 <- trans_network$new(dataset = dataset, filter_thres = 0.001)
t1$cal_network(network_method = "SpiecEasi")
```
```html
Error in t1$cal_network(network_method = "SpiecEasi"): igraph package not installed ...
```
<br>
The reason is that network construction requires igraph package. We donot put the igraph and some other packages on the "Imports" part of microeco package.
In addition, some packages, e.g. SpiecEasi, are released on github and can not be installed automatically.
The solutions:
1. Install the package when encounter such an error. Actually, it's very easy to install the packages from CRAN or bioconductor or github. Just try it.
2. Install the packages in advance.
This is recommended if the user is interested in most of the methods of microeco package and want to run a large number of examples in this tutorial.
### CRAN packages
We show some packages that are published in CRAN and not installed automatically.
Those packages are necessary to reproduce some parts of the tutorial.
```{r, echo = FALSE, eval = TRUE}
pre_pac <- read.delim("prerequite_packages.tsv")
```
```{r, echo = FALSE, eval = TRUE}
knitr::kable(pre_pac)
```
Then, if you want to install these packages or some of them, you can do like this:
```{r, echo = TRUE, eval = FALSE}
# If a package is not installed, it will be installed from CRAN.
# First select the packages of interest
packages <- c("MASS", "GUniFrac", "ggpubr", "randomForest", "ggdendro", "ggrepel", "agricolae", "picante", "pheatmap", "igraph", "rgexf",
"ggalluvial", "ggh4x", "rcompanion", "FSA", "gridExtra", "aplot", "NST", "GGally")
# Now check or install
for(x in packages){
if(!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE)
}
}
```
### ggtree
Plotting the cladogram from LEfSe result requires the ggtree package in bioconductor (https://bioconductor.org/packages/release/bioc/html/ggtree.html).
```{r, echo = TRUE, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ggtree")
```
### Gephi
Gephi is an excellent network visualization tool and used to open the saved network file,
i.e. network.gexf in the [tutorial](https://chiliubio.github.io/microeco_tutorial/model-based-class.html#trans_network-class).
You can download Gephi and learn how to use it from https://gephi.org/users/download/
### Tax4Fun
Tax4Fun is an R package used for predicting the functional potential of prokaryotic communities.
1. install Tax4Fun package
```{r, echo = TRUE, eval = FALSE}
install.packages("RJSONIO")
install.packages(system.file("extdata", "biom_0.3.12.tar.gz", package="microeco"), repos = NULL, type = "source")
install.packages(system.file("extdata", "qiimer_0.9.4.tar.gz", package="microeco"), repos = NULL, type = "source")
install.packages(system.file("extdata", "Tax4Fun_0.3.1.tar.gz", package="microeco"), repos = NULL, type = "source")
```
2. download SILVA123 reference data from http://tax4fun.gobics.de/
unzip SILVA123.zip and provide this path to the folderReferenceData parameter of cal_tax4fun function in trans_func class.
### Tax4Fun2
Tax4Fun2 is another R package for the the prediction of functional profiles and functional gene redundancies of prokaryotic communities [@Wemheuer_Tax4Fun2_2020].
It has higher accuracies than PICRUSt and Tax4Fun. The Tax4Fun2 approach implemented in microeco is a little different from the original package.
Using Tax4Fun2 approach require the representative fasta file.
The user do not need to install Tax4Fun2 R package again.
The only thing need to do is to download the blast tool (**ignore this if the blast tool has been in the path**) and Ref99NR/Ref100NR database (select one).
Download blast tools from "ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+" ; e.g. ncbi-blast-\*\*\*\*-x64-win64.tar.gz for windows system.
Note that some errors can come from the latest versions because of memory issue (https://www.biostars.org/p/413294/).
An easy solution is to use previous version (such as 2.5.0).
Download Ref99NR.zip from "https://cloudstor.aarnet.edu.au/plus/s/DkoZIyZpMNbrzSw/download" or Ref100NR.zip from "https://cloudstor.aarnet.edu.au/plus/s/jIByczak9ZAFUB4/download" .
Uncompress all the folders. The final folders should be like these structures:
blast tools:
|-- ncbi-blast-2.5.0+
|---- bin
|------ blastn.exe
|------ makeblastdb.exe
|------ ......
Ref99NR:
|-- Tax4Fun2_ReferenceData_v2
|---- Ref99NR
|------ otu000001.tbl.gz
|------ ......
|------ Ref99NR.fasta
|------ Ref99NR.tre
The path "Tax4Fun2_ReferenceData_v2" will be required in the trans_func$cal_tax4fun2() function.
The blast tool path "ncbi-blast-2.5.0+/bin" is also required if it is not added to the system env path (environmental variable).
```{r, echo = TRUE, eval = FALSE}
# Either seqinr or Biostrings package should be installed for reading and writing fasta file
install.packages("seqinr", dependencies = TRUE)
# or install Biostrings from bioconductor https://bioconductor.org/packages/release/bioc/html/Biostrings.html
# Now we show how to read the fasta file
# see https://github.com/ChiLiubio/file2meco to install file2meco
rep_fasta_path <- system.file("extdata", "rep.fna", package="file2meco")
rep_fasta <- seqinr::read.fasta(rep_fasta_path)
# or use Biostrings package
rep_fasta <- Biostrings::readDNAStringSet(rep_fasta_path)
# try to create a microtable object with rep_fasta
data("otu_table_16S")
# In microtable class, all the taxa names should be necessarily included in rep_fasta
otu_table_16S <- otu_table_16S[rownames(otu_table_16S) %in% names(rep_fasta), ]
test <- microtable$new(otu_table = otu_table_16S, rep_fasta = rep_fasta)
test
```
## Plot
Most of the plots in the package rely on the ggplot2 package system.
We provide some parameters to optimize the corresponding plot, but it may be far from enough.
The user can also assign the output a name and use the ggplot2-style grammers to modify it.
Each data table used for visualization is stored in the object and can be saved for the customized analysis.
Of course, the user can also directly modify the class and reload them to use.
Any contribution of a modified class is appreciated via Github-Pull requests (https://github.com/ChiLiubio/microeco_tutorial/pulls) or Email ([email protected]).
<!--chapter:end:01-intro.Rmd-->
# Basic class
The microtable class is the basic class.
All the other classes depend on the microtable class.
```{r, out.width = "8000px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/microtable_framework.png")
```
The objects inside the rectangle with full line represent functions.
The red rectangle means it is extremely important function.
The dashed line denotes the key objects (input or output of functions) that deserve more attention.
## microtable class
Many tools can be used for the bioinformatic analysis of amplicon sequencing data, such as QIIME [@Caporaso_QIIME_2010], QIIME2 [@Bolyen_Reproducible_2019],
usearch (https://www.drive5.com/usearch/), mothur [@Schloss_Introducing_2009],
SILVAngs (https://ngs.arb-silva.de/silvangs/),
and RDP (http://rdp.cme.msu.edu/).
Although the formats of result files may vary across tools, the main contents can be generally classified into the following parts:
(1) OTU/ASV table, i.e. the feature-sample abundance table;
(2) taxonomic assignment table;
(3) representative sequences;
(4) phylogenetic tree;
(5) metadata.
It is generally useful to create a detailed sample metadata table to store all the sample information (including the environmental data).
The microtable class is the basic class and designed to store the basic data for all the downstream analysis in the microeco package.
At least, the OTU table (i.e. feature-sample abundance table) should be provided to create microtable object.
Thus, the microtable class can determine that the sample information table is missing and create a default sample table according to
sample names in otu_table.
To make the file input more convenient,
we also build another R package file2meco (https://github.com/ChiLiubio/file2meco) to read the output files of some tools into microtable object.
Currently, those tools/softwares include not only commonly-used QIIME [@Caporaso_QIIME_2010] and QIIME2[@Bolyen_Reproducible_2019],
but also several metagenomic tools, such as HUMAnN [@Franzosa_Species_2018] and kraken2 [@Wood_Improved_2019].
In this tutorial, the data inside the package was employed to show some basic operations.
### Prepare the example data
The 16S rRNA gene sequencing results in the example data of the package is used to show the main part of the tutorial.
This dataset is the 16S rRNA gene Miseq sequencing results of wetland soils in China published by An et al. [@An_Soil_2019],
who surveyed soil prokaryotic communities in Chinese inland wetlands (IW),
coastal wetland (CW) and Tibet plateau wetlands (TW) using amplicon sequencing.
These wetlands include both saline and non-saline samples (classified for the tutorial).
The sample information table has 4 columns: "SampleID", "Group", "Type" and "Saline".
The column "SampleID" is same with the rownames.
The column "Group" represents the IW, CW and TW.
The column "Type" means the sampling region: northeastern region (NE), northwest region (NW), North China area (NC),
middle-lower reaches of the Yangtze River (YML), southern coastal area (SC), upper reaches of the Yangtze River (YU), Qinghai-Tibet Plateau (QTP).
The column "Saline" denotes the saline soils and non-saline soils.
In this dataset, the environmental factor table is separated from the sample information table.
It is also recommended to put all the environmental data into sample information table.
```{r, echo = TRUE}
library(microeco)
# load the example data; 16S rRNA gene amplicon sequencing dataset
# metadata table; data.frame
data(sample_info_16S)
# feature table; data.frame
data(otu_table_16S)
# taxonomic assignment table; data.frame
data(taxonomy_table_16S)
# phylogenetic tree; not necessary; use for the phylogenetic analysis
# Newick format; use read.tree function of ape package to read a tree
data(phylo_tree_16S)
# load the environmental data table if it is not in sample table
data(env_data_16S)
# use pipe operator in magrittr package
library(magrittr)
# fix the random number generation to make the results repeatable
set.seed(123)
# make the plotting background same with the tutorial
library(ggplot2)
theme_set(theme_bw())
```
Make sure that the data types of sample_table, otu_table and tax_table are all `data.frame` format as the following part shows.
```{r, echo = TRUE}
class(otu_table_16S)
```
```{r, echo = TRUE, eval = FALSE}
otu_table_16S[1:5, 1:5]
```
```{r, echo = FALSE}
pander::pander(otu_table_16S[1:5, 1:5])
```
```{r, echo = TRUE}
class(taxonomy_table_16S)
```
```{r, echo = TRUE, eval = FALSE}
taxonomy_table_16S[1:5, 1:3]
```
```{r, echo = FALSE}
pander::pander(taxonomy_table_16S[1:5, 1:3])
```
Generally, users' taxonomic table has some messy information, such as NA, unidentified and unknown.
These information can potentially influence the following taxonomic abundance calculation and other taxonomy-based analysis.
So it is usually necessary to clean this data using the `tidy_taxonomy` function.
Another very important result of this operation is to **unify the taxonomic prefix** automatically,
e.g., converting D_1__ to p__ for Phylum level or adding p__ to Phylum directly if no prefix is found.
```{r, echo = TRUE, eval = FALSE}
# make the taxonomic information unified, very important
taxonomy_table_16S %<>% tidy_taxonomy
```
The rownames of sample_table in microtable object (i.e. sample names) are used for selecting samples/groups in all the related operations in the package.
Using pure number as sample names is **not recommended** in case of unknown disorder or man-made mistake.
**Before creating microtable object, make sure that the rownames of sample information table are sample names**.
```{r, echo = TRUE}
class(sample_info_16S)
```
```{r, echo = TRUE, eval = FALSE}
sample_info_16S[1:5, ]
```
```{r, echo = FALSE}
pander::pander(sample_info_16S[1:5, ])
```
In this example, the environmental data is stored in the env_data_16S alone.
The user can also directly integrate those data into the sample information table.
```{r, echo = TRUE}
class(env_data_16S)
```
```{r, echo = FALSE}
pander::pander(env_data_16S[1:5, 1:5])
```
```{r, echo = TRUE}
class(phylo_tree_16S)
```
Then, we create an object of microtable class.
This operation is very similar with the package phyloseq[@Mcmurdie_phyloseq_2013], but in microeco it is more brief.
The otu_table in the microtable class must be the feature-sample format: rownames - OTU/ASV/pathway/other names; colnames - sample names.
**The colnames in otu_table must have overlap with rownames of sample_table**.
Otherwise, the following check can filter all the samples of otu_table because of no same sample names between otu_table and sample_table.
```{r, echo = TRUE}
# In R6 class, '$new' is the original method used to create a new object of class
# If you only provide abundance table, the class can help you create a sample info table
dataset <- microtable$new(otu_table = otu_table_16S)
class(dataset)
# generally add the metadata
dataset <- microtable$new(otu_table = otu_table_16S, sample_table = sample_info_16S)
dataset
# Let's create a microtable object with more information
dataset <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S)
dataset
```
### How to read your files to microtable object?
The above-mentioned example data are directly loaded from microeco package.
So the question is __how to read your data to create a microtable object?__
There are two ways:
▲ 1. __Use file2meco package__
R package file2meco (https://chiliubio.github.io/microeco_tutorial/file2meco-package.html) is designed to directly read the output files of some famous tools into microtable object.
Currently, it supports QIIME [@Caporaso_QIIME_2010], QIIME2[@Bolyen_Reproducible_2019],
HUMAnN [@Franzosa_Species_2018], MetaPhlAn [@Truong_MeTApHLaN2_2015], kraken2 [@Wood_Improved_2019], phyloseq [@Mcmurdie_phyloseq_2013], etc.
Please read the tutorial of file2meco package for more detailed information (https://chiliubio.github.io/microeco_tutorial/file2meco-package.html).
▲ 2. __Other cases__
To transform customized files to microtable object,
there should be two steps:
__I) read files to R__
The required format of microtable\$new parameters, __otu_table__, __sample_table__ and __tax_table__, are all the data.frame, which is the most frequently-used data format in R.
So no matter what the format the files are, they should be first read into R with some functions, such as `read.table` and `read.csv`.
If the user want to perform phylogenetic analysis, please also read your phylogenetic tree using `read.tree` function of ape package and
provide the tree to the __phylo_tree__ parameter of microtable\$new function like the above example.
__II) create the microtable object__
Then the user can create the microtable object like the operation in the last section.
Please also see the help document of the microtable class for detailed descriptions using the following help command.
```{r, echo = TRUE, eval = FALSE}
# search the class name, not the function name
?microtable
# then see microtable$new()
```
### Functions in microtable class
Then, we remove OTUs which are not assigned in the Kingdom "k__Archaea" or "k__Bacteria".
```{r, echo = TRUE}
# use R subset function to filter taxa in tax_table
dataset$tax_table %<>% base::subset(Kingdom == "k__Archaea" | Kingdom == "k__Bacteria")
# another way with grepl function
dataset$tax_table %<>% .[grepl("Bacteria|Archaea", .$Kingdom), ]
dataset
```
We also remove OTUs with the taxonomic assignments "mitochondria" or "chloroplast".
```{r, echo = TRUE}
# This will remove the lines containing the taxa word regardless of taxonomic ranks and ignoring word case in the tax_table.
# So if you want to filter some taxa not considerd pollutions, please use subset like the previous operation to filter tax_table.
dataset$filter_pollution(taxa = c("mitochondria", "chloroplast"))
dataset
```
To make the OTU and sample information consistent across all files in the dataset object, we use function `tidy_dataset` to trim the dataset.
```{r, echo = TRUE}
dataset$tidy_dataset()
print(dataset)
```
Then let's use sample_sums() to check the sequence numbers in each sample.
```{r, echo = TRUE}
dataset$sample_sums() %>% range
```
Sometimes, in order to reduce the effects of sequencing depth on the diversity measurements,
it is optional to perform the resampling to make the sequence number equal for each sample.
The function `rarefy_samples` can invoke the function `tidy_dataset` automatically before and after the rarefying.
```{r, echo = TRUE}
# As an example, use 10000 sequences in each sample
dataset$rarefy_samples(sample.size = 10000)
dataset$sample_sums() %>% range
```
Then, let's calculate the taxa abundance at each taxonomic rank using `cal_abund()`.
This function **generate a list called taxa_abund stored in the microtable object**.
This list contain several data frame of the abundance information at each taxonomic rank.
It's worth noting that the `cal_abund()` function can be used to **solve more complicated cases with special parameters**,
such as supporting both the relative and absolute abundance calculation and selecting the partial 'taxonomic' columns.
Those have been shown in file2meco package part (https://chiliubio.github.io/microeco_tutorial/file2meco-package.html#humann-metagenomic-results) with complex metagenomic dataset.
```{r, echo = TRUE}
# use default parameters
dataset$cal_abund()
# return dataset$taxa_abund
class(dataset$taxa_abund)
```
```{r, echo = TRUE, eval = FALSE}
# show part of the relative abundance at Phylum level
dataset$taxa_abund$Phylum[1:5, 1:5]
```
```{r, echo = FALSE}
pander::pander(dataset$taxa_abund$Phylum[1:5, 1:5])
```
The function `save_abund()` can be used to save the taxa abundance file to a local place easily.
```{r, echo = TRUE, eval = FALSE}
dataset$save_abund(dirpath = "taxa_abund")
```
Then, let's calculate the alpha diversity.
The result is also stored in the object microtable automatically.
```{r, echo = TRUE}
# If you want to add Faith's phylogenetic diversity, use PD = TRUE, this will be a little slow
dataset$cal_alphadiv(PD = FALSE)
# return dataset$alpha_diversity
class(dataset$alpha_diversity)
```
```{r, echo = TRUE, eval = FALSE}
# save dataset$alpha_diversity to a directory
dataset$save_alphadiv(dirpath = "alpha_diversity")
```
Let's go on to beta diversity with function `cal_betadiv()`.
If method parameter is not provided, the function automatically calculates Bray-curtis, Jaccard, weighted Unifrac and unweighted unifrac matrixes.
```{r, echo = FALSE, eval = TRUE, message = FALSE}
invisible(dataset$cal_betadiv(unifrac = FALSE))
```
```{r, echo = TRUE, eval = FALSE}
# unifrac = FALSE means do not calculate unifrac metric
# require GUniFrac package installed
dataset$cal_betadiv(unifrac = TRUE)
# return dataset$beta_diversity
class(dataset$beta_diversity)
# save dataset$beta_diversity to a directory
dataset$save_betadiv(dirpath = "beta_diversity")
```
### subset of samples
We donnot provide a special function to filter samples in microtable object, as we think it is redundant.
**We recommend manipulating the sample_table in microtable object directly.**
For example, if you want to extract samples of 'CW' group, please do like this:
```{r, echo = TRUE}
# remember first clone the whole dataset
# see https://chiliubio.github.io/microeco_tutorial/notes.html#clone-function
group_CW <- clone(dataset)
# select 'CW'
group_CW$sample_table <- subset(group_CW$sample_table, Group == "CW")
# or: group_CW$sample_table <- subset(group_CW$sample_table, grepl("CW", Group))
# use tidy_dataset to trim all the basic files
group_CW$tidy_dataset()
group_CW
```
### filter features
Please use filter_taxa function to filter the features with low abundance or occurrence frequency.
For other operations on the features, please directly manipulate the otu_table of your microtable object.
```{r, echo = TRUE}
# It is better to have a backup before filtering features
dataset_filter <- clone(dataset)
# mean relative abundance threshold 0.0001
# occurrence frequency 0.1; 10% samples have the target features
dataset_filter$filter_taxa(rel_abund = 0.0001, freq = 0.1)
```
### merge taxa or samples
Merging taxa according to a specific taxonomic rank level of tax_table can generate a new microtable object.
In the new microtable object, each feature in otu_table represents one taxon at the output level.
```{r, echo = TRUE}
test <- dataset$merge_taxa(taxa = "Family")
test
```
Similarly, merging samples according to a specific group of sample_table can also generate a new microtable object.
```{r, echo = TRUE}
test <- dataset$merge_samples(use_group = "Group")
test
```
### Other examples
In microtable$new, if auto_tidy = TRUE, the function can automatically use tidy_dataset to make all files uniform.
Then, all other functions in microtable will also do this. But if the user changes the file in microtable object,
the class can not recognize this modification, the user should use `tidy_dataset` function to manually trim the microtable object.
```{r, echo = TRUE, eval = TRUE}
test <- microtable$new(sample_table = sample_info_16S[1:40, ], otu_table = otu_table_16S, auto_tidy = FALSE)
test
test1 <- microtable$new(sample_table = sample_info_16S[1:40, ], otu_table = otu_table_16S, auto_tidy = TRUE)
test1
test1$sample_table %<>% .[1:10, ]
test1
test1$tidy_dataset()
test1
```
The phylogenetic tree can be read with `read.tree` function in ape package.
```{r, echo = TRUE, eval = FALSE}
# use the example data rep_phylo.tre in file2meco package https://chiliubio.github.io/microeco_tutorial/file2meco-package.html#qiime
phylo_file_path <- system.file("extdata", "rep_phylo.tre", package="file2meco")
tree <- ape::read.tree(phylo_file_path)
```
Other functions and examples are listed here.
```{r, echo = TRUE, eval = FALSE}
# clone a complete dataset named test
test <- clone(dataset)
# add the rownames of tax_table as the last column of tax_table directly; useful in some analysis, e.g. biomarker finding at OTU/ASV level.
ncol(test$tax_table)
test$add_rownames2taxonomy(use_name = "OTU")
ncol(test$tax_table)
# sum the abundance for each taxa; very useful for taxa abundance checking and filtering
test$taxa_sums()
# rename feature names in all the files of microtable object
test$rename_taxa(newname_prefix = "new_name_")
rownames(test$otu_table)[1:5]
rownames(test$tax_table)[1:5]
# output sample names in microtable object
test$sample_names()[1:5]
# output taxa names in microtable object
test$taxa_names()[1:5]
```
### Key points
+ sample_table: rownames of sample_table must be sample names used
+ otu_table: rownames must be feature names; colnames must be sample names
+ `microtable` class: creating microtable object requires at least one file input (otu_table)
+ `tidy_taxonomy()`: necessary to make taxonomic table have unified format
+ `tidy_dataset()`: necessary to trim files in microtable object
+ `add_rownames2taxonomy()`: add the rownames of tax_table as the last column of tax_table
+ `cal_abund()`: powerful and flexible to cope with complex cases in tax_table, see the parameters
+ taxa_abund: taxa_abund is a list stored in microtable object and have several data frame
+ beta_diversity: beta_diversity is a list stored in microtable object and have several distance matrix
<!--chapter:end:02-Basic_class.Rmd-->
# Composition-based class
The trans_abund class and trans_venn class are organised into the section 'Composition-based class',
since they are mainly used to show the composition information of communities.
## trans_abund class
The trans_abund class has several functions to visualize taxonomic abundance based on the ggplot2 package.
### Example
We first show the bar plot example.
```{r, echo = TRUE}
# create trans_abund object
# use 10 Phyla with the highest abundance in the dataset.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10)
# t1 object now include the transformed abundance data t1$abund_data and other elements for the following plotting
```
As the sample number is large, we do not show the sample names in x axis and add the facet to show abundance according to groups.
```{r, echo = TRUE, eval = FALSE}
t1$plot_bar(others_color = "grey70", facet = "Group", xtext_keep = FALSE, legend_text_italic = FALSE)
# return a ggplot2 object
```
```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar.png")
```
Two or more facets are supported with the facet parameter from v0.14.0 by providing a vector with multiple elements.
```{r, echo = TRUE, eval = FALSE}
# require package ggh4x, first run install.packages("ggh4x") if not installed
t1$plot_bar(others_color = "grey70", facet = c("Group", "Type"), xtext_keep = FALSE, legend_text_italic = FALSE, barwidth = 1)
```
```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar_facet2.png")
```
The default operation can filter all the unclassified taxa (i.e. p__ or g__ in tax_table that has been processed by `tidy_taxonomy` function),
as those unknown taxa are generally meaningless.
However sometimes, these unknown taxa may be meaningful for users.
For example, if one want to isolate some unknown species, it is valuable to check the abundance of those unknown taxa.
At this time, please see this topic (https://github.com/ChiLiubio/microeco/issues/165) to resolve the issue that how to show unknown taxa with hierarchical taxonomy classification.
The alluvial plot is also implemented in the plot_bar function with use_alluvium parameter.
```{r, echo = TRUE, eval = FALSE}
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 8)
# require ggalluvial package
# use_alluvium = TRUE make the alluvial plot, clustering =TRUE can be used to reorder the samples by clustering
# bar_type = "notfull" can discard 'others'; select another color palette
t1$plot_bar(bar_type = "notfull", use_alluvium = TRUE, clustering = TRUE, xtext_type_hor = FALSE, xtext_size = 6, color_values = RColorBrewer::brewer.pal(8, "Set2"))
```
```{r, fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar_allu.png")
```
The bar plot can also be performed with group mean values.
```{r, echo = TRUE, eval = FALSE}
# The groupmean parameter can be used to obtain the group-mean barplot.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10, groupmean = "Group")
g1 <- t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)
g1 + theme_classic() + theme(axis.title.y = element_text(size = 18))
```
```{r, out.width = "400px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar_mean.png")
```
The box plot is an excellent way to intuitionally show abundance distribution across groups.
```{r, echo = TRUE, eval = FALSE}
# show 15 taxa at Class level
t1 <- trans_abund$new(dataset = dataset, taxrank = "Class", ntaxa = 15)
t1$plot_box(group = "Group")
```
```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_box.png")
```
Then we show the heatmap with the high abundant genera.
```{r, echo = TRUE, eval = FALSE}
# show 40 taxa at Genus level
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 40)
t1$plot_heatmap(facet = "Group", xtext_keep = FALSE, withmargin = FALSE)
```
```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_heatmap.png")
```
Line chart is very useful to show the abundance change of taxa along time, space or other gradients.
```{r, echo = TRUE, eval = FALSE}
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 5)
t1$plot_line()
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 5, group = "Type")
t1$plot_line(position = position_dodge(0.3), xtext_type_hor = TRUE)
```
```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_line_type.png")
```
Then, we show the pie chart with the group mean values.
```{r, echo = TRUE, eval = FALSE}
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 6, groupmean = "Group")
# all pie chart in one row
t1$plot_pie(facet_nrow = 1)
```
```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_pie.png")
```
### Key points
+ trans_abund$new: creating trans_abund object can invoke taxa_abund in microtable for transformation
+ color_values parameter: color_values parameter in each function is used for colors selection
+ input_taxaname parameter: input_taxaname parameter in trans_abund$new can be used to select interested customized taxa instead of abundance-based selection
+ use_percentage parameter: use_percentage parameter in trans_abund$new - whether show the abundance percentage
## trans_venn class
The trans_venn class is developed for venn analysis, i.e. shared and unique taxa across samples/groups.
### Example
This part can be performed using samples or groups at OTU/ASV level or higher taxonomic level.
To analyze the unique and shared OTUs of groups,
we first merge samples according to the "Group" column of sample_table.
```{r, echo = TRUE, eval = FALSE}
# merge samples as one community for each group
dataset1 <- dataset$merge_samples(use_group = "Group")
# dataset1 is a new microtable object
# create trans_venn object
t1 <- trans_venn$new(dataset1, ratio = NULL)
t1$plot_venn()
```
```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_0.png")
```
```{r, echo = TRUE, eval = FALSE}
# create venn plot with more information
t1 <- trans_venn$new(dataset1, ratio = "seqratio")
t1$plot_venn()
# The integer is OTU number
# The percentage data is the sequence number/total sequence number
```
```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_1.png")
```
When the groups are too many to show with venn plot, using petal plot is better.
```{r, echo = TRUE, eval = FALSE}
# use "Type" column in sample_table
dataset1 <- dataset$merge_samples(use_group = "Type")
t1 <- trans_venn$new(dataset1)
t1$plot_venn(petal_plot = TRUE, petal_center_size = 50, petal_r = 1.5, petal_a = 3, petal_move_xy = 3.8, petal_color_center = "#BEBADA")
```
```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_2.png")
```
Another way to plot the results is to use plot_bar function, which is especially useful for a large number of samples/groups.
This way is generally called UpSet plot.
```{r, echo = TRUE, eval = FALSE}
tmp <- dataset$merge_samples(use_group = "Type")
tmp
t1 <- trans_venn$new(dataset = tmp)
# only show some sets with large intersection numbers
t1$data_summary %<>% .[.[, 1] > 20, ]
t1$plot_bar()
```
```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_venn_bar_type.png")
```
Generally, after getting the intersection results, we do not know who those shared or unique taxa are.
The composition of the unique or shared species may account for the different and similar parts of ecological characteristics across groups[@Mendes_Deciphering_2011].
So, it is interesting to further analyze the composition of unique and shared species.
For this goal, we first transform the results of venn plot to the traditional feature-sample table, that is, another object of microtable class.
```{r, echo = TRUE, eval = TRUE}
dataset1 <- dataset$merge_samples(use_group = "Group")
t1 <- trans_venn$new(dataset1)
# transform venn results to the sample-species table, here do not consider abundance, only use presence/absence.
t2 <- t1$trans_comm(use_frequency = TRUE)
# t2 is a new microtable class, each part is considered a sample
class(t2)
```
We use bar plot to show the composition at the Genus level.
```{r, echo = TRUE, eval = FALSE}
# calculate taxa abundance, that is, the frequency
t2$cal_abund()
# transform and plot
t3 <- trans_abund$new(dataset = t2, taxrank = "Genus", ntaxa = 8)
t3$plot_bar(bar_type = "part", legend_text_italic = T, ylab_title = "Frequency (%)", xtext_type_hor = FALSE, color_values = RColorBrewer::brewer.pal(8, "Set2"),
order_x = c("IW", "CW", "TW", "IW&CW", "IW&TW", "CW&TW", "IW&CW&TW"))
```
```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_bar.png")
```
We also try to use pie chart to show the compositions at the Phylum level.
```{r, echo = TRUE, eval = FALSE}
t3 <- trans_abund$new(dataset = t2, taxrank = "Phylum", ntaxa = 8)
t3$plot_pie(facet_nrow = 3, color_values = rev(c(RColorBrewer::brewer.pal(8, "Dark2"), "grey50")))
```
```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_pie.png")
```
### Key points
+ ratio parameter: ratio parameter in trans_abund$new control whether and what content appear below the taxa number in venn plot
+ return data: using trans_venn$new() return data_details and data_summary stored in trans_venn object for further ploting
<!--chapter:end:03-Composition-based_class.Rmd-->
# Diversity-based class
Diversity is one of the core topics in community ecology.
It refers to alpha diversity, beta diversity and gamma diversity.
## trans_alpha class
Alpha diversity can be transformed and visualized using trans_alpha class.
Creating the object of trans_alpha class can invoke the alpha_diversity data stored in the microtable object.
### Example
Creating trans_alpha object have two return data.frame with prefix 'data_': `data_alpha` and `data_stat`.
The data_alpha is used for the following differential test and visualization.
```{r, echo = TRUE, eval = FALSE}
t1 <- trans_alpha$new(dataset = dataset, group = "Group")
# return t1$data_stat
t1$data_stat[1:5, ]
```
```{r, echo = FALSE}
t1 <- trans_alpha$new(dataset = dataset, group = "Group")
pander::pander(t1$data_stat[1:5, ])
```
Then, we test the differences among groups using Kruskal-Wallis Rank Sum Test (overall test when groups > 2), Wilcoxon Rank Sum Tests (for paired groups),
Dunn's Kruskal-Wallis Multiple Comparisons (for paired groups when groups > 2) and anova with multiple comparisons.
```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "KW")
# return t1$res_diff
t1$res_diff[1:5, ]
```
```{r, echo = FALSE}
suppressWarnings(t1$cal_diff(method = "KW"))
pander::pander(t1$res_diff[1:5, c(1:2, 4:7)])
```
```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "KW_dunn")
# return t1$res_diff
t1$res_diff[1:5, ]
```
```{r, echo = FALSE}
t1$cal_diff(method = "KW_dunn")
pander::pander(t1$res_diff[1:5, c(1, 3:8)])
```
```{r, echo = TRUE, eval = FALSE}
# more options
t1$cal_diff(method = "wilcox")
t1$cal_diff(method = "t.test")
```
Then, let's try to use anova.
```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "anova")
# return t1$res_diff
t1$res_diff
```