--- 
title: "Tutorial for R microeco package (v0.14.1)"
author: "Chi Liu, Felipe R. P. Mansoldo, Umer Zeeshan Ijaz, Chenhao Li, Yang Cao, Jarrod J. Scott, Yaoming Cui, Alane B. Vermelho, Minjie Yao, Xiangzhen Li"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output: bookdown::gitbook
documentclass: book
bibliography: [book.bib, packages.bib]
biblio-style: apalike
link-citations: yes
github-repo: rstudio/bookdown-demo
description: "The tutorial for R microeco, file2meco, meconetcomp and mecodev packages"
---


# Background

　The R language [@R-base] and its package ecosystem are wonderful tools for data analysis.
In community ecology, a series of packages are available for statistical analysis, 
such as vegan [@Jari_vegan_2019], ape [@Paradis_ape_2018] and picante [@Picante_Kembel_2010].
However, with the development of high-throughput sequencing techniques,
the increasing amount of data and complexity of studies make data mining of microbiome data a challenge.
Some R packages have been created specifically for the statistics and visualization of microbiome data, 
such as phyloseq [@Mcmurdie_phyloseq_2013],
microbiome (https://github.com/microbiome/microbiome), microbiomeSeq (http://www.github.com/umerijaz/microbiomeSeq), 
ampvis2 (https://github.com/KasperSkytte/ampvis2), MicrobiomeR (https://github.com/vallenderlab/MicrobiomeR),
theseus [@Price_theseus_2018], rANOMALY [@Theil_rANOMALY_2021], 
tidyMicro [@Carpenter_tidyMicro_2021], microbial (https://github.com/guokai8/microbial), 
amplicon (https://github.com/microbiota/amplicon),
MicrobiotaProcess (https://github.com/YuLab-SMU/MicrobiotaProcess) 
and so on.
In addition, some web tools associated with the R language are also useful for microbiome data analysis, 
such as Shiny-phyloseq [@McMurdie_Shiny_2015], MicrobiomeExplorer [@Reeder_MicrobiomeExplorer_2021], 
animalcules [@Zhao_animalcules_2021] and Namco [@Dietrich_Namco_2022].
Even so, researchers still lack a flexible, comprehensive and modularized R package to analyze and manage the data quickly and easily.
Against this background, we created the R microeco package [@Liu_microeco_2021] (https://github.com/ChiLiubio/microeco).
In addition, we developed the file2meco package (https://github.com/ChiLiubio/file2meco) to make data input from several popular tools easy.

<!--chapter:end:index.Rmd-->

# Introduction {#intro}

The microeco package has several advantages over other R packages.
The main goal of the package is to help users analyze microbiome data quickly,
so a series of commonly used and cutting-edge approaches are implemented.
To facilitate data mining, the whole structure of the microeco package is highly modularized,
which makes the classes easy to remember, search and use.
Notably, besides following the demonstrations in this tutorial, users can save the intermediate files in each object and
apply those files to other tools according to their format requirements.
The main files stored in the object of each class use the frequently used data.frame format,
so the intermediate and result files are easily saved, modified and used with other tools in microbial ecology.
Before describing the specific usage of each class, let's first introduce several key points.

## Framework

This is a rough framework to help users quickly understand the design of the microeco package. 

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/microeco_framework.png")
```

The 'Functions' and 'Files' in the figure indicate that the user can access those functions or files in an R6 object using the `$` operator.
An example is the function `dataset$cal_alphadiv()` and its returned result `dataset$alpha_diversity`,
where `dataset` is a microtable object.
Generally, the files returned by functions are named with the prefix 'res_' so that users can easily find them with Tab completion in RStudio.
Except in the microtable class, the transformed data in a created object is generally named with the prefix 'data_'.


## R6 Class

All the main classes in the microeco package are built on the R6 class system [@R6_Winston]. 
R6 uses the encapsulated object-oriented (OO) programming paradigm, 
which means that R6 is a profoundly different OO system from S3 and S4 because it is built on encapsulated objects, rather than generic functions. 
If the user is interested in the class features, more details can be found in the 'Advanced R' book (https://adv-r.hadley.nz/).

  + A generic is a regular function, so it lives in the global namespace. An R6 method belongs to an object so it lives in a local namespace. 
This influences how we think about naming. The methods belong to objects, not generics, and the user can call them like `object$method()`.

  + R6’s reference semantics allow methods to simultaneously return a value and modify an object.
  
  + Every R6 object has an S3 class that reflects its hierarchy of R6 class.
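
The three points above can be illustrated with a minimal sketch. The `Counter` class and its `add()` method are hypothetical names invented for this illustration; they are not part of microeco.

```{r, echo = TRUE, eval = FALSE}
library(R6)

# hypothetical toy class, used only to illustrate R6 behaviour
Counter <- R6Class("Counter",
  public = list(
    count = 0,
    # the method modifies the object in place and returns it invisibly
    add = function(n = 1) {
      self$count <- self$count + n
      invisible(self)
    }
  )
)

a <- Counter$new()
# methods are called with object$method()
a$add(2)
# plain assignment does not copy an R6 object: a and b refer to the same object
b <- a
b$add(3)
a$count
# clone() creates an independent copy
d <- a$clone()
d$add(10)
a$count
# every R6 object also carries an S3 class
class(a)
```

Because of the reference semantics, modifying `b` also changes `a`, while the cloned object `d` can be changed without affecting `a`. This is why the tutorial clones a microtable object before subsetting it.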

## Help

The help documents of the microeco package work a little differently from those of many other packages.
To see the help document of a function, search for the name of the class it belongs to (not the name of the function) 
and click the link of the function.

```{r, echo = TRUE}
# first install microeco, see https://github.com/ChiLiubio/microeco
# load package microeco
library(microeco)
```

```{r, echo = TRUE, eval = FALSE}
# this can show all the functions and the detailed descriptions of the microtable class
# same with: help(microtable)
?microtable
```

## RTools
On Windows, RTools (https://cran.r-project.org/bin/windows/Rtools/) is necessary for installing some R packages from source, such as R packages hosted on GitHub.


## Dependence

### Important packages
To keep the installation and use of the microeco package simple, 
microeco depends on only a few packages, which are installed automatically from CRAN and are frequently used in data analysis.
As a result, the user may encounter an error like the following when using a class or function that invokes an additional package:

```{r, echo = TRUE, eval = FALSE}
library(microeco)
data(dataset)
t1 <- trans_network$new(dataset = dataset, filter_thres = 0.001)
t1$cal_network(network_method = "SpiecEasi")
```

```html
Error in t1$cal_network(network_method = "SpiecEasi"): igraph package not installed ...
```

<br>
The reason is that network construction requires the igraph package. We do not list igraph and some other packages in the "Imports" field of the microeco package.
In addition, some packages, e.g. SpiecEasi, are released on GitHub and cannot be installed automatically.

The solutions:

1. Install the package when such an error is encountered. Installing packages from CRAN, Bioconductor or GitHub is usually straightforward.

2. Install the packages in advance. 
This is recommended if the user is interested in most of the methods of the microeco package and wants to run a large number of examples in this tutorial.
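
For the second solution, it can help to check which optional packages are missing before running an analysis. The following is a minimal sketch using base R only; the helper name `find_missing_packages` is hypothetical and is not a microeco function.

```{r, echo = TRUE, eval = FALSE}
# hypothetical helper: return the names of packages that are not installed
find_missing_packages <- function(pkgs) {
  # requireNamespace() returns FALSE when a package is not installed;
  # it loads the namespace but does not attach the package
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# packages needed by the network example above
find_missing_packages(c("igraph", "SpiecEasi"))
```

An empty result means that all the listed packages are available. Any package that is returned still has to be installed from its own source (CRAN, Bioconductor or GitHub).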


### CRAN packages

The following packages are published on CRAN but are not installed automatically with microeco.
They are necessary to reproduce some parts of the tutorial.

```{r, echo = FALSE, eval = TRUE}
pre_pac <- read.delim("prerequite_packages.tsv")
```

```{r, echo = FALSE, eval = TRUE}
knitr::kable(pre_pac)
```

To install all or some of these packages, the following code can be used:

```{r, echo = TRUE, eval = FALSE}
# If a package is not installed, it will be installed from CRAN.
# First select the packages of interest
packages <- c("MASS", "GUniFrac", "ggpubr", "randomForest", "ggdendro", "ggrepel", "agricolae", "picante", "pheatmap", "igraph", "rgexf", 
    "ggalluvial", "ggh4x", "rcompanion", "FSA", "gridExtra", "aplot", "NST", "GGally")
# Now check or install
for(x in packages){
	if(!require(x, character.only = TRUE)) {
		install.packages(x, dependencies = TRUE)
	}
}
```

### ggtree
Plotting the cladogram from LEfSe results requires the ggtree package from Bioconductor (https://bioconductor.org/packages/release/bioc/html/ggtree.html).
```{r, echo = TRUE, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ggtree")
```

### Gephi
Gephi is an excellent network visualization tool and can be used to open the saved network file, 
i.e. network.gexf in the [tutorial](https://chiliubio.github.io/microeco_tutorial/model-based-class.html#trans_network-class).
You can download Gephi and learn how to use it from https://gephi.org/users/download/


### Tax4Fun
Tax4Fun is an R package used for predicting the functional potential of prokaryotic communities.

1. Install the Tax4Fun package
```{r, echo = TRUE, eval = FALSE}
install.packages("RJSONIO")
install.packages(system.file("extdata", "biom_0.3.12.tar.gz", package="microeco"), repos = NULL, type = "source")
install.packages(system.file("extdata", "qiimer_0.9.4.tar.gz", package="microeco"), repos = NULL, type = "source")
install.packages(system.file("extdata", "Tax4Fun_0.3.1.tar.gz", package="microeco"), repos = NULL, type = "source")
```

2. Download the SILVA123 reference data from http://tax4fun.gobics.de/  
　Unzip SILVA123.zip and provide the path of the unzipped folder to the folderReferenceData parameter of the cal_tax4fun function in the trans_func class.


### Tax4Fun2
Tax4Fun2 is another R package for the prediction of functional profiles and functional gene redundancies of prokaryotic communities [@Wemheuer_Tax4Fun2_2020].
It has higher accuracy than PICRUSt and Tax4Fun. The Tax4Fun2 approach implemented in microeco is a little different from the original package.
Using the Tax4Fun2 approach requires the representative sequences in a fasta file.
The user does not need to install the Tax4Fun2 R package.
The only requirements are to download the blast tool (**ignore this if the blast tool is already in the path**) and the Ref99NR or Ref100NR database (select one).
Download the blast tools from "ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+" ; e.g. ncbi-blast-\*\*\*\*-x64-win64.tar.gz for Windows.
Note that the latest versions can produce errors because of a memory issue (https://www.biostars.org/p/413294/). 
An easy solution is to use a previous version (such as 2.5.0).
Download Ref99NR.zip from "https://cloudstor.aarnet.edu.au/plus/s/DkoZIyZpMNbrzSw/download"  or Ref100NR.zip from "https://cloudstor.aarnet.edu.au/plus/s/jIByczak9ZAFUB4/download" .
Uncompress all the folders. The final folder structures should look like this:

blast tools:  
　|-- ncbi-blast-2.5.0+  
　　|---- bin  
　　　|------ blastn.exe  
　　　|------ makeblastdb.exe  
　　　|------ ......  

Ref99NR:  
　|-- Tax4Fun2_ReferenceData_v2  
　　|---- Ref99NR  
　　　|------ otu000001.tbl.gz  
　　　|------ ......  
　　　|------ Ref99NR.fasta  
　　　|------ Ref99NR.tre  

The path "Tax4Fun2_ReferenceData_v2" is required by the `cal_tax4fun2()` function of the trans_func class.
The blast tool path "ncbi-blast-2.5.0+/bin" is also required if it has not been added to the system path (environment variable).
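
Before running the prediction, it may be worth confirming that the folders are where they are expected. The following base R sketch checks the layout described above; the helper name `check_tax4fun2_setup` and its arguments are hypothetical and are not part of microeco.

```{r, echo = TRUE, eval = FALSE}
# hypothetical helper: check the folder layout described above
check_tax4fun2_setup <- function(ref_dir, blast_bin = NULL) {
  blastn <- if (is.null(blast_bin)) {
    # look for blastn in the system path
    unname(Sys.which("blastn"))
  } else {
    # the file is named blastn.exe on Windows and blastn elsewhere
    candidates <- file.path(blast_bin, c("blastn", "blastn.exe"))
    found <- candidates[file.exists(candidates)]
    if (length(found) > 0) found[1] else ""
  }
  c(
    reference_dir = dir.exists(ref_dir),
    ref99nr_or_ref100nr = any(dir.exists(file.path(ref_dir, c("Ref99NR", "Ref100NR")))),
    blastn_found = nzchar(blastn)
  )
}

# replace the two paths with the real locations on your computer
check_tax4fun2_setup("Tax4Fun2_ReferenceData_v2", "ncbi-blast-2.5.0+/bin")
```

All three elements of the returned logical vector should be TRUE before the two paths are passed on.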


```{r, echo = TRUE, eval = FALSE}
# Either seqinr or Biostrings package should be installed for reading and writing fasta file
install.packages("seqinr", dependencies = TRUE)
# or install Biostrings from bioconductor https://bioconductor.org/packages/release/bioc/html/Biostrings.html
# Now we show how to read the fasta file
# see https://github.com/ChiLiubio/file2meco to install file2meco
rep_fasta_path <- system.file("extdata", "rep.fna", package="file2meco")
rep_fasta <- seqinr::read.fasta(rep_fasta_path)
# or use Biostrings package
rep_fasta <- Biostrings::readDNAStringSet(rep_fasta_path)
# try to create a microtable object with rep_fasta
data("otu_table_16S")
# In microtable class, all the taxa names should be necessarily included in rep_fasta
otu_table_16S <- otu_table_16S[rownames(otu_table_16S) %in% names(rep_fasta), ]
test <- microtable$new(otu_table = otu_table_16S, rep_fasta = rep_fasta)
test
```


## Plot
Most of the plots in the package rely on the ggplot2 package.
We provide some parameters to optimize the corresponding plot, but they may be far from enough.
The user can also assign the output to a name and use ggplot2-style grammar to modify it.
Each data table used for visualization is stored in the object and can be saved for customized analysis.
Of course, the user can also directly modify a class and reload it.
Any contribution of a modified class is appreciated via GitHub pull requests (https://github.com/ChiLiubio/microeco_tutorial/pulls) or email (liuchi0426@126.com).


<!--chapter:end:01-intro.Rmd-->

# Basic class
The microtable class is the basic class.
All the other classes depend on the microtable class.

```{r, out.width = "8000px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/microtable_framework.png")
```

The objects inside the solid-line rectangles represent functions.
A red rectangle marks an extremely important function.
The dashed lines denote the key objects (inputs or outputs of functions) that deserve more attention. 


## microtable class

　Many tools can be used for the bioinformatic analysis of amplicon sequencing data, such as QIIME [@Caporaso_QIIME_2010], QIIME2 [@Bolyen_Reproducible_2019], 
usearch (https://www.drive5.com/usearch/), mothur [@Schloss_Introducing_2009], 
SILVAngs (https://ngs.arb-silva.de/silvangs/), 
and RDP (http://rdp.cme.msu.edu/).
Although the formats of result files may vary across tools, the main contents can be generally classified into the following parts:
(1) OTU/ASV table, i.e. the feature-sample abundance table; 
(2) taxonomic assignment table;
(3) representative sequences;
(4) phylogenetic tree;
(5) metadata. 
It is generally useful to create a detailed sample metadata table to store all the sample information (including the environmental data).

　The microtable class is the basic class and is designed to store the basic data for all the downstream analyses in the microeco package.
At a minimum, the OTU table (i.e. feature-sample abundance table) must be provided to create a microtable object.
When the sample information table is missing, the microtable class detects this and creates a default sample table according to
the sample names in otu_table.
To make file input more convenient,
we also built another R package, file2meco (https://github.com/ChiLiubio/file2meco), to read the output files of some tools into a microtable object.
Currently, those tools include not only the commonly used QIIME [@Caporaso_QIIME_2010] and QIIME2 [@Bolyen_Reproducible_2019], 
but also several metagenomic tools, such as HUMAnN [@Franzosa_Species_2018] and kraken2 [@Wood_Improved_2019].
In this tutorial, the data inside the package is used to show some basic operations.


### Prepare the example data

　The 16S rRNA gene sequencing results in the example data of the package are used to show the main part of the tutorial.
This dataset contains the 16S rRNA gene MiSeq sequencing results of wetland soils in China published by An et al. [@An_Soil_2019], 
who surveyed soil prokaryotic communities in Chinese inland wetlands (IW), 
coastal wetlands (CW) and Tibet plateau wetlands (TW) using amplicon sequencing.
These wetlands include both saline and non-saline samples (classified for the tutorial).
The sample information table has 4 columns: "SampleID", "Group", "Type" and "Saline".
The column "SampleID" is the same as the rownames.
The column "Group" represents the wetland type: IW, CW or TW.
The column "Type" means the sampling region: northeastern region (NE), northwest region (NW), North China area (NC),
middle-lower reaches of the Yangtze River (YML), southern coastal area (SC), upper reaches of the Yangtze River (YU), Qinghai-Tibet Plateau (QTP). 
The column "Saline" distinguishes saline soils from non-saline soils.
In this dataset, the environmental factor table is separate from the sample information table.
Putting all the environmental data into the sample information table is also recommended.


```{r, echo = TRUE}
library(microeco)
# load the example data; 16S rRNA gene amplicon sequencing dataset
# metadata table; data.frame
data(sample_info_16S)
# feature table; data.frame
data(otu_table_16S)
# taxonomic assignment table; data.frame
data(taxonomy_table_16S)
# phylogenetic tree; not necessary; use for the phylogenetic analysis
# Newick format; use read.tree function of ape package to read a tree
data(phylo_tree_16S)
# load the environmental data table if it is not in sample table
data(env_data_16S)
# use pipe operator in magrittr package
library(magrittr)
# fix the random number generation to make the results repeatable
set.seed(123)
# make the plotting background same with the tutorial
library(ggplot2)
theme_set(theme_bw())
```

Make sure that sample_table, otu_table and tax_table are all of class `data.frame`, as shown in the following part.

```{r, echo = TRUE}
class(otu_table_16S)
```

```{r, echo = TRUE, eval = FALSE}
otu_table_16S[1:5, 1:5]
```

```{r, echo = FALSE}
pander::pander(otu_table_16S[1:5, 1:5])
```

```{r, echo = TRUE}
class(taxonomy_table_16S)
```

```{r, echo = TRUE, eval = FALSE}
taxonomy_table_16S[1:5, 1:3]
```

```{r, echo = FALSE}
pander::pander(taxonomy_table_16S[1:5, 1:3])
```

Generally, a user's taxonomic table contains some messy information, such as NA, unidentified and unknown.
This information can influence the subsequent taxonomic abundance calculation and other taxonomy-based analyses.
So it is usually necessary to clean the table using the `tidy_taxonomy` function.
Another very important result of this operation is to **unify the taxonomic prefix** automatically,
e.g., converting D_1__ to p__ for Phylum level or adding p__ to Phylum directly if no prefix is found.


```{r, echo = TRUE, eval = FALSE}
# make the taxonomic information unified, very important
taxonomy_table_16S %<>% tidy_taxonomy
```
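
To make the idea of a unified prefix concrete, the following base R sketch cleans a toy Phylum column. It is a simplified illustration under stated assumptions, not the implementation of `tidy_taxonomy`; the helper name `unify_prefix` and the list of labels treated as unclassified are hypothetical.

```{r, echo = TRUE, eval = FALSE}
# hypothetical helper, not the tidy_taxonomy implementation
unify_prefix <- function(x, prefix = "p__") {
  # remove an existing prefix such as "D_1__" or "p__" (everything up to the last "__")
  bare <- sub("^.*__", "", x)
  # treat uninformative labels as unclassified
  bare[is.na(bare) | tolower(bare) %in% c("unidentified", "unknown")] <- ""
  # add the unified prefix
  paste0(prefix, bare)
}

phylum <- c("D_1__Proteobacteria", "Firmicutes", NA, "unknown", "p__Acidobacteria")
unify_prefix(phylum)
```

In this sketch the classified entries become p__Proteobacteria, p__Firmicutes and p__Acidobacteria, while the NA and "unknown" entries are both reduced to the bare prefix p__.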

The rownames of sample_table in the microtable object (i.e. sample names) are used for selecting samples/groups in all the related operations in the package.
Using purely numeric sample names is **not recommended**, because they can lead to unexpected reordering or human error.
**Before creating a microtable object, make sure that the rownames of the sample information table are sample names**.


```{r, echo = TRUE}
class(sample_info_16S)
```

```{r, echo = TRUE, eval = FALSE}
sample_info_16S[1:5, ]
```

```{r, echo = FALSE}
pander::pander(sample_info_16S[1:5, ])
```

In this example, the environmental data is stored separately in env_data_16S. 
The user can also directly integrate those data into the sample information table.

```{r, echo = TRUE}
class(env_data_16S)
```

```{r, echo = FALSE}
pander::pander(env_data_16S[1:5, 1:5])
```
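
Integrating the environmental data into the sample information table can be done with base R. The sketch below uses two small made-up tables (`sample_info` and `env_data` are hypothetical names and values) and assumes that the rownames of both tables are sample names.

```{r, echo = TRUE, eval = FALSE}
# hypothetical toy tables; rownames are sample names in both
sample_info <- data.frame(Group = c("CW", "IW", "TW"), row.names = c("S1", "S2", "S3"))
env_data <- data.frame(
  pH = c(7.9, 6.5, 8.2),
  Temperature = c(12.1, 15.3, 4.8),
  row.names = c("S3", "S1", "S2")
)

# reorder env_data by the sample names before binding the columns,
# because the two tables may list the samples in a different order
sample_info_env <- cbind(sample_info, env_data[rownames(sample_info), , drop = FALSE])
sample_info_env
```

Indexing `env_data` by `rownames(sample_info)` is the important step: binding the columns without it would silently pair values with the wrong samples.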

```{r, echo = TRUE}
class(phylo_tree_16S)
```

Then, we create an object of the microtable class.
This operation is very similar to that in the phyloseq package [@Mcmurdie_phyloseq_2013], but it is more concise in microeco.
The otu_table in the microtable class must be in feature-sample format: rownames - OTU/ASV/pathway/other names; colnames - sample names.
**The colnames of otu_table must overlap with the rownames of sample_table**.
Otherwise, the subsequent check will filter out all the samples of otu_table because no sample names are shared between otu_table and sample_table.


```{r, echo = TRUE}
# In R6 class, '$new' is the original method used to create a new object of class
# If you only provide abundance table, the class can help you create a sample info table
dataset <- microtable$new(otu_table = otu_table_16S)
class(dataset)
# generally add the metadata
dataset <- microtable$new(otu_table = otu_table_16S, sample_table = sample_info_16S)
dataset
# Let's create a microtable object with more information
dataset <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S)
dataset
```
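
When a newly created object unexpectedly contains no samples, a quick check of the name overlap usually reveals the cause. The following base R sketch uses made-up tables (`otu_toy` and `sample_toy` are hypothetical) to report which sample names are shared and which appear in only one table.

```{r, echo = TRUE, eval = FALSE}
# hypothetical toy tables with partly mismatched sample names
otu_toy <- data.frame(S1 = c(10, 5), S2 = c(0, 8), S4 = c(3, 1), row.names = c("OTU_1", "OTU_2"))
sample_toy <- data.frame(Group = c("CW", "IW", "TW"), row.names = c("S1", "S2", "S3"))

# sample names present in both tables
shared <- intersect(colnames(otu_toy), rownames(sample_toy))
# sample names present in only one of the tables
only_in_otu <- setdiff(colnames(otu_toy), rownames(sample_toy))
only_in_sample <- setdiff(rownames(sample_toy), colnames(otu_toy))

if (length(shared) == 0) {
  stop("No sample names are shared between the abundance table and the sample table")
}
shared
only_in_otu
only_in_sample
```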

### How to read your files to microtable object?

The example data above are loaded directly from the microeco package.
So the question is __how to read your data to create a microtable object?__  
There are two ways:

▲ 1. __Use file2meco package__  
The R package file2meco (https://chiliubio.github.io/microeco_tutorial/file2meco-package.html) is designed to directly read the output files of some popular tools into a microtable object.
Currently, it supports QIIME [@Caporaso_QIIME_2010], QIIME2 [@Bolyen_Reproducible_2019], 
HUMAnN [@Franzosa_Species_2018], MetaPhlAn [@Truong_MeTApHLaN2_2015], kraken2 [@Wood_Improved_2019], phyloseq [@Mcmurdie_phyloseq_2013], etc.
Please read the tutorial of the file2meco package for more detailed information (https://chiliubio.github.io/microeco_tutorial/file2meco-package.html).

▲ 2. __Other cases__  
To transform customized files into a microtable object, 
two steps are needed:  
__I) read files to R__  
The parameters __otu_table__, __sample_table__ and __tax_table__ of microtable\$new all require the data.frame format, which is the most frequently used data format in R.
So whatever format the files are in, they should first be read into R with functions such as `read.table` and `read.csv`.
To perform phylogenetic analysis, please also read the phylogenetic tree using the `read.tree` function of the ape package and 
provide the tree to the __phylo_tree__ parameter of the microtable\$new function as in the above example.  
__II) create the microtable object__  
Then the user can create the microtable object as in the last section.
For detailed descriptions, please also see the help document of the microtable class using the following help command.

```{r, echo = TRUE, eval = FALSE}
# search the class name, not the function name
?microtable
# then see microtable$new()
```
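
The two steps can be sketched with base R as follows. The file names and contents are made up for this illustration and are written to a temporary folder first so that the example is self-contained; with real data, only the two reading calls are needed.

```{r, echo = TRUE, eval = FALSE}
# write two small hypothetical input files to a temporary folder
tmp_dir <- tempfile("microeco_input_")
dir.create(tmp_dir)
otu_file <- file.path(tmp_dir, "otu_table.tsv")
sample_file <- file.path(tmp_dir, "sample_info.csv")
writeLines(c("OTU_ID\tS1\tS2\tS3", "OTU_1\t10\t0\t3", "OTU_2\t5\t8\t0"), otu_file)
writeLines(c("SampleID,Group", "S1,CW", "S2,IW", "S3,TW"), sample_file)

# I) read files to R; row.names = 1 uses the first column as rownames
# check.names = FALSE keeps the sample names exactly as written in the file
otu_table <- read.delim(otu_file, row.names = 1, check.names = FALSE)
sample_table <- read.csv(sample_file, row.names = 1, stringsAsFactors = FALSE)
class(otu_table)
class(sample_table)

# II) create the microtable object (requires the microeco package)
# dataset <- microtable$new(otu_table = otu_table, sample_table = sample_table)
```

Setting `check.names = FALSE` is a deliberate choice here: otherwise R may alter sample names that are not syntactically valid (for example names starting with a number), and they would no longer match the rownames of the sample table.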

### Functions in microtable class

Then, we remove OTUs that are not assigned to the Kingdom "k__Archaea" or "k__Bacteria". 
```{r, echo = TRUE}
# use R subset function to filter taxa in tax_table
dataset$tax_table %<>% base::subset(Kingdom == "k__Archaea" | Kingdom == "k__Bacteria")
# another way with grepl function
dataset$tax_table %<>% .[grepl("Bacteria|Archaea", .$Kingdom), ]
dataset
```

We also remove OTUs with the taxonomic assignments "mitochondria" or "chloroplast".

```{r, echo = TRUE}
# This removes the rows containing the taxa words regardless of taxonomic rank, ignoring case, in the tax_table.
# To filter taxa that are not considered contamination, please use subset as in the previous operation to filter tax_table.
dataset$filter_pollution(taxa = c("mitochondria", "chloroplast"))
dataset
```

To make the OTU and sample information consistent across all files in the dataset object, we use function `tidy_dataset` to trim the dataset.

```{r, echo = TRUE}
dataset$tidy_dataset()
print(dataset)
```
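
Conceptually, trimming means keeping only the features and samples that are shared by all tables. The following base R sketch shows this idea on made-up tables; it is a simplified illustration, and `tidy_dataset` may perform additional steps (for instance handling the phylogenetic tree) that are not reproduced here.

```{r, echo = TRUE, eval = FALSE}
# hypothetical toy tables that are out of sync
otu <- data.frame(S1 = c(10, 5, 1), S2 = c(0, 8, 2), S3 = c(3, 1, 4),
  row.names = c("OTU_1", "OTU_2", "OTU_3"))
# OTU_2 has been removed from the taxonomy table
tax <- data.frame(Phylum = c("p__A", "p__B"), row.names = c("OTU_1", "OTU_3"))
# S3 has been removed from the sample table
samp <- data.frame(Group = c("CW", "IW"), row.names = c("S1", "S2"))

# names shared across the tables
taxa_keep <- intersect(rownames(otu), rownames(tax))
samples_keep <- intersect(colnames(otu), rownames(samp))

# trim every table to the shared names
otu <- otu[taxa_keep, samples_keep, drop = FALSE]
tax <- tax[taxa_keep, , drop = FALSE]
samp <- samp[samples_keep, , drop = FALSE]
dim(otu)
```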

Then let's use sample_sums() to check the sequence numbers in each sample.

```{r, echo = TRUE}
dataset$sample_sums() %>% range
```

Sometimes, to reduce the effect of sequencing depth on the diversity measurements, 
the user can optionally resample the data so that the sequence number is equal for each sample.
The function `rarefy_samples` invokes the function `tidy_dataset` automatically before and after rarefying.

```{r, echo = TRUE}
# As an example, use 10000 sequences in each sample
dataset$rarefy_samples(sample.size = 10000)
dataset$sample_sums() %>% range
```
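
The idea behind this resampling can be sketched for a single sample in base R: sequences are drawn at random without replacement until the target depth is reached. The helper `rarefy_counts` is a hypothetical name, and this sketch is not necessarily how `rarefy_samples` is implemented internally.

```{r, echo = TRUE, eval = FALSE}
# hypothetical helper: subsample the counts of one sample to a fixed depth without replacement
rarefy_counts <- function(counts, depth) {
  stopifnot(sum(counts) >= depth)
  # one entry per sequence, labelled by the index of its feature
  pool <- rep(seq_along(counts), times = counts)
  picked <- pool[sample.int(length(pool), size = depth)]
  # count how many sequences of each feature were picked
  tabulate(picked, nbins = length(counts))
}

set.seed(123)
toy_counts <- c(50, 30, 20, 0)
rarefied <- rarefy_counts(toy_counts, depth = 40)
# the total now equals the chosen depth
sum(rarefied)
```

Because the draw is random, the result depends on the random seed, which is one reason the tutorial fixes the seed with `set.seed` when loading the example data.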

Then, let's calculate the taxa abundance at each taxonomic rank using `cal_abund()`.
This function **generates a list called taxa_abund stored in the microtable object**. 
This list contains several data frames of the abundance information at each taxonomic rank.
It's worth noting that the `cal_abund()` function can be used to **solve more complicated cases with special parameters**,
such as calculating either relative or absolute abundance and selecting a subset of the 'taxonomic' columns.
These cases are shown in the file2meco package part (https://chiliubio.github.io/microeco_tutorial/file2meco-package.html#humann-metagenomic-results) with a complex metagenomic dataset.

```{r, echo = TRUE}
# use default parameters
dataset$cal_abund()
# return dataset$taxa_abund
class(dataset$taxa_abund)
```

```{r, echo = TRUE, eval = FALSE}
# show part of the relative abundance at Phylum level
dataset$taxa_abund$Phylum[1:5, 1:5]
```

```{r, echo = FALSE}
pander::pander(dataset$taxa_abund$Phylum[1:5, 1:5])
```
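
The general idea of a rank-level relative abundance table can be sketched in base R: counts of features that share the same taxon are summed, and each sample is then scaled to a total of 1. The toy tables below are made up, and the sketch is an illustration of the concept rather than the implementation of `cal_abund()`.

```{r, echo = TRUE, eval = FALSE}
# hypothetical toy abundance table (features x samples) and taxonomy
otu <- matrix(c(10, 30, 60, 5, 5, 90), nrow = 3,
  dimnames = list(c("OTU_1", "OTU_2", "OTU_3"), c("S1", "S2")))
tax <- data.frame(Phylum = c("p__A", "p__A", "p__B"), row.names = rownames(otu))

# sum the counts of all features that belong to the same Phylum
phylum_counts <- rowsum(otu, group = tax[rownames(otu), "Phylum"])
# divide each column (sample) by its total to obtain relative abundance
phylum_rel <- prop.table(phylum_counts, margin = 2)
phylum_rel
```

In this toy case p__A accounts for 0.4 of sample S1 and 0.1 of sample S2, and every column sums to 1.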


The function `save_abund()` can be used to save the taxa abundance file to a local place easily.

```{r, echo = TRUE, eval = FALSE}
dataset$save_abund(dirpath = "taxa_abund")
```

Then, let's calculate the alpha diversity.
The result is also stored in the microtable object automatically.


```{r, echo = TRUE}
# If you want to add Faith's phylogenetic diversity, use PD = TRUE, this will be a little slow
dataset$cal_alphadiv(PD = FALSE)
# return dataset$alpha_diversity
class(dataset$alpha_diversity)
```
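
As a rough illustration of what alpha diversity indices measure, the sketch below computes the observed richness and the Shannon index, H = -sum(p_i * ln(p_i)), for one made-up sample in base R. Which indices `cal_alphadiv()` returns should be checked in `dataset$alpha_diversity`; the sketch does not reproduce that function.

```{r, echo = TRUE, eval = FALSE}
# hypothetical counts of five features in one sample
counts <- c(10, 10, 10, 10, 0)

# observed richness: number of features that are present
richness <- sum(counts > 0)
# Shannon index based on the proportions of the features that are present
p <- counts[counts > 0] / sum(counts)
shannon <- -sum(p * log(p))

richness
shannon
```

With four equally abundant features, the Shannon index equals log(4), the maximum possible value for a richness of 4.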

```{r, echo = TRUE, eval = FALSE}
# save dataset$alpha_diversity to a directory
dataset$save_alphadiv(dirpath = "alpha_diversity")
```

Let's go on to beta diversity with the function `cal_betadiv()`.
If the method parameter is not provided, the function automatically calculates Bray-Curtis, Jaccard, weighted UniFrac and unweighted UniFrac matrices.


```{r, echo = FALSE, eval = TRUE, message = FALSE}
invisible(dataset$cal_betadiv(unifrac = FALSE))
```

```{r, echo = TRUE, eval = FALSE}
# unifrac = TRUE also calculates the UniFrac metrics; unifrac = FALSE skips them
# UniFrac calculation requires the GUniFrac package to be installed
dataset$cal_betadiv(unifrac = TRUE)
# return dataset$beta_diversity
class(dataset$beta_diversity)
# save dataset$beta_diversity to a directory
dataset$save_betadiv(dirpath = "beta_diversity")
```
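
For readers unfamiliar with these metrics, the Bray-Curtis dissimilarity between two samples is the sum of the absolute abundance differences divided by the total abundance of both samples. The following base R sketch applies this standard formula to two made-up samples; the helper name `bray_curtis` is hypothetical, and any data transformation applied inside `cal_betadiv()` is not reproduced here.

```{r, echo = TRUE, eval = FALSE}
# hypothetical helper implementing the standard Bray-Curtis formula
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

# abundances of three features in two made-up samples
s1 <- c(10, 0, 5)
s2 <- c(5, 5, 5)
bray_curtis(s1, s2)
```

The value ranges from 0 (identical composition) to 1 (no shared features); for the two toy samples it is 10/30.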


### Subset of samples

We do not provide a special function to filter samples in a microtable object, as we think it is redundant.
**We recommend manipulating the sample_table in the microtable object directly.**
For example, to extract the samples of the 'CW' group: 

```{r, echo = TRUE}
# remember first clone the whole dataset
# see https://chiliubio.github.io/microeco_tutorial/notes.html#clone-function
group_CW <- clone(dataset)
# select 'CW'
group_CW$sample_table <- subset(group_CW$sample_table, Group == "CW")
# or: group_CW$sample_table <- subset(group_CW$sample_table, grepl("CW", Group))
# use tidy_dataset to trim all the basic files
group_CW$tidy_dataset()
group_CW
```

### Filter features

Please use the `filter_taxa` function to filter out features with low abundance or occurrence frequency.
For other operations on the features, please directly manipulate the otu_table of your microtable object.

```{r, echo = TRUE}
# It is better to have a backup before filtering features
dataset_filter <- clone(dataset)
# mean relative abundance threshold 0.0001
# occurrence frequency 0.1; 10% samples have the target features
dataset_filter$filter_taxa(rel_abund = 0.0001, freq = 0.1)
```
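
The two thresholds can be understood with a base R sketch on a made-up abundance table. It assumes that a feature is kept when both its mean relative abundance and its occurrence frequency reach the thresholds; whether `filter_taxa` uses exactly these comparisons is an assumption of this illustration.

```{r, echo = TRUE, eval = FALSE}
# hypothetical toy abundance table (features x samples)
otu <- matrix(c(90, 10, 0,
                80, 20, 0,
                95, 0, 5,
                70, 0, 30),
  nrow = 3,
  dimnames = list(c("OTU_1", "OTU_2", "OTU_3"), c("S1", "S2", "S3", "S4")))

# relative abundance within each sample
rel <- prop.table(otu, margin = 2)
# mean relative abundance of each feature across samples
mean_rel <- rowMeans(rel)
# occurrence frequency: fraction of samples in which the feature is present
occurrence <- rowMeans(otu > 0)

# keep the features that pass both thresholds (assumed to be inclusive)
keep <- mean_rel >= 0.08 & occurrence >= 0.5
filtered <- otu[keep, , drop = FALSE]
rownames(filtered)
```

Here OTU_2 is removed because its mean relative abundance (0.075) is below the threshold, although it occurs in half of the samples.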

### Merge taxa or samples

Merging taxa according to a specific taxonomic rank level of tax_table can generate a new microtable object.
In the new microtable object, each feature in otu_table represents one taxon at the output level.

```{r, echo = TRUE}
test <- dataset$merge_taxa(taxa = "Family")
test
```

Similarly, merging samples according to a specific group of sample_table can also generate a new microtable object.

```{r, echo = TRUE}
test <- dataset$merge_samples(use_group = "Group")
test
```


### Other examples

In `microtable$new`, if auto_tidy = TRUE, the function automatically uses `tidy_dataset` to make all files consistent.
All other functions in microtable will then do the same. However, if the user changes a file in the microtable object directly,
the class cannot recognize this modification, so the user should call the `tidy_dataset` function to trim the microtable object manually.

```{r, echo = TRUE, eval = TRUE}
test <- microtable$new(sample_table = sample_info_16S[1:40, ], otu_table = otu_table_16S, auto_tidy = FALSE)
test
test1 <- microtable$new(sample_table = sample_info_16S[1:40, ], otu_table = otu_table_16S, auto_tidy = TRUE)
test1
test1$sample_table %<>% .[1:10, ]
test1
test1$tidy_dataset()
test1
```

The phylogenetic tree can be read with `read.tree` function in ape package.

```{r, echo = TRUE, eval = FALSE}
# use the example data rep_phylo.tre in file2meco package https://chiliubio.github.io/microeco_tutorial/file2meco-package.html#qiime
phylo_file_path <- system.file("extdata", "rep_phylo.tre", package="file2meco")
tree <- ape::read.tree(phylo_file_path)
```

Other functions and examples are listed here.

```{r, echo = TRUE, eval = FALSE}
# clone a complete dataset named test
test <- clone(dataset)

# add the rownames of tax_table as the last column of tax_table directly; useful in some analysis, e.g. biomarker finding at OTU/ASV level.
ncol(test$tax_table)
test$add_rownames2taxonomy(use_name = "OTU")
ncol(test$tax_table)

# sum the abundance for each taxa; very useful for taxa abundance checking and filtering
test$taxa_sums()

# rename feature names in all the files of microtable object
test$rename_taxa(newname_prefix = "new_name_")
rownames(test$otu_table)[1:5]
rownames(test$tax_table)[1:5]

# output sample names in microtable object
test$sample_names()[1:5]

# output taxa names in microtable object
test$taxa_names()[1:5]
```


### Key points

  + sample_table: rownames of sample_table must be the sample names
  + otu_table: rownames must be feature names; colnames must be sample names
  + `microtable` class: creating a microtable object requires at least one input file (otu_table)
  + `tidy_taxonomy()`: necessary to give the taxonomic table a unified format
  + `tidy_dataset()`: necessary to trim the files in a microtable object
  + `add_rownames2taxonomy()`: add the rownames of tax_table as the last column of tax_table
  + `cal_abund()`: powerful and flexible to cope with complex cases in tax_table, see the parameters
  + taxa_abund: a list stored in the microtable object that contains several data frames
  + beta_diversity: a list stored in the microtable object that contains several distance matrices


<!--chapter:end:02-Basic_class.Rmd-->

# Composition-based class

The trans_abund class and trans_venn class are organised into the section 'Composition-based class', 
since they are mainly used to show the composition information of communities.


## trans_abund class
　The trans_abund class has several functions to visualize taxonomic abundance based on the ggplot2 package.


### Example

We first show the bar plot example.

```{r, echo = TRUE}
# create trans_abund object
# use 10 Phyla with the highest abundance in the dataset.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10)
# t1 object now includes the transformed abundance data t1$abund_data and other elements for the following plotting
```
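
Selecting the most abundant taxa for a bar plot can be sketched in base R as follows. The sketch assumes that taxa are ranked by their mean relative abundance across all samples and that the remaining taxa are pooled into an 'Others' row; the toy table is made up, and the exact ranking rule used by `trans_abund` is not confirmed here.

```{r, echo = TRUE, eval = FALSE}
# hypothetical relative abundance table (taxa x samples)
rel <- matrix(c(0.5, 0.3, 0.15, 0.05,
                0.4, 0.4, 0.10, 0.10),
  ncol = 2,
  dimnames = list(c("p__A", "p__B", "p__C", "p__D"), c("S1", "S2")))
ntaxa <- 2

# rank the taxa by mean relative abundance (assumed ranking rule)
ranked <- names(sort(rowMeans(rel), decreasing = TRUE))
top <- ranked[seq_len(min(ntaxa, length(ranked)))]

# keep the top taxa and pool all remaining taxa into 'Others'
plot_data <- rbind(
  rel[top, , drop = FALSE],
  Others = colSums(rel[setdiff(rownames(rel), top), , drop = FALSE])
)
plot_data
```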

As the sample number is large, we do not show the sample names on the x axis and add facets to show abundance according to groups.

```{r, echo = TRUE, eval = FALSE}
t1$plot_bar(others_color = "grey70", facet = "Group", xtext_keep = FALSE, legend_text_italic = FALSE)
# return a ggplot2 object
```

```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar.png")
```

Two or more facets are supported with the facet parameter from v0.14.0 by providing a vector with multiple elements.

```{r, echo = TRUE, eval = FALSE}
# require package ggh4x, first run install.packages("ggh4x") if not installed
t1$plot_bar(others_color = "grey70", facet = c("Group", "Type"), xtext_keep = FALSE, legend_text_italic = FALSE, barwidth = 1)
```

```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar_facet2.png")
```

By default, all the unclassified taxa (i.e. p__ or g__ in a tax_table that has been processed by the `tidy_taxonomy` function) are filtered out, 
as those unknown taxa are generally meaningless.
However, these unknown taxa are sometimes meaningful for users.
For example, if one wants to isolate some unknown species, it is valuable to check the abundance of those unknown taxa.
In this case, please see this topic (https://github.com/ChiLiubio/microeco/issues/165) on how to show unknown taxa with hierarchical taxonomy classification.
The alluvial plot is also implemented in the plot_bar function with the use_alluvium parameter.

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 8)
# require ggalluvial package
# use_alluvium = TRUE makes the alluvial plot; clustering = TRUE can be used to reorder the samples by clustering
# bar_type = "notfull" discards 'others'; color_values selects another color palette
t1$plot_bar(bar_type = "notfull", use_alluvium = TRUE, clustering = TRUE, xtext_type_hor = FALSE, xtext_size = 6, color_values = RColorBrewer::brewer.pal(8, "Set2"))
```

```{r, fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar_allu.png")
```

The bar plot can also be drawn with group mean values.

```{r, echo = TRUE, eval = FALSE}
# The groupmean parameter can be used to obtain the group-mean barplot.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10, groupmean = "Group")
g1 <- t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)
g1 + theme_classic() + theme(axis.title.y = element_text(size = 18))
```

```{r, out.width = "400px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar_mean.png")
```
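
As a rough illustration of what a group-mean bar plot displays, the base R sketch below averages the samples of each group taxon by taxon. The toy table and the sample-to-group mapping are hypothetical, and the code only shows the idea behind `groupmean`, not how the package computes it.

```{r, echo = TRUE, eval = FALSE}
# toy relative abundance table (%): rows are phyla, columns are samples
abund <- matrix(
  c(60, 50, 30, 20,
    40, 50, 70, 80),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("p__A", "p__B"), c("S1", "S2", "S3", "S4"))
)
# hypothetical grouping of the four samples
group <- c(S1 = "CW", S2 = "CW", S3 = "IW", S4 = "IW")
# split the sample names by group, then average the samples of each group
samples_by_group <- split(colnames(abund), group)
group_mean <- sapply(samples_by_group, function(s) rowMeans(abund[, s, drop = FALSE]))
group_mean
```

The result has one column per group instead of one column per sample, so the plot shows one bar per group.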

The box plot is an excellent way to intuitively show the abundance distribution across groups.

```{r, echo = TRUE, eval = FALSE}
# show 15 taxa at Class level
t1 <- trans_abund$new(dataset = dataset, taxrank = "Class", ntaxa = 15)
t1$plot_box(group = "Group")
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_box.png")
```

Then we show a heatmap of the most abundant genera.

```{r, echo = TRUE, eval = FALSE}
# show 40 taxa at Genus level
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 40)
t1$plot_heatmap(facet = "Group", xtext_keep = FALSE, withmargin = FALSE)
```

```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_heatmap.png")
```

A line chart is very useful for showing how the abundance of taxa changes along time, space or other gradients.

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 5)
t1$plot_line()
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 5, group = "Type")
t1$plot_line(position = position_dodge(0.3), xtext_type_hor = TRUE)
```

```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_line_type.png")
```

Then, we show the pie chart with the group mean values.

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 6, groupmean = "Group")
# show all pie charts in one row
t1$plot_pie(facet_nrow = 1)
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_pie.png")
```

### Key points

  + trans_abund$new: creating a trans_abund object invokes taxa_abund in the microtable object for transformation
  + color_values parameter: used in each plotting function to select colors
  + input_taxaname parameter: used in trans_abund$new to select custom taxa of interest instead of abundance-based selection
  + use_percentage parameter: used in trans_abund$new to control whether the abundance is shown as a percentage
  

## trans_venn class

The trans_venn class is developed for venn analysis, i.e. shared and unique taxa across samples/groups.

### Example

This part can be performed using samples or groups at OTU/ASV level or higher taxonomic level.
To analyze the unique and shared OTUs of groups,
we first merge samples according to the "Group" column of sample_table.

```{r, echo = TRUE, eval = FALSE}
# merge samples as one community for each group
dataset1 <- dataset$merge_samples(use_group = "Group")
# dataset1 is a new microtable object
# create trans_venn object
t1 <- trans_venn$new(dataset1, ratio = NULL)
t1$plot_venn()
```

```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_0.png")
```

```{r, echo = TRUE, eval = FALSE}
# create venn plot with more information
t1 <- trans_venn$new(dataset1, ratio = "seqratio")
t1$plot_venn()
# The integer is OTU number
# The percentage data is the sequence number/total sequence number
```

```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_1.png")
```
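
The two numbers shown in each part of the Venn diagram can be reproduced by hand on a toy table. The sketch below is a base R illustration under assumed data: it labels each OTU by the set of groups in which it occurs, counts the OTUs per part and divides the sequences of each part by the total. The column names `Counts` and `Seq_ratio` are hypothetical and are not claimed to match `data_summary`.

```{r, echo = TRUE, eval = FALSE}
# toy merged OTU table: rows are OTUs, columns are groups (sequence counts)
otu <- matrix(
  c(10,  0,  0,
     5,  5,  0,
     0, 20,  0,
    30, 10, 10,
     0,  0, 10),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:5), c("IW", "CW", "TW"))
)
# label each OTU by the set of groups in which it is present
part <- apply(otu > 0, 1, function(x) paste(colnames(otu)[x], collapse = "&"))
# number of OTUs in each part of the Venn diagram
otu_number <- table(part)
# sequences of each part divided by the total number of sequences
seq_ratio <- tapply(rowSums(otu), part, sum) / sum(otu)
data.frame(
  Counts = as.vector(otu_number),
  Seq_ratio = round(100 * as.vector(seq_ratio), 1),
  row.names = names(otu_number)
)
```

In this toy case the part shared by all three groups contains a single OTU but half of all sequences, which is the kind of contrast the ratio is meant to reveal.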

When there are too many groups to show in a venn plot, a petal plot is a better choice.

```{r, echo = TRUE, eval = FALSE}
# use "Type" column in sample_table
dataset1 <- dataset$merge_samples(use_group = "Type")
t1 <- trans_venn$new(dataset1)
t1$plot_venn(petal_plot = TRUE, petal_center_size = 50, petal_r = 1.5, petal_a = 3, petal_move_xy = 3.8, petal_color_center = "#BEBADA")
```

```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_2.png")
```

Another way to plot the results is the plot_bar function, which is especially useful for a large number of samples/groups.
This type of plot is generally called an UpSet plot.

```{r, echo = TRUE, eval = FALSE}
tmp <- dataset$merge_samples(use_group = "Type")
tmp
t1 <- trans_venn$new(dataset = tmp)
# only show some sets with large intersection numbers
t1$data_summary %<>% .[.[, 1] > 20, ]
t1$plot_bar()
```

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_venn_bar_type.png")
```

Generally, after getting the intersection results, we do not know which taxa are shared or unique.
The composition of the unique or shared species may account for the different and similar ecological characteristics across groups [@Mendes_Deciphering_2011].
So, it is interesting to further analyze the composition of unique and shared species. 
To do this, we first transform the results of the venn plot into a traditional feature-sample table, that is, another object of the microtable class.

```{r, echo = TRUE, eval = TRUE}
dataset1 <- dataset$merge_samples(use_group = "Group")
t1 <- trans_venn$new(dataset1)
# transform venn results to the sample-species table; abundance is not considered here, only presence/absence
t2 <- t1$trans_comm(use_frequency = TRUE)
# t2 is a new microtable object; each part is treated as a sample
class(t2)
```

We use bar plot to show the composition at the Genus level.

```{r, echo = TRUE, eval = FALSE}
# calculate taxa abundance, that is, the frequency
t2$cal_abund()
# transform and plot
t3 <- trans_abund$new(dataset = t2, taxrank = "Genus", ntaxa = 8)
t3$plot_bar(bar_type = "part", legend_text_italic = TRUE, ylab_title = "Frequency (%)", xtext_type_hor = FALSE, color_values = RColorBrewer::brewer.pal(8, "Set2"),
	order_x = c("IW", "CW", "TW", "IW&CW", "IW&TW", "CW&TW", "IW&CW&TW"))
```


```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_bar.png")
```

We can also use pie charts to show the composition at the Phylum level.

```{r, echo = TRUE, eval = FALSE}
t3 <- trans_abund$new(dataset = t2, taxrank = "Phylum", ntaxa = 8)
t3$plot_pie(facet_nrow = 3, color_values = rev(c(RColorBrewer::brewer.pal(8, "Dark2"), "grey50")))
```

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_venn_pie.png")
```

### Key points

  + ratio parameter: the ratio parameter in trans_venn$new controls whether and what content appears below the taxa number in the venn plot
  + return data: trans_venn$new() stores data_details and data_summary in the trans_venn object for further plotting


<!--chapter:end:03-Composition-based_class.Rmd-->

# Diversity-based class

Diversity is one of the core topics in community ecology.
It refers to alpha diversity, beta diversity and gamma diversity.


## trans_alpha class

　Alpha diversity can be transformed and visualized using trans_alpha class.
Creating the object of trans_alpha class can invoke the alpha_diversity data stored in the microtable object.


### Example

Creating a trans_alpha object returns two data.frames with the prefix 'data_': `data_alpha` and `data_stat`.
The data_alpha is used for the subsequent differential tests and visualization.


```{r, echo = TRUE, eval = FALSE}
t1 <- trans_alpha$new(dataset = dataset, group = "Group")
# return t1$data_stat
t1$data_stat[1:5, ]
```

```{r, echo = FALSE}
t1 <- trans_alpha$new(dataset = dataset, group = "Group")
pander::pander(t1$data_stat[1:5, ])
```

Then, we test the differences among groups using the Kruskal-Wallis Rank Sum Test (overall test when groups > 2), Wilcoxon Rank Sum Tests (for paired groups),
Dunn's Kruskal-Wallis Multiple Comparisons (for paired groups when groups > 2) and ANOVA with multiple comparisons.

```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "KW")
# return t1$res_diff
t1$res_diff[1:5, ]
```

```{r, echo = FALSE}
suppressWarnings(t1$cal_diff(method = "KW"))
pander::pander(t1$res_diff[1:5, c(1:2, 4:7)])
```

```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "KW_dunn")
# return t1$res_diff
t1$res_diff[1:5, ]
```

```{r, echo = FALSE}
t1$cal_diff(method = "KW_dunn")
pander::pander(t1$res_diff[1:5, c(1, 3:8)])
```

```{r, echo = TRUE, eval = FALSE}
# more options
t1$cal_diff(method = "wilcox")
t1$cal_diff(method = "t.test")
```
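
For readers who want to see what such tests do outside the class, here is a minimal base R sketch on simulated alpha diversity values. The data, the group labels and the choice of "fdr" adjustment are assumptions made for illustration; the sketch uses `kruskal.test` and `pairwise.wilcox.test` from the stats package and does not claim to reproduce the exact procedure of `cal_diff`.

```{r, echo = TRUE, eval = FALSE}
# toy alpha diversity values for three hypothetical groups
set.seed(1)
alpha <- data.frame(
  Group = rep(c("CW", "IW", "TW"), each = 6),
  Shannon = c(rnorm(6, 5.0, 0.2), rnorm(6, 5.5, 0.2), rnorm(6, 6.0, 0.2))
)
# overall test across all groups
kw <- kruskal.test(Shannon ~ Group, data = alpha)
kw$p.value
# pairwise Wilcoxon rank sum tests with p value adjustment
pw <- pairwise.wilcox.test(alpha$Shannon, alpha$Group, p.adjust.method = "fdr")
pw$p.value
```

The overall test returns a single p value, whereas the pairwise test returns a lower-triangular matrix with one p value per pair of groups.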

Then, let's try to use anova.

```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "anova")
# return t1$res_diff
t1$res_diff
```

```{r, echo = FALSE}
t1$cal_diff(method = "anova")
pander::pander(t1$res_diff)
```

The multi-factor analysis of variance is also supported with the `formula` parameter, such as two-way anova. 

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_alpha$new(dataset = dataset, group = "Group")
t1$cal_diff(method = "anova", formula = "Group+Type")
head(t1$res_diff)
# see the help document for the usage of formula
```
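
As a point of reference for the formula "Group+Type", the following base R sketch fits an additive two-way ANOVA on simulated data with `aov`. The data frame, the factor levels ("T1", "T2") and the response values are hypothetical; how `cal_diff` processes the formula internally is not shown here.

```{r, echo = TRUE, eval = FALSE}
set.seed(2)
# toy data: 24 samples, three groups crossed with two hypothetical types
alpha <- data.frame(
  Group = rep(c("CW", "IW", "TW"), each = 8),
  Type = rep(c("T1", "T2"), times = 12),
  Chao1 = rnorm(24, mean = 1500, sd = 100)
)
# additive two-way model: main effects of Group and Type, no interaction
fit <- aov(Chao1 ~ Group + Type, data = alpha)
summary(fit)
```

Replacing `+` with `*` in the formula would add the interaction term, following the usual R formula syntax.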


The plot_alpha function adds the significance labels by looking up the results in **object$res_diff** instead of calculating the significance again.
Now, let us plot the mean and standard error (SE) of alpha diversity for each group, and add the anova result.

```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "anova")
t1$plot_alpha(measure = "Chao1")
t1$plot_alpha(measure = "Chao1", order_x_mean = TRUE, add_sig_text_size = 6)
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_alpha_letter.png")
```

```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "wilcox")
t1$plot_alpha(measure = "Chao1", shape = "Group")
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_alpha_wilcox.png")
```

Let's remove the 'ns' labels by filtering object$res_diff.

```{r, echo = TRUE, eval = FALSE}
t1$res_diff %<>% base::subset(Significance != "ns")
t1$plot_alpha(measure = "Chao1", xtext_size = 15)
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_alpha_wilcox_nons.png")
```

From v0.12.0, the trans_alpha class supports differential tests among groups within each level of another grouping variable, which is set with the by_group parameter.

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_alpha$new(dataset = dataset, group = "Type", by_group = "Group")
t1$cal_diff(method = "wilcox")
t1$plot_alpha(measure = "Shannon")
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_alpha_wilcox_bygroup.png")
```


The Scheirer-Ray-Hare test is a nonparametric test that is suitable for a two-way factorial experiment.

```{r, echo = TRUE, eval = FALSE}
# require rcompanion package to be installed
t1$cal_diff(method = "scheirerRayHare", formula = "Group+Type")
```

### Key points

  + trans_alpha$new: creating trans_alpha object can invoke alpha_diversity in microtable for transformation
  + cal_diff: formula parameter can be used for multi-factor analysis of variance
  + plot_alpha: the significance label comes from the object$res_diff


## trans_beta class

　The trans_beta class is developed for the beta diversity analysis, i.e. the dissimilarities among samples.
Beta diversity can be defined in different forms [@Tuomisto_diversity_2010] and can be explored in different ways [@Anderson_Navigating_2011].
We encapsulate some commonly used approaches in microbial ecology [@Ramette_Multivariate_2007].
Note that the parts of beta diversity analysis related to environmental factors are placed in the trans_env class.
The distance matrix in the beta_diversity list of the microtable object is invoked for transformation and plotting by the trans_beta class when needed.
The beta diversity analyses in this class mainly include ordination, group distance, clustering and manova.


### Example

We first show the ordination using PCoA.

```{r, echo = TRUE, eval = TRUE}
# we first create a trans_beta object
# measure parameter can invoke the distance matrix in dataset$beta_diversity
t1 <- trans_beta$new(dataset = dataset, group = "Group", measure = "bray")
```

```{r, echo = TRUE, eval = FALSE}
# use PCoA as an example, PCA or NMDS is also available
t1$cal_ordination(ordination = "PCoA")
# t1$res_ordination is the ordination result list
class(t1$res_ordination)
# plot the PCoA result with confidence ellipse
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "ellipse"))
```

```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_ordination.png")
```
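
To show what sits behind a PCoA of Bray-Curtis distances, the sketch below computes the dissimilarity by hand on a toy table and passes it to `cmdscale` from the stats package, which performs classical multidimensional scaling. The table and object names are hypothetical, and this is a conceptual illustration rather than the code path used by `cal_ordination`.

```{r, echo = TRUE, eval = FALSE}
# toy abundance table: rows are samples, columns are taxa
comm <- matrix(
  c(10, 0, 5,
     8, 2, 6,
     0, 9, 1,
     1, 7, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("OTU", 1:3))
)
# Bray-Curtis dissimilarity between two abundance vectors
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)
n <- nrow(comm)
d <- matrix(0, n, n, dimnames = list(rownames(comm), rownames(comm)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) d[i, j] <- bray(comm[i, ], comm[j, ])
}
# classical multidimensional scaling, i.e. PCoA
pcoa <- cmdscale(as.dist(d), k = 2, eig = TRUE)
# sample scores on the first two axes
pcoa$points
# share of the positive eigenvalues carried by each of the first two axes (%)
pos_eig <- pcoa$eig[pcoa$eig > 0]
round(100 * pos_eig[1:2] / sum(pos_eig), 1)
```

Samples S1 and S2 have similar compositions, as do S3 and S4, so the two pairs fall on opposite sides of the first axis.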

Try other interesting options in the plotting.

```{r, echo = TRUE, eval = FALSE}
t1$plot_ordination(plot_color = "Type", plot_type = "point")
t1$plot_ordination(plot_color = "Group", point_size = 5, point_alpha = .2, plot_type = c("point", "ellipse"), ellipse_chull_fill = FALSE)
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "centroid"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "ellipse", "centroid"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "chull"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "chull", "centroid"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("chull", "centroid"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "chull", "centroid"), add_sample_label = "SampleID")
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = "centroid")
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = "centroid", centroid_segment_alpha = 0.9, centroid_segment_size = 1, centroid_segment_linetype = 1)
t1$plot_ordination(plot_type = c("point", "centroid"), plot_color = "Type", centroid_segment_linetype = 1)
t1$plot_ordination(plot_color = "Saline", point_size = 5, point_alpha = .2, plot_type = c("point", "chull"), ellipse_chull_fill = FALSE, ellipse_chull_alpha = 0.1)
t1$plot_ordination(plot_color = "Group") + theme(panel.grid = element_blank()) + geom_vline(xintercept = 0, linetype = 2) + geom_hline(yintercept = 0, linetype = 2)
```


Then we plot and compare the group distances.

```{r, echo = TRUE, eval = FALSE}
# calculate and plot sample distances within groups
t1$cal_group_distance(within_group = TRUE)
# return t1$res_group_distance
# perform Wilcoxon Rank Sum and Signed Rank Tests
t1$cal_group_distance_diff(method = "wilcox")
t1$plot_group_distance(boxplot_add = "mean")
```

```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_group_distance_within.png")
```

```{r, echo = TRUE, eval = FALSE}
# calculate and plot sample distances between groups
t1$cal_group_distance(within_group = FALSE)
t1$cal_group_distance_diff(method = "wilcox")
t1$plot_group_distance(boxplot_add = "mean")
```

```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_group_distance_between.png")
```
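
The difference between within-group and between-group distances can be illustrated with a small base R sketch. It assumes a toy distance matrix and a hypothetical sample-to-group mapping, lists every pair of samples once and splits the pairs by whether both samples belong to the same group. The column names are invented for the example and are not those of `res_group_distance`.

```{r, echo = TRUE, eval = FALSE}
# toy distance matrix for four samples
d <- matrix(
  c(0.0, 0.2, 0.8, 0.9,
    0.2, 0.0, 0.7, 0.8,
    0.8, 0.7, 0.0, 0.3,
    0.9, 0.8, 0.3, 0.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("S", 1:4))
)
# hypothetical grouping of the four samples
group <- c(S1 = "CW", S2 = "CW", S3 = "IW", S4 = "IW")
# list every pair of samples once
pairs <- t(combn(rownames(d), 2))
res <- data.frame(
  Sample1 = pairs[, 1],
  Sample2 = pairs[, 2],
  Distance = d[pairs],
  stringsAsFactors = FALSE
)
# a pair is 'within' when both samples come from the same group
same_group <- group[res$Sample1] == group[res$Sample2]
within_dist <- res[same_group, ]
between_dist <- res[!same_group, ]
within_dist
between_dist
```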

A clustering plot is also frequently used.

```{r, echo = TRUE, eval = FALSE}
# use replace_name to set the label name, group parameter used to set the color
t1$plot_clustering(group = "Group", replace_name = c("Saline", "Type"))
```

```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_clustering.png")
```


PerMANOVA [@Anderson_Austral_2001] is often used to test for differences in distances among groups.

```{r, echo = TRUE, eval = FALSE}
# manova for all groups when manova_all = TRUE
t1$cal_manova(manova_all = TRUE)
t1$res_manova
```

```{r, echo = FALSE}
t1$cal_manova(manova_all = TRUE)
pander::pander(t1$res_manova)
```
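
The logic of the overall permutation test can be sketched in base R. The hypothetical helper `pseudo_f` below computes the pseudo-F statistic from a distance matrix and a grouping vector, and the null distribution is built by shuffling the group labels. The simulated data, the Euclidean distance and the 999 permutations are assumptions chosen for brevity; this is a teaching sketch, not the routine called by `cal_manova`.

```{r, echo = TRUE, eval = FALSE}
# pseudo-F statistic from a distance matrix and a grouping vector
pseudo_f <- function(d, group) {
  d2 <- as.matrix(d)^2
  n <- nrow(d2)
  a <- length(unique(group))
  # total sum of squares: squared distances of all pairs divided by n
  ss_total <- sum(d2[upper.tri(d2)]) / n
  # within-group sum of squares, computed group by group
  ss_within <- sum(vapply(unique(group), function(g) {
    idx <- which(group == g)
    sub <- d2[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]) / length(idx)
  }, numeric(1)))
  ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))
}

set.seed(3)
# toy data: two hypothetical groups of five samples with shifted means
x <- rbind(matrix(rnorm(20, 0), 5), matrix(rnorm(20, 3), 5))
group <- rep(c("CW", "IW"), each = 5)
d <- dist(x)
f_obs <- pseudo_f(d, group)
# permute the group labels to build the null distribution
f_perm <- replicate(999, pseudo_f(d, sample(group)))
p_value <- (sum(f_perm >= f_obs) + 1) / (999 + 1)
c(F = f_obs, p = p_value)
```

Because the p value comes from permutations, it changes slightly between runs unless the random seed is fixed.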

The parameter manova_all = FALSE can be used to calculate significance for each paired group.

```{r, echo = TRUE, eval = FALSE}
# manova for each paired groups
t1$cal_manova(manova_all = FALSE)
t1$res_manova
```

```{r, echo = FALSE}
t1$cal_manova(manova_all = FALSE)
pander::pander(t1$res_manova)
```

The parameter manova_set has higher priority than manova_all. If manova_set is provided, manova_all parameter will be disabled.

```{r, echo = TRUE, eval = FALSE}
# manova for specified group set: such as "Group + Type"
t1$cal_manova(manova_set = "Group + Type")
t1$res_manova
```

```{r, echo = FALSE}
t1$cal_manova(manova_set = "Group + Type")
pander::pander(t1$res_manova)
```


PERMDISP [@Anderson_Navigating_2011] is also implemented to check the multivariate homogeneity of group dispersions (variances).

```{r, echo = TRUE}
# for the whole comparison and for each paired groups
t1$cal_betadisper()
t1$res_betadisper
```

For the explanation of statistical methods in microbial ecology, please read the references [@Ramette_Multivariate_2007; @Buttigieg_guide_2014].

### Key points

  + trans_beta$new: creating trans_beta object with measure parameter can invoke beta_diversity in microtable object for transformation
  + cal_ordination(): PCoA, PCA and NMDS approaches are all available
  + cal_manova(): cal_manova function can be used for paired comparisons, overall test and multi-factor test
  + plot_group_distance(): manipulating object$res_group_distance_diff can control what statistical results are presented in the plot.


<!--chapter:end:04-Diversity-based_class.Rmd-->

# Model-based class

We group all the classes with complex models into this section, 'Model-based class'.


## trans_diff class

　Differential abundance test is a very important part in the microbial community data analysis [@Nearing_Microbiome_2022].
It identifies the taxa that contribute significantly to community differences across groups.
There are also several wrapped methods to better capture the important biomarkers,
such as the combination of machine learning and differential abundance tests.
Currently, the trans_diff class provides several well-known approaches for this analysis: 
metastat [@White_Statistical_2009], LEfSe [@Segata_Metagenomic_2011], RF (random forest + differential test), metagenomeSeq [@Paulson_Differential_2013],
Kruskal-Wallis Rank Sum Test (for groups > 2), Wilcoxon Rank Sum Tests (for each paired group), 
Dunn's Kruskal-Wallis Multiple Comparisons (for paired groups when groups > 2),
t-test, ANOVA, ANCOMBC [@Lin_Analysis_2020] and ALDEx2 [@Fernandes_Unifying_2014].


### Example

All differential test results are stored in object$res_diff.
LEfSe combines the non-parametric test and linear discriminant analysis [@Segata_Metagenomic_2011].

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_diff$new(dataset = dataset, method = "lefse", group = "Group", alpha = 0.01, lefse_subgroup = NULL)
# see t1$res_diff for the result
# From v0.8.0, threshold is used for the LDA score selection.
t1$plot_diff_bar(threshold = 4)
# we show 30 taxa with the highest LDA (log10)
t1$plot_diff_bar(use_number = 1:30, width = 0.8, group_order = c("CW", "IW", "TW"))
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_lefse_bar.png")
```

```{r, echo = TRUE, eval = FALSE}
# show part of the table
t1$res_diff[1:5, c(1, 3, 4, 6)]
```

```{r, echo = FALSE, eval = TRUE}
res_lefse <- read.csv("Files/res_lefse.csv")
pander::pander(res_lefse[1:5, c(1, 3, 4, 6)])
```

Then, we plot the abundance of biomarkers detected by LEfSe.

```{r, echo = TRUE, eval = FALSE}
t1$plot_diff_abund(use_number = 1:30, group_order = c("CW", "IW", "TW"))
```

```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_lefse_diff_abund.png")
```

Then, we show the cladogram of the differential features in the taxonomic tree.
There are too many taxa in this dataset.
As an example, we select only the 200 most abundant taxa in the tree and 50 differential features.
We show the full taxonomic label only at the Phylum level and use letters at other levels to reduce text overlap.
**Note that if an error occurs in this function, the most likely reason is a chaotic taxonomy in the user's data**.
**Please see the tidy_taxonomy function in the microtable part to solve this issue.**

```{r, echo = TRUE, eval = FALSE}
# clade_label_level 5 represents the phylum level in this analysis
# require ggtree package
t1$plot_diff_cladogram(use_taxa_num = 200, use_feature_num = 50, clade_label_level = 5, group_order = c("CW", "IW", "TW"))
```

```{r, out.width = "1000px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_lefse_cladogram.png")
```

The taxonomic labels in the plot may be problematic.
When too many levels are used, the taxonomic labels may overlap heavily.
However, if only Phylum labels are shown, too many taxa in the legend are marked with letters.
In this case, the taxa to show can be chosen manually, as in the following operation.

```{r, echo = TRUE, eval = FALSE}
# choose some taxa according to the positions in the previous picture; those taxa labels have minimum overlap
use_labels <- c("c__Deltaproteobacteria", "c__Actinobacteria", "o__Rhizobiales", "p__Proteobacteria", "p__Bacteroidetes", 
	"o__Micrococcales", "p__Acidobacteria", "p__Verrucomicrobia", "p__Firmicutes", 
	"p__Chloroflexi", "c__Acidobacteria", "c__Gammaproteobacteria", "c__Betaproteobacteria", "c__KD4-96",
	"c__Bacilli", "o__Gemmatimonadales", "f__Gemmatimonadaceae", "o__Bacillales", "o__Rhodobacterales")
# then use parameter select_show_labels to show
t1$plot_diff_cladogram(use_taxa_num = 200, use_feature_num = 50, select_show_labels = use_labels)
# Now we can see that more taxa names appear in the tree
```

```{r, out.width = "1000px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_lefse_cladogram_1.png")
```

The 'rf' method depends on random forest [@Beck_Machine_2014; @Yatsunenko_Human_2012] and the non-parametric test.
The current method can calculate random forest by bootstrapping, like the method in LEfSe, and uses only the significant features.
MeanDecreaseGini is selected as the indicator value in the analysis.

```{r, echo = TRUE, eval = FALSE}
# use Genus level for parameter taxa_level, if you want to use all taxa, change to "all"
# nresam = 1 and boots = 1 represent no bootstrapping and use all samples directly
t1 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", taxa_level = "Genus")
# plot the MeanDecreaseGini bar
# group_order is designed to sort the groups
g1 <- t1$plot_diff_bar(use_number = 1:20, group_order = c("TW", "CW", "IW"))
# plot the abundance using the same taxa as in g1
g2 <- t1$plot_diff_abund(group_order = c("TW", "CW", "IW"), select_taxa = t1$plot_diff_bar_taxa)
# now the y axis in g1 and g2 is the same, so we can merge them
# remove g1 legend; remove g2 y axis text and ticks
g1 <- g1 + theme(legend.position = "none")
g2 <- g2 + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
gridExtra::grid.arrange(g1, g2, ncol = 2, nrow = 1, widths = c(2, 1.7))
```

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_rf_diff_abund.png")
```

Significance labels can also be added to the abundance plot with the add_sig parameter and other related parameters.
Adding labels is now supported for all the differential test methods.

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_diff$new(dataset = dataset, method = "wilcox", group = "Group", taxa_level = "Genus", filter_thres = 0.001)
# filter out the results that do not need to be shown
t1$res_diff %<>% subset(Significance %in% "***")
t1$plot_diff_abund(use_number = 1:20, add_sig = TRUE, add_sig_label = "Significance")
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_diff_abund_siglabel_wilcox.png")
```

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_diff$new(dataset = dataset, method = "anova", group = "Group", taxa_level = "Genus", filter_thres = 0.001)
t1$plot_diff_abund(use_number = 1:10, add_sig = TRUE, coord_flip = FALSE)
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_diff_abund_siglabel_anova.png")
```

Metastat depends on permutations and the t-test, and performs well on sparse data in paired-group tests.

```{r, echo = TRUE, eval = FALSE}
# metastat analysis at Genus level
t1 <- trans_diff$new(dataset = dataset, method = "metastat", group = "Group", taxa_level = "Genus")
# t1$res_diff is the differential test result
# t1$res_abund is the group abundance
```

Because the example 'Group' in sample_table has three groups, 
metastat runs the comparison for each pair of groups, so there are three pairs in t1\$res_diff\$Comparison.
For the abundance plot, the user should use select_group to select the required pair.

```{r, echo = TRUE, eval = FALSE}
# select_group should be one of groups in t1$res_diff$Comparison
t1$plot_diff_abund(use_number = 1:20, select_group = "CW - TW")
```

```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_metastat_1.png")
```


The following are the examples for the methods 'KW', 'KW_dunn', 'wilcox', 't.test' and 'anova'.

```{r, echo = TRUE, eval = FALSE}
# Kruskal-Wallis Rank Sum Test for all groups (>= 2)
t1 <- trans_diff$new(dataset = dataset, method = "KW", group = "Group", taxa_level = "all", filter_thres = 0.001)
t1$plot_diff_abund(use_number = 1:20)
# Dunn's Kruskal-Wallis Multiple Comparisons when group number > 2; require FSA package
t1 <- trans_diff$new(dataset = dataset, method = "KW_dunn", group = "Group", taxa_level = "Genus", filter_thres = 0.0001)
t1$plot_diff_abund(use_number = 1:20, select_group = "CW - IW", group_order = c("CW", "IW"))
# Wilcoxon Rank Sum and Signed Rank Tests for all paired groups
t1 <- trans_diff$new(dataset = dataset, method = "wilcox", group = "Group", taxa_level = "Genus", filter_thres = 0.001)
t1$plot_diff_bar(use_number = 1:20, select_group = "CW - TW")
# t.test
t1 <- trans_diff$new(dataset = dataset, method = "t.test", group = "Group", taxa_level = "all", filter_thres = 0.001)
# anova
t1 <- trans_diff$new(dataset = dataset, method = "anova", group = "Group", taxa_level = "Phylum", filter_thres = 0.001)
t1$res_diff
```

The methods 'metagenomeSeq' and 'ANCOMBC' depend on the metagenomeSeq and ANCOMBC packages, respectively.
The methods 'ALDEx2_t' and 'ALDEx2_kw' depend on the ALDEx2 package.
All three packages are hosted on Bioconductor.

```{r, echo = TRUE, eval = FALSE}
# zero-inflated log-normal model-based differential test method from metagenomeSeq package
# please first install metagenomeSeq package
if(!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("metagenomeSeq")
t1 <- trans_diff$new(dataset = dataset, method = "metagenomeSeq", group = "Group", taxa_level = "Genus")
t1 <- trans_diff$new(dataset = dataset, method = "metagenomeSeq", group = "Group", taxa_level = "OTU")
t1$plot_diff_abund(use_number = 1:30, group_order = c("TW", "CW", "IW"))
t1$plot_diff_bar(use_number = 1:20)
```

```{r, echo = TRUE, eval = FALSE}
# 'ALDEx2_t' and 'ALDEx2_kw' methods; use ?trans_diff to see the description of the methods
BiocManager::install("ALDEx2")
# 'ALDEx2_t'
t1 <- trans_diff$new(dataset = dataset, method = "ALDEx2_t", group = "Group", taxa_level = "Phylum")
t1$plot_diff_abund(use_number = 1:20, group_order = c("TW", "CW", "IW"))
t1$plot_diff_abund(use_number = 1:20, select_group = "CW - TW")
t1$plot_diff_abund(use_number = 1:20, select_group = "CW - TW", add_sig = TRUE)
t1 <- trans_diff$new(dataset = dataset, method = "ALDEx2_t", group = "Group", taxa_level = "OTU", filter_thres = 0.0005)
# ALDEx2_kw
t1 <- trans_diff$new(dataset = dataset, method = "ALDEx2_kw", group = "Group", taxa_level = "Phylum")
t1$plot_diff_abund(use_number = 1:30, group_order = c("TW", "CW", "IW"))
t1$plot_diff_bar(use_number = 1:20)
t1$plot_diff_abund(use_number = 1:30, group_order = c("TW", "CW", "IW"), add_sig = TRUE)
```


```{r, echo = TRUE, eval = FALSE}
# ANCOMBC method
BiocManager::install("ANCOMBC")
t1 <- trans_diff$new(dataset = dataset, method = "ANCOMBC", group = "Group", taxa_level = "Family")
t1$plot_diff_abund(use_number = 1:20, select_group = "CW - TW")
t1$plot_diff_abund(use_number = 1:20, group_order = c("TW", "CW", "IW"), add_sig = TRUE)
t1$plot_diff_bar(use_number = 1:20)
```


### Key points

  + trans_diff\$new: In trans_diff\$new, p_adjust_method = NULL turns off the p value adjustment for methods other than "metastat" and "anova". 
    This is useful when very few significant taxa are found (generally no significant taxa are found after adjustment) and 
    when the results need to be compared with those from the Galaxy server or another Python version of LEfSe.
  + trans_diff$new: this class has strict requirements on the taxonomic information; make sure the `tidy_taxonomy()` function has been applied to dataset$tax_table
  + trans_diff$new: creating this class will invoke taxa_abund stored in microtable object
  + trans_diff$plot_diff_cladogram: clade_label_size, clade_label_size_add and clade_label_size_log are used to control the text size in the cladogram
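
To illustrate why p value adjustment matters when many taxa are tested, the sketch below runs one Wilcoxon rank sum test per taxon on simulated data and adjusts the p values with `p.adjust` from the stats package. The genus names, the effect size and the "fdr" method are assumptions for the example; the sketch is not the internal workflow of trans_diff.

```{r, echo = TRUE, eval = FALSE}
set.seed(4)
# toy abundances: 6 hypothetical genera in 2 groups of 8 samples
group <- rep(c("CW", "TW"), each = 8)
abund <- sapply(1:6, function(i) {
  # only the first two genera differ between the groups
  shift <- if (i <= 2) 2 else 0
  c(rnorm(8, 5), rnorm(8, 5 + shift))
})
colnames(abund) <- paste0("g__Genus", 1:6)
# one Wilcoxon rank sum test per taxon
p_raw <- apply(abund, 2, function(x) wilcox.test(x ~ group)$p.value)
# adjust for multiple testing; the "fdr" choice here is only an example
p_adj <- p.adjust(p_raw, method = "fdr")
data.frame(p_raw = p_raw, p_adj = p_adj)
```

The adjusted values are never smaller than the raw ones, which is why turning the adjustment off can recover taxa when very few pass the threshold.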


## trans_network class

　Network is a frequently used approach to study the co-occurrence patterns in microbial ecology [@Deng_Molecular_2012; @Faust_Microbial_2012; @Coyte_Theecology_2015].
In this part, we describe part of the implemented methods in the trans_network class.


```{r, out.width = "8000px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/network_framework.png")
```

In the figure, the objects inside solid-line rectangles represent functions.
The dashed lines denote the key objects (input or output). The `res_network` inside the dashed ellipse is a hub object used by the other analyses.


### Example

We use the correlation-based network to show the main operations. 
This is only intended to demonstrate the operations conveniently. 
**It does not mean that we recommend this approach in every case**.
Please check the final part for other network construction methods.

```{r, echo = TRUE, eval=FALSE}
# The parameter cor_method in trans_network is used for selecting the correlation calculation method.
# by default, Pearson or Spearman correlation invokes base R cor.test, which is a little slow
t1 <- trans_network$new(dataset = dataset, cor_method = "spearman", filter_thres = 0.001)
# return t1$res_cor_p list, containing two tables: correlation coefficient table and p value table
```
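
The two tables mentioned above, a coefficient table and a p value table, can be pictured with a small base R sketch that loops over taxon pairs with `cor.test`. The simulated counts, the object names (`cor_mat`, `p_mat`) and the thresholds are assumptions for illustration only and do not describe the structure of `res_cor_p` in detail.

```{r, echo = TRUE, eval = FALSE}
set.seed(5)
# toy abundance table: rows are samples, columns are hypothetical OTUs
abund <- matrix(rpois(20 * 4, lambda = 20), nrow = 20,
  dimnames = list(paste0("S", 1:20), paste0("OTU", 1:4)))
# make OTU2 follow OTU1 so that one strong association exists
abund[, "OTU2"] <- abund[, "OTU1"] + rpois(20, lambda = 2)
n_taxa <- ncol(abund)
cor_mat <- diag(n_taxa)
p_mat <- matrix(0, n_taxa, n_taxa)
dimnames(cor_mat) <- dimnames(p_mat) <- list(colnames(abund), colnames(abund))
for (i in seq_len(n_taxa - 1)) {
  for (j in (i + 1):n_taxa) {
    # exact = FALSE avoids warnings caused by tied counts
    res <- cor.test(abund[, i], abund[, j], method = "spearman", exact = FALSE)
    cor_mat[i, j] <- cor_mat[j, i] <- unname(res$estimate)
    p_mat[i, j] <- p_mat[j, i] <- res$p.value
  }
}
# keep pairs with a strong and significant correlation (arbitrary thresholds)
strong <- abs(cor_mat) > 0.6 & p_mat < 0.01 & upper.tri(cor_mat)
which(strong, arr.ind = TRUE)
```

The number of tests grows quadratically with the number of taxa, which is one reason why filtering low-abundance taxa with `filter_thres` before computing correlations saves time.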

Other correlation/association options are also available, 
such as Bray–Curtis (1-dissimilarity), SparCC [@Friedman_Inferring_2012], CCLasso [@Fang_CCLasso_2015], 
and Pearson or Spearman with data normalization based on the NetCoMi package [@Peschel_NetCoMi_2021].

```{r, echo = TRUE, eval=FALSE}
# use Bray–Curtis index (1-dissimilarity)
t1 <- trans_network$new(dataset = dataset, cor_method = "bray", filter_thres = 0.001)
# Pearson correlation
t1 <- trans_network$new(dataset = dataset, cor_method = "pearson", filter_thres = 0.001)
# Pearson correlation using WGCNA package
# install WGCNA package
if(!require("WGCNA")) install.packages("WGCNA", repos = BiocManager::repositories())
t1 <- trans_network$new(dataset = dataset, cor_method = "pearson", use_WGCNA_pearson_spearman = TRUE, filter_thres = 0.001)
# Pearson correlation using NetCoMi package; install it from https://github.com/stefpeschel/NetCoMi
t1 <- trans_network$new(dataset = dataset, cor_method = "pearson", use_NetCoMi_pearson_spearman = TRUE, filter_thres = 0.001)
# Spearman correlation using WGCNA package
t1 <- trans_network$new(dataset = dataset, cor_method = "spearman", use_WGCNA_pearson_spearman = TRUE, filter_thres = 0.001)
# Spearman correlation using NetCoMi package
t1 <- trans_network$new(dataset = dataset, cor_method = "spearman", use_NetCoMi_pearson_spearman = TRUE, filter_thres = 0.001)
# SparCC method, from SpiecEasi package, see https://github.com/zdk123/SpiecEasi for the installation
t1 <- trans_network$new(dataset = dataset, cor_method = "sparcc", use_sparcc_method = "SpiecEasi", filter_thres = 0.001)
# SparCC method, from NetCoMi package; https://github.com/stefpeschel/NetCoMi
t1 <- trans_network$new(dataset = dataset, cor_method = "sparcc", use_sparcc_method = "NetCoMi", filter_thres = 0.001)
# CCLasso method based on NetCoMi package
t1 <- trans_network$new(dataset = dataset, cor_method = "cclasso", filter_thres = 0.001)
# CCREPE method based on NetCoMi package
t1 <- trans_network$new(dataset = dataset, cor_method = "ccrepe", filter_thres = 0.001)
```

We use Spearman correlation based on the WGCNA package (use_WGCNA_pearson_spearman = TRUE) to show all the following operations.

```{r, echo = TRUE, eval=FALSE}
# require WGCNA package; installation command: install.packages("WGCNA", repos = BiocManager::repositories())
t1 <- trans_network$new(dataset = dataset, cor_method = "spearman", use_WGCNA_pearson_spearman = TRUE, filter_thres = 0.0001)
```

The parameter COR_cut can be used to select the correlation threshold. 
Alternatively, COR_optimization = TRUE can be used to find the optimized coefficient threshold (the potential transition point of network eigenvalues) 
based on random matrix theory (RMT) [@Deng_Molecular_2012] instead of using COR_cut.

```{r, echo = TRUE, eval=FALSE}
# construct network; require igraph package
t1$cal_network(COR_p_thres = 0.01, COR_optimization = TRUE)
# use an arbitrary coefficient threshold to construct the network
t1$cal_network(COR_p_thres = 0.01, COR_cut = 0.7)
# return t1$res_network
```
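
The thresholding idea behind a fixed coefficient cutoff can be sketched in base R as follows. This is a simplified illustration under stated assumptions: the two tables are made-up values, the names `cor_cut` and `p_thres` are local to the sketch, and whether the package uses strict or non-strict inequalities at the thresholds is not checked here.

```r
# toy correlation and p value tables for 4 features (values are made up)
feat <- paste0("OTU", 1:4)
cor_table <- matrix(c(
	 1.00,  0.85, -0.75,  0.20,
	 0.85,  1.00,  0.10,  0.72,
	-0.75,  0.10,  1.00, -0.30,
	 0.20,  0.72, -0.30,  1.00), nrow = 4, byrow = TRUE, dimnames = list(feat, feat))
p_table <- matrix(c(
	0.000, 0.001, 0.004, 0.400,
	0.001, 0.000, 0.700, 0.030,
	0.004, 0.700, 0.000, 0.200,
	0.400, 0.030, 0.200, 0.000), nrow = 4, byrow = TRUE, dimnames = list(feat, feat))
# keep an edge only if it passes both the coefficient and the p value threshold
cor_cut <- 0.7
p_thres <- 0.01
adj <- (abs(cor_table) >= cor_cut & p_table < p_thres) * 1
diag(adj) <- 0
# edge list with the sign of the association as the edge label
idx <- which(adj == 1 & upper.tri(adj), arr.ind = TRUE)
edge_table <- data.frame(
	node1 = feat[idx[, "row"]],
	node2 = feat[idx[, "col"]],
	label = ifelse(cor_table[idx] > 0, "+", "-")
)
adj
edge_table
```

In this toy case the pair OTU2 and OTU4 passes the coefficient cutoff but not the p value threshold, so it does not become an edge.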

```{r, echo = TRUE, eval=FALSE}
# calculate the modules of this undirected network with the igraph cluster_fast_greedy function
t1$cal_module(method = "cluster_fast_greedy")
```

```{r, echo = TRUE, eval=FALSE}
# save network
# open network.gexf file using Gephi(https://gephi.org/)
# require rgexf package
t1$save_network(filepath = "network.gexf")
```

In the current version of the microeco package, the node properties contain no abundance information.
If the user wants to use the abundance of features to scale node size in the Gephi network plot,
please check the solution in GitHub Issues (https://github.com/ChiLiubio/microeco/issues/152).
Then, we plot the network in Gephi and color the nodes according to the calculated modules.

```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/network1_spearman.png")
```

Now, we color the nodes by Phylum and the edges by positive and negative correlations.
All the data used have been stored in the network.gexf file, including module classifications, Phylum information and edge labels.

```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/network2_spearman.png")
```


```{r, echo = TRUE, eval = FALSE}
# calculate network attributes
t1$cal_network_attr()
t1$res_network_attr
```

```{r, echo = FALSE}
# read
res_network_attr <- read.csv("Files/res_network_attr.csv", check.names = FALSE)
```
```{r, echo = FALSE}
pander::pander(res_network_attr)
```

The functions get_node_table, get_edge_table and get_adjacency_matrix are designed to 
get the node properties table, edge properties table and adjacency matrix from the network, respectively.

```{r, echo = TRUE, eval = FALSE}
# get node properties
t1$get_node_table(node_roles = TRUE)
# return t1$res_node_table
```

```{r, echo = FALSE}
# read
network_node_table <- read.csv("Files/res_node_table.csv", row.names = 1, check.names = FALSE)
```

```{r, echo = FALSE}
pander::pander(network_node_table[1:5, c(1:6)])
```

```{r, echo = TRUE, eval = FALSE}
# get edge properties
t1$get_edge_table()
# return t1$res_edge_table 
t1$get_adjacency_matrix()
# return t1$res_adjacency_matrix
```

Then, let’s plot the node classification in terms of the within-module connectivity and among-module connectivity.

```{r, echo = TRUE, eval = FALSE}
# add_label = TRUE can be used to directly add text label for points
t1$plot_taxa_roles(use_type = 1)
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_taxa_roles.png")
```
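
As a rough illustration of the two axes in such a plot, the base R sketch below computes a within-module connectivity (Zi) and an among-module connectivity (Pi) for a made-up network, following the commonly used definitions of a z-scored within-module degree and a participation coefficient. The network, the module assignment and the use of the sample standard deviation from `sd` are assumptions of this sketch; the values returned by the package may be computed differently.

```r
# toy undirected network: 7 nodes in two modules (made-up data)
nodes <- paste0("N", 1:7)
module <- setNames(rep(c("M1", "M2"), times = c(4, 3)), nodes)
edges <- rbind(c("N1", "N2"), c("N1", "N3"), c("N1", "N4"), c("N2", "N3"),
	c("N5", "N6"), c("N5", "N7"), c("N4", "N5"))
adj <- matrix(0, 7, 7, dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1
# number of links from each node (rows) to each module (columns)
k_module <- t(rowsum(adj, group = module))
degree <- rowSums(adj)
# among-module connectivity: Pi = 1 - sum over modules of (k_im / k_i)^2
Pi <- 1 - rowSums((k_module / degree)^2)
# within-module connectivity: Zi = z-score of the within-module degree
k_within <- k_module[cbind(nodes, module)]
Zi <- (k_within - ave(k_within, module)) / ave(k_within, module, FUN = sd)
data.frame(module, degree, Zi = round(Zi, 3), Pi = round(Pi, 3))
```

Nodes N4 and N5 carry the only link between the two modules, so they are the only nodes with Pi above zero in this toy network.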

```{r, echo = TRUE, eval = FALSE}
# plot node roles with phylum information
t1$plot_taxa_roles(use_type = 2)
```

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_taxa_roles_2.png")
```

Now, we show the eigengene analysis of modules.
The eigengene of a module, i.e. the first principal component of a PCA on the abundances of the species in the module, represents the main variance of abundance among the species of the module.

```{r, echo = TRUE, eval = FALSE}
t1$cal_eigen()
# return t1$res_eigen
```
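
The idea of an eigengene can be sketched with base R `prcomp` on a simulated module. The data, the standardization choice (`center = TRUE, scale. = TRUE`) and the object names are assumptions made for this illustration; the preprocessing inside cal_eigen may differ.

```r
# toy abundance table of one module: 8 samples x 5 features (simulated)
set.seed(123)
gradient <- seq(-1, 1, length.out = 8)
abund <- sapply(1:5, function(i) gradient * runif(1, 0.5, 2) + rnorm(8, sd = 0.2))
dimnames(abund) <- list(paste0("S", 1:8), paste0("OTU", 1:5))
# PCA on standardized abundances; samples are the observations
pca <- prcomp(abund, center = TRUE, scale. = TRUE)
# the module eigengene is the sample score on the first principal component
eigengene <- pca$x[, "PC1"]
# proportion of the total variance explained by PC1
var_explained <- pca$sdev[1]^2 / sum(pca$sdev^2)
eigengene
var_explained
```

The result is one value per sample, which is why eigengenes of several modules can be used like an abundance table in downstream correlation analysis.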

Then we plot a correlation heatmap to show the relationships between eigengenes and environmental factors.

```{r, echo = TRUE, eval = FALSE}
# create trans_env object
t2 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
# calculate correlations
t2$cal_cor(add_abund_table = t1$res_eigen)
# plot the correlation heatmap
t2$plot_cor()
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/Env_module_eigen.png")
```

The subset_network() function can be used to extract a subset of nodes and the edges among these nodes from the network.
In this function, provide the required nodes with the node parameter, or select edges by their label with the edge parameter.

```{r, echo = TRUE, eval = FALSE}
# extract a sub network that contains all nodes in module M1
t1$subset_network(node = t1$res_node_table %>% base::subset(module == "M1") %>% rownames, rm_single = TRUE)
# return a new network with igraph class
# extract sub network in which all edge labels are "+", i.e. positive edges
t1$subset_network(edge = "+")
```

**Then let's show how to extract sub-network for samples and the whole analysis flow.**

```{r, echo = TRUE, eval = FALSE}
# extract the sub-network of sample 'S1'
sub1 <- t1$subset_network(node = dataset$otu_table %>% .[.[, "S1"] != 0, ] %>% rownames, rm_single = TRUE)
# see https://chiliubio.github.io/microeco_tutorial/notes.html#clone for the 'clone' function explanation
t2 <- clone(t1)
t2$res_network <- sub1
# then t2 has a network for 'S1' and can be used for further analysis
t2$cal_module()
t2$save_network("S1.gexf")
# please use a loop for more samples
```

The function `plot_network` can plot the network directly in R, either as a static network or as a dynamic network.
A static network is suitable when there are relatively few nodes, while a dynamic network works better for a large network.
See https://yunranchen.github.io/intro-net-r/advanced-network-visualization.html and https://kateto.net/network-visualization for more
details on network visualization in R.

```{r, echo = TRUE, eval = FALSE}
# the default method uses the igraph plot.igraph function
t2$plot_network()
# use ggraph method; require ggraph package
# If ggraph is not installed, first install it with the command: install.packages("ggraph")
t2$plot_network(method = "ggraph", node_color = "Phylum")
# use networkD3 package method for the dynamic network visualization in R
# If networkD3 is not installed, first install it with the command: install.packages("networkD3")
t1$plot_network(method = "networkD3", node_color = "module")
t1$plot_network(method = "networkD3", node_color = "Phylum")
```

The trans_comm function can be used to convert the node classification into a new microtable object for other analyses.

```{r, echo = TRUE, eval = FALSE}
# use_col is used to select a column of t1$res_node_table
tmp <- t1$trans_comm(use_col = "module", abundance = FALSE)
tmp
tmp$otu_table[tmp$otu_table > 0] <- 1
tmp$tidy_dataset()
tmp$cal_abund()
tmp2 <- trans_abund$new(tmp, taxrank = "Phylum", ntaxa = 10)
tmp2$data_abund$Sample %<>% factor(., levels = rownames(tmp$sample_table))
tmp2$plot_line(xtext_type_hor = TRUE, color_values = RColorBrewer::brewer.pal(12, "Paired")) + ylab("OTUs ratio (%)")
```

The function cal_sum_links() sums the number of links (edges) from one taxon to another or within the same taxon. 
The function plot_sum_links() shows the result of cal_sum_links(). 
This is useful to see quickly how many links connect nodes of different taxa or nodes within one taxon. 
At the 'Phylum' level used in this tutorial, cal_sum_links() sums the number of links from one Phylum to another Phylum and the links within the same Phylum. 
So the numbers along the outside of the circular plot represent how many edges are related to each Phylum. 
For example, roughly 900 edges in total are associated with the OTUs in Proteobacteria, 
of which roughly 200 edges connect two OTUs that both belong to Proteobacteria and roughly 150 edges connect OTUs from Proteobacteria with OTUs from Chloroflexi.

```{r, echo = TRUE, eval = FALSE}
t1$cal_sum_links(taxa_level = "Phylum")
# require chorddiag package; see https://github.com/mattflor/chorddiag
t1$plot_sum_links(plot_pos = TRUE, plot_num = 10)
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_sum_links.png")
```
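
The counting step behind such a plot can be sketched in base R with a made-up edge table. The node annotation, the edge table and the choice to count a within-Phylum edge once on the diagonal are assumptions of this sketch, not a description of the internal code of cal_sum_links().

```r
# toy node annotation and edge table (made-up data)
node_phylum <- c(OTU1 = "Proteobacteria", OTU2 = "Proteobacteria", OTU3 = "Chloroflexi",
	OTU4 = "Chloroflexi", OTU5 = "Acidobacteria")
edge_table <- data.frame(
	node1 = c("OTU1", "OTU1", "OTU2", "OTU3", "OTU4"),
	node2 = c("OTU2", "OTU3", "OTU4", "OTU4", "OTU5")
)
phyla <- sort(unique(node_phylum))
p1 <- factor(node_phylum[edge_table$node1], levels = phyla)
p2 <- factor(node_phylum[edge_table$node2], levels = phyla)
# count edges for each ordered pair of phyla, then make the table symmetric
tab <- unclass(table(p1, p2))
sum_links <- tab + t(tab)
# tab + t(tab) doubles the within-Phylum edges, so restore the single count
diag(sum_links) <- diag(tab)
sum_links
```

Each off-diagonal cell is the number of edges between two phyla, and each diagonal cell is the number of edges inside one Phylum.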


**Then let's show other implemented network construction approaches**:  
The SPIEC-EASI (SParse InversE Covariance Estimation for Ecological Association Inference) approach of the SpiecEasi R package [@Kurtz_Sparse_2015]
provides two network construction methods based on graphical models, which rely on algorithms for sparse neighborhood selection and sparse inverse covariance selection.
See https://github.com/zdk123/SpiecEasi for the package installation.
According to our tests, SpiecEasi_method = 'glasso' is very slow when there is a large number of features (such as hundreds to thousands).

```{r, echo = TRUE, eval=FALSE}
t1 <- trans_network$new(dataset = dataset, cor_method = NULL, taxa_level = "OTU", filter_thres = 0.001)
# require SpiecEasi package installed https://github.com/zdk123/SpiecEasi
# also see SpiecEasi::spiec.easi for available model parameters
t1$cal_network(network_method = "SpiecEasi", SpiecEasi_method = "mb")
# see t1$res_network
```

Another network construction approach comes from the Julia package FlashWeave [@Tackmann_Rapid_2019].
This is a probabilistic graphical-model-based method that relies on conditional independence. 
It predicts direct associations among microbes from large-scale compositional abundance data through statistical co-occurrence. 
To run the following code, please first install the Julia language and the FlashWeave package on your computer, and add Julia to the system PATH.


1. Download and install Julia from https://julialang.org/downloads/  
2. Add Julia to the PATH environment variable of your computer, such as your_directory_path\Julia\bin  
3. Open a terminal (or cmd or PowerShell), start Julia, and install FlashWeave following the instructions at https://github.com/meringlab/FlashWeave.jl  

```{r, echo = TRUE, eval=FALSE}
t1 <- trans_network$new(dataset = dataset, cor_method = NULL, taxa_level = "OTU", filter_thres = 0)
# require Julia in the computer path, and the package FlashWeave
# unlike the direct parameter passing of the 'SpiecEasi' network_method, FlashWeave_other_para is used to pass parameters to Julia FlashWeave
# the FlashWeave_tempdir parameter can be used to change the temporary working directory
t1$cal_network(network_method = "FlashWeave", FlashWeave_other_para = "alpha=0.01,sensitive=true,heterogeneous=true")
# see t1$res_network
```

The final method we want to show comes from the beemStatic package [@Li_BEEMStatic_2021].
This method can be applied to cross-sectional datasets to infer an interaction network based on the generalized Lotka-Volterra model,
which is typically used for microbial time-series data. 
The network from this approach is therefore a directed network.
Please see https://github.com/CSB5/BEEM-static for installing the R beemStatic package.

```{r, echo = TRUE, eval=FALSE}
t1 <- trans_network$new(dataset = dataset, cor_method = NULL, taxa_level = "OTU", filter_thres = 0.001)
# require beemStatic package installed
t1$cal_network(network_method = "beemStatic")
# we use cluster_optimal method for the directed network
t1$cal_module(method = "cluster_optimal")
```

### Network comparison
To compare different networks from trans_network class,
please see the meconetcomp package part (https://chiliubio.github.io/microeco_tutorial/meconetcomp-package.html).


### Key points

  + cal_network(): get a network named res_network based on different methods
  + get_node_table(): get node properties table
  + subset_network(): this function can extract any sub-network according to the input nodes, e.g. sub-network for modules or samples

### Other functions

  + cal_powerlaw(): perform a bootstrapping hypothesis test to determine whether the degrees follow a power law distribution and fit the degrees to a power law distribution.


## trans_nullmodel class
  In recent decades, 
the integration of phylogenetic analysis and null models has made the inference of niche and neutral influences on community assembly more powerful 
by adding a phylogenetic dimension [@Webb_Phylogenies_2002; @Picante_Kembel_2010; @Stegen_Quantifying_2013].
The trans_nullmodel class provides an encapsulation of these methods, including the calculation of the phylogenetic signal, 
beta mean pairwise phylogenetic distance (betaMPD), beta mean nearest taxon distance (betaMNTD), 
beta nearest taxon index (betaNTI), beta net relatedness index (betaNRI) and Bray-Curtis-based Raup-Crick (RCbray).
The approach for phylogenetic signal analysis is based on the Mantel correlogram [@Liu_Long_term_2017], 
in which the change of phylogenetic signal is intuitive and clear compared to other approaches.
The combination of RCbray and betaNTI can be used to infer the strength of each ecological process dominating community assembly 
under specific hypotheses [@Stegen_Quantifying_2013].

### Example

We first check the phylogenetic signal.

```{r, echo = TRUE}
# generate trans_nullmodel object
# as an example, we only use high-abundance OTUs with mean relative abundance > 0.0005
t1 <- trans_nullmodel$new(dataset, filter_thres = 0.0005, add_data = env_data_16S)
```

```{r, echo = TRUE, eval=FALSE}
# use pH as the test variable
t1$cal_mantel_corr(use_env = "pH")
# return t1$res_mantel_corr
# plot the mantel correlogram
t1$plot_mantel_corr()
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_mantel_corr.png")
```

betaNRI (ses.betampd) is used to show the 'basal' phylogenetic turnover.
Compared to betaNTI, it captures more turnover information associated with deep phylogeny.
Note that many null models have been developed over several decades of community ecology.
In the trans_nullmodel class, 
the default null model of betaNTI and betaNRI is the randomization of the phylogenetic relatedness among species.
This shuffling approach fixes the observed levels of species α-diversity and β-diversity to
explore whether the observed phylogenetic turnover differs significantly from the null model in which the phylogenetic relatedness among species is random.

```{r, echo = TRUE, eval=FALSE}
# see null.model parameter for other null models
# null model run 500 times for the example
t1$cal_ses_betampd(runs = 500, abundance.weighted = TRUE)
# return t1$res_ses_betampd
```
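
The logic of a standardized effect size under label shuffling can be sketched in base R. This is a minimal sketch under several assumptions: the community table is simulated, a random Euclidean distance matrix stands in for the phylogenetic distances among species, `beta_mpd` is a helper written for this example, and only 200 runs are used to keep it fast. It is not the implementation used by cal_ses_betampd.

```r
set.seed(42)
# toy community table: 4 samples x 6 species (simulated counts)
comm <- matrix(rpois(24, lambda = 5), nrow = 4,
	dimnames = list(paste0("S", 1:4), paste0("sp", 1:6)))
# stand-in for a phylogenetic (cophenetic) distance matrix among species
dis <- as.matrix(dist(matrix(rnorm(12), nrow = 6, dimnames = list(colnames(comm), NULL))))
# abundance-weighted betaMPD: mean distance between individuals of two samples
beta_mpd <- function(comm, dis){
	rel <- comm / rowSums(comm)
	rel %*% dis[colnames(comm), colnames(comm)] %*% t(rel)
}
obs <- beta_mpd(comm, dis)
# null model: shuffle the species labels of the distance matrix
runs <- 200
null <- replicate(runs, {
	shuffled <- dis
	dimnames(shuffled) <- rep(list(sample(colnames(dis))), 2)
	beta_mpd(comm, shuffled)
})
# standardized effect size for each pair of samples
ses <- (obs - apply(null, c(1, 2), mean)) / apply(null, c(1, 2), sd)
diag(ses) <- NA
round(ses, 2)
```

Because only the labels of the distance matrix are shuffled, the community table itself, and with it the observed α-diversity and β-diversity, stays unchanged in every run.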

If we want to plot the betaNRI, we can use the plot_group_distance function in the trans_beta class.
For example, the results show that the mean betaNRI of TW is significantly and markedly larger than those of CW and IW,
revealing that the basal phylogenetic turnover in TW is high.


```{r, echo = TRUE, eval=FALSE}
# add betaNRI matrix to beta_diversity list
dataset$beta_diversity[["betaNRI"]] <- t1$res_ses_betampd
# create trans_beta class, use measure "betaNRI"
t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI")
# transform the distance for each group
t2$cal_group_distance()
# see the help document for more methods, e.g. "anova" and "KW_dunn"
t2$cal_group_distance_diff(method = "wilcox")
# plot the results
g1 <- t2$plot_group_distance()
g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)
```

```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_betaNRI_one_dataset.png")
```

Sometimes, you may want to perform the null model analysis for each group individually, i.e. treat each group as one species pool.
In that case, the results should be calculated for each group separately.
The results show that, when betaNRI is calculated for each group separately, 
the mean betaNRI values of CW and TW are not significantly different, and both are significantly higher than that of IW, 
revealing that the strength of variable selection in CW and TW may be similar under the condition that each area is considered as a specific species pool.

```{r, echo = TRUE, eval=FALSE}
# we create a list to store the trans_nullmodel results.
sesbeta_each <- list()
group_col <- "Group"
all_groups <- unique(dataset$sample_table[, group_col])
# calculate for each group, respectively
for(i in all_groups){
	# same as the above operation, but 'group' and 'select_group' must be provided
	test <- trans_nullmodel$new(dataset, group = group_col, select_group = i, filter_thres = 0.0005)
	test$cal_ses_betampd(runs = 500, abundance.weighted = TRUE)
	sesbeta_each[[i]] <- test$res_ses_betampd
}
# merge and reshape to generate one symmetrical matrix
test <- lapply(sesbeta_each, reshape2::melt) %>% 
	do.call(rbind, .) %>%
	reshape2::dcast(., Var1~Var2, value.var = "value")
rownames(test) <- test[, 1]
test <- test[, -1, drop = FALSE]
# like the above operation
dataset$beta_diversity[["betaNRI"]] <- test
t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI")
t2$cal_group_distance()
# statistical analysis
t2$cal_group_distance_diff(method = "wilcox")
g1 <- t2$plot_group_distance()
g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)
```

```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_betaNRI_each_dataset.png")
```


betaNTI (ses.betamntd) can be used to indicate the phylogenetic terminal turnover [@Stegen_Quantifying_2013].

```{r, echo = TRUE, eval=FALSE}
# null model run 500 times
t1$cal_ses_betamntd(runs=500, abundance.weighted = TRUE)
# return t1$res_ses_betamntd
```

```{r, echo = FALSE}
# read
t1$res_ses_betamntd <- as.matrix(read.csv("Files/cal_ses_betamntd.csv", row.names = 1, check.names = FALSE))
```

```{r, echo = FALSE}
# show t1$res_ses_betamntd
pander::pander(t1$res_ses_betamntd[1:5, 1:5])
```


RCbray (Bray-Curtis-based Raup-Crick) can be calculated using the function cal_rcbray()
to assess whether the compositional turnover is governed primarily by drift [@Chase_null_2011].
The null model simulates species distributions by randomly sampling individuals from each 
species pool while preserving species occurrence frequency and sample species richness [@Liu_Long_term_2017].

```{r, echo = TRUE, eval=FALSE}
# null model run 1000 times
t1$cal_rcbray(runs = 1000)
# return t1$res_rcbray
```
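
How an observed Bray–Curtis value is turned into a Raup-Crick style index can be sketched as follows. The sketch covers only the final standardization step for one pair of samples and assumes that the null Bray–Curtis values are already available; the function name `rc_from_null` and the numbers are made up, and the null simulation itself is not reproduced here.

```r
# standardize an observed Bray-Curtis value against its null distribution
rc_from_null <- function(obs, null){
	less <- sum(null < obs)
	equal <- sum(null == obs)
	# fraction of null values below the observation (ties count as half), rescaled to [-1, 1]
	((less + 0.5 * equal) / length(null) - 0.5) * 2
}
# made-up null values for one pair of samples
null_bray <- c(0.30, 0.35, 0.40, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70)
# observed dissimilarity above every null value
rc_high <- rc_from_null(obs = 0.72, null = null_bray)
# observed dissimilarity inside the null distribution
rc_mid <- rc_from_null(obs = 0.40, null = null_bray)
# observed dissimilarity below every null value
rc_low <- rc_from_null(obs = 0.10, null = null_bray)
c(rc_high = rc_high, rc_mid = rc_mid, rc_low = rc_low)
```

Values close to +1 indicate that two communities are more dissimilar than expected under the null model, and values close to -1 indicate that they are more similar than expected.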

```{r, echo = FALSE}
# read
t1$res_rcbray <- as.matrix(read.csv("Files/cal_rcbray.csv", row.names = 1, check.names = FALSE))
```

As an example, we also calculate the proportion of the inferred processes in community assembly as shown in the references [@Stegen_Quantifying_2013; @Liu_Long_term_2017].
In the example, the fraction of pairwise comparisons with significant betaNTI values (|βNTI| > 2) is the estimated influence of Selection;
βNTI > 2 represents heterogeneous selection and βNTI < -2 represents homogeneous selection.
The value of RCbray characterizes the magnitude of deviation between the observed Bray–Curtis and the Bray–Curtis expected under randomization;
a value of |RCbray| > 0.95 is considered significant.
The fraction of all pairwise comparisons with |βNTI| < 2 and RCbray > +0.95 is taken as the influence of Dispersal Limitation combined with Drift.
The fraction of all pairwise comparisons with |βNTI| < 2 and RCbray < -0.95 is taken as an estimate of the influence of Homogenizing Dispersal.
The fraction of all pairwise comparisons with |βNTI| < 2 and |RCbray| < 0.95 estimates the influence of Drift acting alone.

```{r, echo = TRUE}
# use betaNTI and rcbray to evaluate processes
t1$cal_process(use_betamntd = TRUE)
# return t1$res_process
```

```{r, echo = TRUE, eval = FALSE}
t1$res_process
```

```{r, echo = FALSE}
pander::pander(t1$res_process)
```
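
The decision rules described above can be written as a small base R function. This is a sketch of the rules only: `infer_process` and its arguments are hypothetical names, the input values are made up, and comparisons that fall exactly on a threshold are assigned to the later branch here, which may differ from the behavior of cal_process.

```r
# assign one assembly process to each pairwise comparison from its betaNTI and RCbray values
infer_process <- function(betaNTI, rcbray, nti_cut = 2, rc_cut = 0.95){
	ifelse(betaNTI > nti_cut, "heterogeneous selection",
	ifelse(betaNTI < -nti_cut, "homogeneous selection",
	ifelse(rcbray > rc_cut, "dispersal limitation",
	ifelse(rcbray < -rc_cut, "homogenizing dispersal", "drift"))))
}
# made-up values for six pairwise comparisons
betaNTI <- c(3.1, -2.6, 0.4, -1.2, 1.5, 2.4)
rcbray <- c(0.20, -0.50, 0.99, -0.97, 0.10, 0.99)
process <- infer_process(betaNTI, rcbray)
process
# percentage of pairwise comparisons assigned to each process
round(100 * table(process) / length(process), 1)
```

Note that betaNTI is checked first, so the RCbray value only matters for comparisons with |βNTI| < 2, as in the last comparison of the toy data.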

The cal_tNST function can be used to calculate the normalized stochasticity ratio (NST) based on the NST package [@Ning_general_2019].

```{r, echo = TRUE, eval = FALSE}
t1$cal_tNST(group = "Group", dist.method = "bray", abundance.weighted = TRUE, output.rand = TRUE, SES = TRUE)
# result: t1$res_tNST
t1$res_tNST$index.grp
```

```{r, echo = FALSE}
t1$cal_tNST(group = "Group", dist.method = "bray", abundance.weighted = TRUE, output.rand = TRUE, SES = TRUE)
pander::pander(t1$res_tNST$index.grp)
```


```{r, echo = TRUE, eval = FALSE}
# test the NST difference between each pair of groups
t1$cal_tNST_test(method = "nst.boot")
```

For the nearest taxon index (NTI) and net relatedness index (NRI), please use cal_NTI and cal_NRI, respectively.

```{r, echo = TRUE, eval = FALSE}
t1$cal_NRI(null.model = "taxa.labels", abundance.weighted = FALSE, runs = 999)
t1$cal_NTI(null.model = "taxa.labels", abundance.weighted = TRUE, runs = 999)
```


### Key points

  + trans_nullmodel$new: the filter_thres parameter filters out taxa with relatively low abundance
  + cal_rcbray(): if only RCbray is needed, the other phylogenetic operations can be skipped
 
### Other function

  + cal_Cscore(): calculates the (normalised) mean number of checkerboard combinations (C-score) using C.score


## trans_classifier class

The trans_classifier class is a wrapper for methods of machine-learning-based classification models.
Microbiome-based supervised machine-learning has been successful in predicting human health status [@Poore_Microbiome_2020]
and soil categories [@Wilhelm_Predicting_2021]. 


### Dependencies

Before starting the examples, make sure the following packages have been installed.

```{r, echo = TRUE, eval = FALSE}
packages <- c("Boruta", "parallel", "rsample", "randomForest", "caret", "gridExtra", "multiROC")
# Now check or install
for(x in packages){
	if(!require(x, character.only = TRUE)) {
		install.packages(x, dependencies = TRUE)
	}
}
```

### Examples
In this section, we use the example data in the file2meco package (https://chiliubio.github.io/microeco_tutorial/file2meco-package.html) to demonstrate feature selection, 
model training and prediction with the random forest algorithm.

```{r, echo = TRUE, eval=FALSE}
library(file2meco)
abund_file_path <- system.file("extdata", "dada2_table.qza", package="file2meco")
sample_file_path <- system.file("extdata", "sample-metadata.tsv", package="file2meco")
taxonomy_file_path <- system.file("extdata", "taxonomy.qza", package="file2meco")
# construct microtable object
d1 <- qiime2meco(feature_table = abund_file_path, sample_table = sample_file_path, taxonomy_table = taxonomy_file_path)
d1$cal_abund()

# initialize: use "genotype" as response variable
# x.predictors parameter is used to select the taxa; here we use all the taxa data in d1$taxa_abund
t1 <- trans_classifier$new(dataset = d1, y.response = "genotype", x.predictors = "All")
```

We split the data into a training set and a testing set.

```{r, echo = TRUE, eval=FALSE}
# generate train and test set
t1$cal_split(prop.train = 3/4)
```
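
For readers who want to see what a 3/4 split means in practice, here is a base R sketch of a split that is drawn separately inside each class. The sample table and the class labels are made up, and whether cal_split stratifies in the same way should be checked in its help document.

```r
set.seed(10)
# toy sample table with an unbalanced response (made-up data)
sample_info <- data.frame(
	SampleID = paste0("S", 1:20),
	genotype = rep(c("wild type", "mutant"), times = c(12, 8))
)
prop_train <- 3/4
# draw the training samples separately inside each class
row_groups <- split(seq_len(nrow(sample_info)), sample_info$genotype)
train_idx <- unlist(lapply(row_groups, function(x){
	x[sample.int(length(x), size = round(length(x) * prop_train))]
}))
train <- sample_info[sort(train_idx), ]
test <- sample_info[-train_idx, ]
table(train$genotype)
table(test$genotype)
```

Drawing inside each class keeps the class proportions of the training set close to those of the full data, which matters when the response is unbalanced.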

Before training the model, we run set_trainControl to invoke the trainControl function of the caret package and generate the parameters used for training.
Here we use the default parameters of the trainControl function.

```{r, echo = TRUE, eval=FALSE}
# require caret package
t1$set_trainControl()
```

Now let's start model training with rf method.

```{r, echo = TRUE, eval=FALSE}
# use default parameter method = "rf"
t1$cal_train(max.ntree = 500)
```

We can use cal_predict function to predict the testing data set.

```{r, echo = TRUE, eval=FALSE}
t1$cal_predict()
# plot the confusionMatrix to check out the performance
t1$plot_confusionMatrix()
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_confusionMatrix_without_selection.png")
```
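
The numbers in a confusion matrix can be reproduced by hand in base R. The observed and predicted classes below are made up for illustration and are unrelated to the result shown in the figure.

```r
# made-up observed and predicted classes for 10 test samples
classes <- c("mutant", "wild type")
observed <- factor(rep(classes, times = c(4, 6)), levels = classes)
predicted <- factor(c("mutant", "mutant", "mutant", "wild type",
	"wild type", "wild type", "wild type", "wild type", "mutant", "wild type"), levels = classes)
conf_mat <- table(Prediction = predicted, Reference = observed)
conf_mat
# overall accuracy: correctly classified samples / all samples
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
# recall (sensitivity) of each class: correct predictions / observed members of the class
recall <- diag(conf_mat) / colSums(conf_mat)
accuracy
recall
```

The diagonal holds the correctly classified samples, so off-diagonal cells show which classes are confused with each other.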

The functions cal_ROC and plot_ROC can be used to get the ROC (Receiver Operating Characteristic) curve.

```{r, echo = TRUE, eval=FALSE}
t1$cal_ROC()
t1$plot_ROC(size = 0.5, alpha = 0.7)
```

```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_ROC_without_selection.png")
```
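
For a two-class problem, the points of a ROC curve and the area under it can be computed by hand. The following base R sketch uses made-up class probabilities without ties; it is only meant to show what the curve encodes and does not cover the multi-class case handled by the package.

```r
# made-up probabilities for the positive class and the observed labels (1 = positive)
prob <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10)
truth <- c(1, 1, 0, 1, 1, 0, 1, 0, 0, 0)
# walk through the samples from the highest to the lowest probability
ord <- order(prob, decreasing = TRUE)
tpr <- c(0, cumsum(truth[ord] == 1) / sum(truth == 1))
fpr <- c(0, cumsum(truth[ord] == 0) / sum(truth == 0))
# area under the curve with the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
data.frame(threshold = c(Inf, prob[ord]), fpr, tpr)
auc
```

Each row of the printed table is one point of the curve: lowering the threshold by one sample either raises the true positive rate or the false positive rate.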

When building a machine-learning model for microbiome data,
the huge diversity of the microbial community and/or the associations among taxa across the phylogeny can lead to a large number of unnecessary features,
which can reduce the overall accuracy, increase the complexity and overfitting of the model, and decrease its generalization capability.
So, feature selection is an important step in building a machine-learning model.
Here, we use the Boruta package [@Kursa_Feature_2010] to perform feature selection.

```{r, echo = TRUE, eval=FALSE}
# require Boruta package
t1$cal_feature_sel(boruta.maxRuns = 300, boruta.pValue = 0.01)
```

To compare the results of the procedures with and without feature selection,
we also perform the whole analysis with feature selection.

```{r, echo = TRUE, eval=FALSE}
t2 <- trans_classifier$new(dataset = d1, y.response = "genotype", x.predictors = "All")
t2$cal_feature_sel(boruta.maxRuns = 300, boruta.pValue = 0.01)
t2$cal_split(prop.train = 3/4)
t2$set_trainControl()
t2$cal_train(max.ntree = 500)
t2$cal_predict()
t2$plot_confusionMatrix()
t2$cal_ROC()
t2$plot_ROC(size = 0.5, alpha = 0.7)
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_confusionMatrix_with_selection.png")
```
```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_ROC_with_selection.png")
```

To plot the Precision-Recall curve (PR curve), please set plot_type = "PR" in the plot_ROC function.

```{r, echo = TRUE, eval=FALSE}
t2$plot_ROC(plot_type = "PR", size = 0.5, alpha = 0.7)
```
```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_PR_with_selection.png")
```

To show the ROC curve or PR curve of the training result, please set input = "train" in the cal_ROC function.

```{r, echo = TRUE, eval=FALSE}
t2$cal_ROC(input = "train")
t2$plot_ROC(plot_type = "ROC", size = 0.5, alpha = 0.7)
```
```{r, out.width = "500px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_ROC_with_selection_training.png")
```

For other machine-learning models, please use the method parameter of the cal_train function.
```{r, echo = TRUE, eval=FALSE}
# use SVM method
t2$cal_train(method = "svmRadial", tuneLength = 15)
```


### Key points

  + cal_feature_sel(): perform feature selection
  
### Other function

  + cal_feature_imp(): get feature importance from the training model when method is "rf"
  + cal_preProcess(): Pre-process (centering, scaling etc.) of the feature data based on the caret::preProcess function.


<!--chapter:end:05-Model-based_class.Rmd-->

# Explainable class

We group trans_env and trans_func classes into 'Explainable class', 
as environmental factors and microbial functions can be generally applied to explain microbial community structure and assembly.


## trans_env class
There may be some NA (missing values) in the user's environmental data.
If so, please add `complete_na = TRUE` for interpolation when creating the trans_env object.

### Example

There are at least two ways to create a trans_env object.
The following example uses additional environmental data that is not in the microtable object.
```{r, echo = TRUE}
# add_data is used to add the environmental data
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
```

A more general way is to directly use the data in the sample_table of your microtable object.
To show this operation, we first merge the additional table into sample_table to generate a new microtable object.

```{r, echo = TRUE}
new_test <- clone(dataset)
new_test$sample_table <- data.frame(new_test$sample_table, env_data_16S[rownames(new_test$sample_table), ])
# now new_test$sample_table has the whole data
new_test
```

Now let's use env_cols to select the required columns from sample_table in the microtable object.

```{r, echo = TRUE}
t1 <- trans_env$new(dataset = new_test, env_cols = 8:15)
```

Generally, a good understanding of the environmental variables helps in applying the other methods properly.
So, we first show the cal_diff and cal_autocor functions.
The cal_diff function is used to test the significance of variables across groups, as we have shown in the trans_alpha and trans_diff class parts.


```{r, echo = TRUE, eval = FALSE}
# use Wilcoxon Rank Sum and Signed Rank Tests as an example
t1$cal_diff(group = "Group", method = "wilcox")
t1$res_diff[, c(1, 2, 4, 6, 7)]
```
```{r, echo = FALSE}
t1$cal_diff(group = "Group", method = "wilcox")
pander::pander(t1$res_diff[1:7, c(1, 2, 4, 6, 7)])
```
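
What such a test does for each variable can be sketched in base R with `wilcox.test`. The environmental table is simulated, the column names of the result (`Measure`, `P.unadj`, `P.adj`) are chosen for this sketch, and the p value adjustment with `p.adjust` is an assumption added here; the table returned by cal_diff has its own layout.

```r
set.seed(7)
# toy environmental table: 12 samples in two groups (simulated values)
env <- data.frame(
	Group = rep(c("A", "B"), each = 6),
	pH = c(rnorm(6, mean = 5.5, sd = 0.2), rnorm(6, mean = 7.5, sd = 0.2)),
	Temperature = rnorm(12, mean = 15, sd = 2)
)
vars <- setdiff(colnames(env), "Group")
# Wilcoxon rank sum test of each variable between the two groups
res <- data.frame(
	Measure = vars,
	P.unadj = unname(sapply(vars, function(v) wilcox.test(env[[v]] ~ env$Group)$p.value))
)
# adjust the p values for multiple testing
res$P.adj <- p.adjust(res$P.unadj, method = "fdr")
res
```

In the simulated data only pH was generated with a group difference, so it is the variable expected to stand out.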

Let’s perform the ANOVA and show the letters in the box plot. We use a list to store the plots of all factors and plot them together.

```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(method = "anova", group = "Group")
# place all the plots into a list
tmp <- list()
for(i in colnames(t1$data_env)){
    tmp[[i]] <- t1$plot_diff(measure = i, add_sig_text_size = 5, xtext_size = 12) + theme(plot.margin = unit(c(0.1, 0, 0, 1), "cm"))
}
plot(gridExtra::arrangeGrob(grobs = tmp, ncol = 3))
```
```{r, out.width = "750px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_env_diff_all.png")
```


From v0.12.0, the trans_env class supports differential tests among the groups of one factor within each group of another factor by using the by_group parameter of the cal_diff function.

```{r, echo = TRUE, eval = FALSE}
t1$cal_diff(group = "Type", by_group = "Group", method = "anova")
t1$plot_diff(measure = "pH", add_sig_text_size = 5)
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_env_diff_bygroup.png")
```


Then we show the autocorrelations among variables.

```{r, echo = TRUE, eval = FALSE}
# require GGally package installed
t1$cal_autocor()
```
```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_env_autocor1.png")
```

For different groups, please use group parameter to show the distributions of variables and the autocorrelations across groups.

```{r, echo = TRUE, eval = FALSE}
t1$cal_autocor(group = "Group")
```
```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/trans_env_autocor_group.png")
```

Then let's do the RDA analysis (db-RDA and RDA).

```{r, echo = TRUE, eval = FALSE}
# use bray-curtis distance to do dbrda
t1$cal_ordination(method = "dbRDA", use_measure = "bray")
# t1$res_rda is the result list stored in the object
t1$trans_ordination(adjust_arrow_length = TRUE, max_perc_env = 1.5)
# t1$res_rda_trans is the transformed result for plotting
t1$plot_ordination(plot_color = "Group")
```

```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_rda_dbrda.png")
```

From v0.14.0, the significance of the ordination model is checked with the separate function `cal_ordination_anova` instead of inside `cal_ordination`.
Furthermore, the function `cal_ordination_envfit` can be used to get the contribution of each variable to the model.

```{r, echo = TRUE, eval = FALSE}
t1$cal_ordination_anova()
t1$cal_ordination_envfit()
```
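
For intuition, the following base R sketch shows what a constrained ordination and a permutation test of its significance compute. It is a simplified illustration on simulated data, not the procedure implemented in microeco or vegan: the helper `constrained_prop` is hypothetical, the statistic is the proportion of constrained variance rather than a pseudo-F, and samples are permuted directly.

```{r, echo = TRUE, eval = FALSE}
set.seed(123)
n <- 20
# simulated environmental variables and a community partly driven by them
env <- cbind(pH = rnorm(n), Temperature = rnorm(n))
Y <- cbind(
    sp1 = 2 * env[, "pH"] + rnorm(n, sd = 0.5),
    sp2 = -env[, "Temperature"] + rnorm(n, sd = 0.5),
    sp3 = rnorm(n)
)
Yc <- scale(Y, center = TRUE, scale = FALSE)
Xc <- scale(env, center = TRUE, scale = FALSE)
# proportion of the total variance of y explained by regressing it on x
constrained_prop <- function(y, x) {
    fitted_y <- qr.fitted(qr(x), y)
    sum(fitted_y^2) / sum(y^2)
}
obs <- constrained_prop(Yc, Xc)
# permutation test: shuffle the samples of Y to break the link with env
perm <- replicate(199, constrained_prop(Yc[sample(n), , drop = FALSE], Xc))
p_value <- (sum(perm >= obs) + 1) / (length(perm) + 1)
c(explained = obs, p = p_value)
```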

Then, let's try to do RDA at the Genus level.

```{r, echo = TRUE, eval = FALSE}
# use Genus
t1$cal_ordination(method = "RDA", taxa_level = "Genus")
# As the main results of RDA are related with the projection and angles between different arrows,
# we adjust the length of the arrow to show them clearly using several parameters.
t1$trans_ordination(show_taxa = 10, adjust_arrow_length = TRUE, max_perc_env = 1.5, max_perc_tax = 1.5, min_perc_env = 0.2, min_perc_tax = 0.2)
# t1$res_rda_trans is the transformed result for plotting
t1$plot_ordination(plot_color = "Group")
```

```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_rda_genus.png")
```

For more plotting ways, run the following examples.

```{r, echo = TRUE, eval = FALSE}
t1$plot_ordination(plot_color = "Group", plot_shape = "Group")
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "ellipse"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "centroid"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "chull"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "ellipse", "centroid"))
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = c("point", "chull", "centroid"), add_sample_label = "SampleID")
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_type = "centroid", centroid_segment_alpha = 0.9, centroid_segment_size = 1, centroid_segment_linetype = 1)
t1$plot_ordination(plot_color = "Type", plot_type = c("point", "centroid"), centroid_segment_linetype = 1)
```


The Mantel test can be used to check whether there are significant correlations between environmental variables and the distance matrix.

```{r, echo = TRUE, eval = FALSE}
t1$cal_mantel(use_measure = "bray")
# return t1$res_mantel
t1$res_mantel
```

```{r, echo = FALSE}
t1$cal_mantel(use_measure = "bray")
pander::pander(t1$res_mantel[, -c(2:3)])
```

For the combination of the Mantel test and the correlation heatmap, 
please see another example (https://chiliubio.github.io/microeco_tutorial/other-examples-1.html#mantel-test-correlation-heatmap).  
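
As a rough illustration of what a Mantel test measures, the sketch below correlates the lower triangles of two distance matrices and assesses the result by permuting the samples of one matrix. Everything here is an assumption for demonstration: the helper `bray_curtis`, the simulated abundance table and the pH values are made up, and the sketch does not reproduce the options of `cal_mantel`.

```{r, echo = TRUE, eval = FALSE}
# Bray-Curtis dissimilarity for a sample-by-taxon abundance matrix
bray_curtis <- function(x) {
    n <- nrow(x)
    d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
    for (i in seq_len(n)) {
        for (j in seq_len(n)) {
            d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
        }
    }
    d
}
set.seed(42)
abund <- matrix(rpois(8 * 5, lambda = 20), nrow = 8,
    dimnames = list(paste0("S", 1:8), paste0("OTU", 1:5)))
pH <- c(5.0, 5.3, 5.9, 6.2, 6.8, 7.1, 7.5, 8.0)
comm_dist <- bray_curtis(abund)
env_dist <- as.matrix(dist(pH))
# Mantel statistic: correlation between the two sets of pairwise distances
lower <- lower.tri(comm_dist)
mantel_r <- cor(comm_dist[lower], env_dist[lower])
# permute rows and columns of one matrix together to build the null distribution
perm_r <- replicate(999, {
    idx <- sample(nrow(env_dist))
    cor(comm_dist[lower], env_dist[idx, idx][lower])
})
p_value <- (sum(perm_r >= mantel_r) + 1) / (length(perm_r) + 1)
c(r = mantel_r, p = p_value)
```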


The correlations between environmental variables and taxa are important in analyzing and inferring the factors affecting community structure.
Let's first perform a correlation heatmap using relative abundance data at Genus level with the `cal_cor` function.
The parameter `p_adjust_type` can control the p value adjustment type. 
The default `p_adjust_type = "Env"` means p adjustment is performed for each environmental variable separately.
If the user needs to adjust p values for all the results together, please use `p_adjust_type = "Type"`.


```{r, echo = TRUE}
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
t1$cal_cor(use_data = "Genus", p_adjust_method = "fdr", p_adjust_type = "Env")
# return t1$res_cor
```
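
The difference between adjusting p values separately for each environmental variable and adjusting all results together can be seen with `p.adjust` from base R. The table below is hypothetical and only mimics the shape of a correlation result; it is not the output of `cal_cor`.

```{r, echo = TRUE, eval = FALSE}
# hypothetical correlation results: two taxa tested against two variables
res <- data.frame(
    Env = c("pH", "pH", "Temperature", "Temperature"),
    Taxa = c("g__A", "g__B", "g__A", "g__B"),
    Pvalue = c(0.01, 0.02, 0.03, 0.04)
)
# adjustment across all results together
res$Padj_all <- p.adjust(res$Pvalue, method = "fdr")
# adjustment within each environmental variable separately
res$Padj_by_env <- ave(res$Pvalue, res$Env, FUN = function(p) p.adjust(p, method = "fdr"))
res
```

The two columns differ because the number of tests entering each adjustment differs: four tests when all results are pooled, two when each variable is handled on its own.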

Then, we can plot the correlation results using the plot_cor function.

```{r, echo = TRUE, eval = FALSE}
# default ggplot2 method with clustering
t1$plot_cor()
```

There are too many genera.
We can use the filter_feature parameter in plot_cor to filter out the taxa that have no correlation with a significance of p < 0.001.

```{r, echo = TRUE, eval = FALSE}
# filter genera that do not have at least one ***
t1$plot_cor(filter_feature = c("", "*", "**"))
```
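
The filtering logic can be sketched in base R as below. The table `res` and its column names are assumptions made for the illustration; the point is only that a taxon is dropped when all of its significance labels belong to the set passed to the filter.

```{r, echo = TRUE, eval = FALSE}
# hypothetical long-format correlation table with significance labels
res <- data.frame(
    Taxa = c("g__A", "g__A", "g__B", "g__B", "g__C", "g__C"),
    Env = rep(c("pH", "TOC"), times = 3),
    Significance = c("***", "", "*", "**", "", ""),
    stringsAsFactors = FALSE
)
drop_labels <- c("", "*", "**")
# a taxon is kept if at least one of its labels is outside drop_labels
keep_taxa <- unique(res$Taxa[!res$Significance %in% drop_labels])
filtered <- res[res$Taxa %in% keep_taxa, ]
filtered
```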

Sometimes, if the user wants to do the correlation analysis between the environmental factors and some important taxa detected in the biomarker analysis,
please use the **other_taxa** parameter in the cal_cor function.


```{r, echo = TRUE, eval = FALSE}
# first create trans_diff object as a demonstration
t2 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "Genus")
# then create trans_env object
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
# use other_taxa to select taxa you need
t1$cal_cor(use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_diff$Taxa[1:40])
t1$plot_cor()
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_corr_ggplot.png")
```

The pheatmap method is also available.
Note that, besides the **color_vector parameter**, 
**color_palette** can also be used to control color palette with customized colors.

```{r, echo = TRUE, eval = FALSE}
# clustering heatmap; require pheatmap package
# Let's take another color palette
t1$plot_cor(pheatmap = TRUE, color_palette = rev(RColorBrewer::brewer.pal(n = 9, name = "RdYlBu")))
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_corr_pheatmap.png")
```

Sometimes, if it is necessary to study the correlations between environmental variables and taxa for different groups,
**by_group parameter** can be used for this goal.

```{r, echo = TRUE, eval = FALSE}
# calculate correlations for different groups using parameter by_group
t1$cal_cor(by_group = "Group", use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_diff$Taxa[1:40])
# return t1$res_cor
t1$plot_cor()
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_corr_ggplot_groups.png")
```


If the user is concerned with the relationship between environmental factors and alpha diversity, 
please use **add_abund_table parameter** in the cal_cor function.

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
# use add_abund_table parameter to add the extra data table
t1$cal_cor(add_abund_table = dataset$alpha_diversity)
# try to use ggplot2 with clustering plot
# require ggtree and aplot packages to be installed (https://chiliubio.github.io/microeco_tutorial/intro.html#dependence)
t1$plot_cor(cluster_ggplot = "row")
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_corr_alpha_diversity.png")
```

The function plot_scatterfit() in the trans_env class draws a scatter plot and adds the fitted line and the statistics of the correlation or regression.

```{r, echo = TRUE, eval = FALSE}
# use pH and bray-curtis distance
# add correlation statistics
t1$plot_scatterfit(
	x = "pH", 
	y = dataset$beta_diversity$bray[rownames(t1$data_env), rownames(t1$data_env)], 
	type = "cor",
	point_alpha = 0.1, label.x.npc = "center", label.y.npc = "bottom", 
	x_axis_title = "Euclidean distance of pH", 
	y_axis_title = "Bray-Curtis distance"
	)
```

```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_scatterfit_cor.png")
```

```{r, echo = TRUE, eval = FALSE}
# regression with type = "lm", use group parameter for different groups
t1$plot_scatterfit(
    x = dataset$beta_diversity$bray[rownames(t1$data_env), rownames(t1$data_env)],
	y = "pH",
	type = "lm", 
	group = "Group", 
	group_order = c("CW", "TW", "IW"),
	point_size = 3, point_alpha = 0.3, line_se = FALSE, line_size = 1.5, shape_values = c(16, 17, 7),
    y_axis_title = "Euclidean distance of pH", x_axis_title = "Bray-Curtis distance"
)
```

```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_scatterfit_lmgroup.png")
```
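
The statistics annotated on such plots come from an ordinary correlation test or linear regression. A minimal base R sketch with made-up vectors `x` and `y` (plain numeric vectors are used here instead of distance matrices to keep the example short):

```{r, echo = TRUE, eval = FALSE}
# hypothetical paired values with a nearly linear relationship
x <- 1:10
y <- 2 * x + 1 + rep(c(0.1, -0.1), times = 5)
# regression statistics
fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])
r_squared <- summary(fit)$r.squared
# correlation statistics
cor_p <- cor.test(x, y)$p.value
c(slope = slope, r_squared = r_squared, cor_p = cor_p)
```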


### Key points

  + complete_na parameter in trans_env$new: used to fill the NA (missing value) of the environmental data based on the mice package.
  + env_cols parameter in trans_env$new: select the variables from sample_table of your microtable object.
  + add_abund_table parameter in cal_cor: other customized data can be also provided for the correlation analysis.
  + type parameter in plot_scatterfit: both the correlation (type = "cor") and regression (type = "lm") are available in this function.
  + cal_mantel(): partial_mantel = TRUE can be used for partial mantel test.
  + plot_ordination(): use plot_type parameter to select point types and env_nudge_x and taxa_nudge_x (also _y) to adjust the text positions.


## trans_func class

　Ecological researchers are usually interested in the functional profiles of microbial communities,
because functional or metabolic data are powerful for explaining the structure and dynamics of microbial communities.
As metagenomic sequencing is complicated and expensive, using amplicon sequencing data to predict functional profiles is an alternative choice. 
Several software tools are often used for this goal, such as PICRUSt [@Langille_Predictive_2013], Tax4Fun [@Aßhauer_Tax4Fun_2015] and FAPROTAX [@Louca_High_2016; @Louca_Decoupling_2016].
These tools are well suited to predicting functional profiles based on the prokaryotic communities from sequencing results.
In addition, it is also important to obtain the traits or functions of each taxon, not just the whole profile of communities.
The FAPROTAX database is a collection of the traits and functions of prokaryotes based on the known research results published in books and the literature.
We match the taxonomic information of prokaryotes against this database to predict the traits of prokaryotes on biogeochemical roles.
The NJC19 database [@Lim_Large_2020] is also available for animal-associated prokaryotic data, such as human gut microbiota.
We also implement the FUNGuild [@Nguyen_FUNGuild_2016] and FungalTraits [@Polme_FungalTraits_2020] databases to predict the fungal traits.
The idea of identifying prokaryotic traits and functional redundancy was initially inspired by another of our studies [@Liu_Microbial_2022].


### Example

We first identify/predict traits of taxa with the prokaryotic example data.

```{r, echo = TRUE}
# create object of trans_func
t2 <- trans_func$new(dataset)
# mapping the taxonomy to the database
# this can recognize prokaryotes or fungi automatically if the names of taxonomic levels are standard.
# for fungi example, see https://chiliubio.github.io/microeco_tutorial/other-dataset.html#fungi-data
# default database for prokaryotes is FAPROTAX database
t2$cal_spe_func(prok_database = "FAPROTAX")
# return t2$res_spe_func; 1 means the trait exists, 0 means it is absent or cannot be confirmed.
```

```{r, echo = TRUE, eval = FALSE}
t2$res_spe_func[1:5, 1:2]
```

```{r, echo = FALSE}
pander::pander(t2$res_spe_func[1:5, 1:2])
```
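
Conceptually, this step matches taxonomic names against a trait lookup and records the result as a 0/1 table. The sketch below illustrates the idea in base R with a tiny hypothetical lookup `trait_db` and taxonomy table `tax`; the real databases and the matching rules used by `cal_spe_func` are far more detailed.

```{r, echo = TRUE, eval = FALSE}
# hypothetical lookup: which genera are annotated with which traits
trait_db <- list(
    methanotrophy = c("g__Methylocystis", "g__Methylobacter"),
    nitrification = c("g__Nitrospira", "g__Nitrosomonas")
)
# hypothetical taxonomy of three OTUs
tax <- data.frame(
    Genus = c("g__Nitrospira", "g__Methylocystis", "g__Bacillus"),
    row.names = c("OTU_1", "OTU_2", "OTU_3")
)
# 1 = the genus is listed for the trait, 0 = not listed or unknown
trait_mat <- sapply(trait_db, function(genera) as.integer(tax$Genus %in% genera))
rownames(trait_mat) <- rownames(tax)
trait_mat
```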

The percentages of the OTUs having the same trait can reflect the functional redundancy of this function in the community.

```{r, echo = TRUE}
# calculate the percentages for communities
# here do not consider the abundance
t2$cal_spe_func_perc(abundance_weighted = FALSE)
# t2$res_spe_func_perc[1:5, 1:2]
```

```{r, echo = FALSE}
pander::pander(t2$res_spe_func_perc[1:5, 1:2])
```
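
One plausible reading of the unweighted and abundance-weighted percentages is sketched below in base R. The matrices `trait_mat` and `abund` are hypothetical, and the formulas are an assumption made for illustration rather than the exact computation in `cal_spe_func_perc`.

```{r, echo = TRUE, eval = FALSE}
# OTU-by-trait matrix (1 = trait present), filled column by column
trait_mat <- matrix(c(1, 0, 1, 0,
                      0, 1, 0, 0), ncol = 2,
    dimnames = list(paste0("OTU_", 1:4), c("nitrification", "methanotrophy")))
# OTU-by-sample abundance table
abund <- matrix(c(10, 0, 30, 60,
                   5, 5,  0, 90), ncol = 2,
    dimnames = list(paste0("OTU_", 1:4), c("S1", "S2")))
# unweighted: share of the OTUs present in a sample that carry the trait
present <- (abund > 0) * 1
unweighted <- 100 * t(present) %*% trait_mat / colSums(present)
# weighted: share of the total abundance contributed by OTUs carrying the trait
weighted <- 100 * t(abund) %*% trait_mat / colSums(abund)
round(unweighted, 1)
weighted
```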

Then we also take an example to show the percentages of the OTUs for each trait in network modules.

```{r, echo = TRUE, eval = FALSE}
# construct a network for the example
network <- trans_network$new(dataset = dataset, cal_cor = "base", taxa_level = "OTU", filter_thres = 0.0001, cor_method = "spearman")
network$cal_network(COR_p_thres = 0.01, COR_cut = 0.7)
network$cal_module()
# convert module info to microtable object
meco_module <- network$trans_comm(use_col = "module")
meco_module_func <- trans_func$new(meco_module)
meco_module_func$cal_spe_func(prok_database = "FAPROTAX")
meco_module_func$cal_spe_func_perc(abundance_weighted = FALSE)
meco_module_func$plot_spe_func_perc(order_x = paste0("M", 1:10))
```
```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_func_perc_module.png")
```

```{r, echo = TRUE, eval = FALSE}
# If you want to change the group list, reset the list t2$func_group_list
t2$func_group_list
# use show_prok_func to see the detailed information of prokaryotic traits
t2$show_prok_func("methanotrophy")
```


```{r, echo = TRUE, eval = FALSE}
# then we try to correlate the res_spe_func_perc of communities to environmental variables
t3 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
t3$cal_cor(add_abund_table = t2$res_spe_func_perc, cor_method = "spearman")
t3$plot_cor(pheatmap = TRUE)
```

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_func_perc_corr.png")
```


Tax4Fun [@Aßhauer_Tax4Fun_2015] requires a strict input file format associated with the taxonomic information.
To analyze the trimmed or changed OTU data in R with Tax4Fun, we provide a link to the Tax4Fun functional prediction.
Please check out the dependence part https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun to install the Tax4Fun package and download the SILVA123 reference data.

```{r, echo = TRUE, eval = TRUE, message=FALSE}
t1 <- trans_func$new(dataset)
# https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun for the installation description
# and provide the file path of SILVA123
t1$cal_tax4fun(folderReferenceData = "./SILVA123")
# return two files: t1$tax4fun_KO: KO file; t1$tax4fun_path: pathway file.
# t1$tax4fun_KO$Tax4FunProfile[1:5, 1:2]
```


```{r, echo = FALSE}
pander::pander(t1$tax4fun_KO$Tax4FunProfile[1:5, 1:2])
```


We further analyze the abundance of predicted metabolic pathways.

```{r, echo = TRUE, eval = TRUE}
# must transpose to taxa row, sample column
pathway_file <- t1$tax4fun_path$Tax4FunProfile %>% t %>% as.data.frame
# filter rownames, only keep ko+number
rownames(pathway_file) %<>% gsub("(^.*);\\s.*", "\\1", .)
# load the pathway hierarchical metadata
data(Tax4Fun2_KEGG)
# further create a microtable object, familiar?
func1 <- microtable$new(otu_table = pathway_file, tax_table = Tax4Fun2_KEGG$ptw_desc, sample_table = t1$sample_table)
print(func1)
```
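
The regular expression used above keeps only the part of each name before the first "; " separator. A small base R check on two made-up pathway labels (the exact label format produced by Tax4Fun is assumed here):

```{r, echo = TRUE, eval = FALSE}
# hypothetical pathway labels of the form "id; description"
path_names <- c("ko00010; Glycolysis / Gluconeogenesis", "ko00020; Citrate cycle (TCA cycle)")
# keep the text captured before the semicolon and following whitespace
ids <- gsub("(^.*);\\s.*", "\\1", path_names)
ids
```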

Now, we need to trim data and calculate abundance.
```{r, echo = TRUE, eval = TRUE}
func1$tidy_dataset()
# calculate abundance automatically at three levels: Level.1, Level.2, Level.3
func1$cal_abund()
print(func1)
```
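
Calculating abundance at a hierarchical level essentially sums the rows that share the same label at that level. The base R sketch below shows this aggregation with hypothetical tables `pathway_abund` and `hierarchy`; it is an illustration of the idea, not the internals of `cal_abund`.

```{r, echo = TRUE, eval = FALSE}
# hypothetical pathway-by-sample abundance table
pathway_abund <- matrix(c(10, 20, 30,
                           1,  2,  3), ncol = 2,
    dimnames = list(c("ko00010", "ko00020", "ko03010"), c("S1", "S2")))
# hypothetical hierarchy table with one higher level
hierarchy <- data.frame(
    Level.1 = c("Metabolism", "Metabolism", "Genetic Information Processing"),
    row.names = c("ko00010", "ko00020", "ko03010")
)
# sum the pathways belonging to the same Level.1 category
level1_abund <- rowsum(pathway_abund, group = hierarchy[rownames(pathway_abund), "Level.1"])
# convert to relative abundance within each sample
level1_rel <- sweep(level1_abund, 2, colSums(level1_abund), "/")
level1_abund
level1_rel
```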

Then, we can plot the abundance.
```{r, echo = TRUE, eval = FALSE}
# bar plot at Level.1
func2 <- trans_abund$new(func1, taxrank = "Level.1", groupmean = "Group")
func2$plot_bar(legend_text_italic = FALSE)
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar_tax4fun1.png")
```

Other analyses are also possible. For example, we can use LEfSe to test the differences in abundance and find the important enriched pathways across groups.

```{r, echo = TRUE, eval = FALSE}
func2 <- trans_diff$new(dataset = func1, method = "lefse", group = "Group", alpha = 0.05, lefse_subgroup = NULL)
func2$plot_diff_bar(threshold = 3, width = 0.8)
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_lefse_bar_tax4fun.png")
```

Tax4Fun2 [@Wemheuer_Tax4Fun2_2020] is another R package for the prediction of functional profiles of prokaryotic communities from 16S rRNA gene sequences.
It also provides two indexes for the evaluation of functional gene redundancies.
If the user wants to use the Tax4Fun2 method, the representative fasta file must be added to the microtable object.
Please check out https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun2 to see
how to read the fasta file with `read.fasta` of the seqinr package or `readDNAStringSet` of the Biostrings package.
Please also see https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun2 for downloading ncbi-blast and Ref99NR/Ref100NR.
For Windows, ncbi-blast-2.5.0+ is recommended since other versions may not work well.


```{r, echo = TRUE, eval = FALSE}
# first delete the dataset created before
rm(dataset)
# load the example dataset from microeco package as there is the rep_fasta object in it
data(dataset)
dataset

t1 <- trans_func$new(dataset)
# create a directory for result and log files
dir.create("test_prediction")
# https://chiliubio.github.io/microeco_tutorial/intro.html#tax4fun2 for installation
# ignore blast_tool_path parameter if blast tools have been in path
# the function can search whether blast tool directory is in the path, if not, automatically use provided blast_tool_path parameter
t1$cal_tax4fun2(blast_tool_path = "ncbi-blast-2.5.0+/bin", path_to_reference_data = "Tax4Fun2_ReferenceData_v2",
  database_mode = "Ref99NR", path_to_temp_folder = "test_prediction")

# prepare feature table and metadata
data(Tax4Fun2_KEGG)
# create a microtable object for pathways
func2 <- microtable$new(otu_table = t1$res_tax4fun2_pathway, tax_table = Tax4Fun2_KEGG$ptw_desc, sample_table = dataset$sample_table)
func2$tidy_dataset()
func2$cal_abund()

# calculate functional redundancies
t1$cal_tax4fun2_FRI()
```

### Key points

  + blast_tool_path parameter in cal_tax4fun2: if the blast tools are already in the PATH environment variable of the computer, blast_tool_path = NULL can be used.
  + blast version: Tax4Fun2 requires the NCBI blast tools. However, errors often arise with the latest versions (https://www.biostars.org/p/413294/). An easy solution is to use a previous version (such as v2.5.0).
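
Whether a blast executable is already reachable can be checked from R before deciding on `blast_tool_path`. The following is a small sketch using base R; it assumes that `blastn` is one of the executables needed, and the fallback directory is only a placeholder.

```{r, echo = TRUE, eval = FALSE}
# Sys.which returns "" when the executable is not found on the PATH
blast_on_path <- unname(nzchar(Sys.which("blastn")))
# hypothetical fallback directory; replace it with your own installation
blast_dir <- if (blast_on_path) NULL else "ncbi-blast-2.5.0+/bin"
blast_on_path
blast_dir
```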


<!--chapter:end:06-Explainable_class.Rmd-->

# file2meco package

In the microtable class part, we showed the basic way to create a microtable object with the example data.
In practice, constructing the microtable object from other tools/platforms (e.g., QIIME, QIIME2, HUMAnN, Kraken2 and phyloseq) 
can be easily achieved with the package file2meco (https://github.com/ChiLiubio/file2meco).
The idea of creating the file2meco package comes from a study involving complex metagenomic analysis [@Liu_Microbial_2022].
The following examples apply to the file2meco package v0.5.0.

```{r, echo = TRUE, eval = FALSE}
# install file2meco
# https://github.com/ChiLiubio/file2meco
if(!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
install.packages("file2meco", repos = BiocManager::repositories())
```
▲ Troubleshooting:  
• The files in the following examples all come from the file2meco package and are located automatically by the function `system.file`, irrespective of the operating system.
When importing your own file, please do not use this function.
First make sure that R can find your input file:
either provide the full path to the import function, or put the file into the R working directory and provide only the file name.
For the latter, if you do not know where the working directory is,
please use the function `getwd` to find it, or create a new project in the target directory with RStudio [File --> New Project].
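
A quick way to confirm that R can find a file before importing it is sketched below. A temporary file stands in for your own input file, so the file name and its content are placeholders.

```{r, echo = TRUE, eval = FALSE}
# where R currently looks for relative file names
getwd()
# a temporary file standing in for your own input file
my_file <- file.path(tempdir(), "otu_table_raw.txt")
writeLines("OTU_ID\tS1\tS2", my_file)
# check that R can find the file before passing the path to an import function
file.exists(my_file)
# the full path that could be handed to an import function such as qiime1meco()
normalizePath(my_file)
```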


## QIIME

The qiime1meco() function can be used to construct the microtable object using the raw OTU file from QIIME 1 [@Caporaso_QIIME_2010].

```{r, echo = TRUE, eval = FALSE}
library(file2meco)
# see the help document
?qiime1meco
# Let's run the examples
# use the raw data files stored inside the package
otu_file_path <- system.file("extdata", "otu_table_raw.txt", package="file2meco")
sample_file_path <- system.file("extdata", "sample_info.csv", package="file2meco")
phylo_file_path <- system.file("extdata", "rep_phylo.tre", package="file2meco")
# if you want to use the Tax4Fun2 approach, you need to read the representative sequences and add them to the microtable object.
rep_fasta_path <- system.file("extdata", "rep.fna", package="file2meco")
# construct microtable object
qiime1meco(otu_file_path)
qiime1meco(otu_file_path, sample_table = sample_file_path)
qiime1meco(otu_file_path, sample_table = sample_file_path, phylo_tree = phylo_file_path)
qiime1meco(otu_file_path, sample_table = sample_file_path, phylo_tree = phylo_file_path, rep_fasta = rep_fasta_path)
```


## QIIME2

The qiime2meco() function is designed to create the microtable object using files from QIIME2 [@Bolyen_Reproducible_2019].

```{r, echo = TRUE, eval = FALSE}
library(file2meco)
?qiime2meco
# use data files inside the package which were downloaded from (https://docs.qiime2.org/2022.2/tutorials/pd-mice/).
abund_file_path <- system.file("extdata", "dada2_table.qza", package="file2meco")
sample_file_path <- system.file("extdata", "sample-metadata.tsv", package="file2meco")
taxonomy_file_path <- system.file("extdata", "taxonomy.qza", package="file2meco")
# construct microtable object
qiime2meco(abund_file_path)
qiime2meco(abund_file_path, sample_table = sample_file_path, taxonomy_table = taxonomy_file_path)
# add phylogenetic tree and fasta for more demonstrations
# please download tree from https://docs.qiime2.org/2022.2/data/tutorials/pd-mice/tree.qza
# the file name is 'tree.qza'; put it into the R working directory
tree_data <- "tree.qza"
# please download fasta from https://docs.qiime2.org/2022.2/data/tutorials/pd-mice/dada2_rep_set.qza
# the file name is 'dada2_rep_set.qza'; put it into the R working directory
rep_data <- "dada2_rep_set.qza"
test1 <- qiime2meco(abund_file_path, sample_table = sample_file_path, taxonomy_table = taxonomy_file_path, phylo_tree = tree_data, rep_fasta = rep_data, auto_tidy = TRUE)
test1
```


## HUMAnN

Many methods in the microeco package can be used not only for traditional species abundance data, i.e. the species-sample table,
but also for other data, such as metagenomic data.
HUMAnN [@Franzosa_Species_2018] is an excellent tool for functional profiling of metagenomes and metatranscriptomes at the species level.
The humann2meco() function can be used to create the microtable object from the metagenomic analysis files of HUMAnN3 (https://huttenhower.sph.harvard.edu/humann).
It can also be used for the whole-community profile of metabolic pathways when needed.
Currently, it directly supports both MetaCyc (https://metacyc.org/) and KEGG pathway abundance files as input.


```{r, echo = TRUE, eval = FALSE}
library(file2meco)
library(microeco)
library(magrittr)
?humann2meco
sample_file_path <- system.file("extdata", "example_metagenome_sample_info.tsv", package="file2meco")
match_file_path <- system.file("extdata", "example_metagenome_match_table.tsv", package="file2meco")

# MetaCyc pathway database based analysis
# use the raw data files stored inside the package for MetaCyc pathway database based analysis
abund_file_path <- system.file("extdata", "example_HUMAnN_MetaCyc_abund.tsv", package="file2meco")
# the default db is "MetaCyc"
humann2meco(abund_file_path, db = "MetaCyc")
humann2meco(abund_file_path, db = "MetaCyc", sample_table = sample_file_path, match_table = match_file_path)
# Let's try more interesting usages with microeco
test <- humann2meco(abund_file_path, db = "MetaCyc", sample_table = sample_file_path, match_table = match_file_path)
test$tidy_dataset()
# rel = FALSE: do not use relative abundance
test$cal_abund(select_cols = 1:3, rel = FALSE)
test$taxa_abund$Superclass1 %<>% .[!grepl("unclass", rownames(.)), ]
test1 <- trans_abund$new(test, taxrank = "Superclass1", ntaxa = 10)
test1$plot_bar(facet = "Group", ylab_title = "Abundance (RPK)")
# select both function and taxa
test$cal_abund(select_cols = c("Superclass1", "Phylum", "Genus"), rel = TRUE)
test1 <- trans_abund$new(test, taxrank = "Phylum", ntaxa = 10, delete_part_prefix = T)
test1$plot_bar(facet = "Group")
# functional biomarker
test$cal_abund(select_cols = 1:3, rel = TRUE)
test$taxa_abund$Superclass1 %<>% .[!grepl("unclass", rownames(.)), ]
test1 <- trans_diff$new(test, method = "lefse", group = "Group")
test1$plot_diff_bar(use_number = 1:20)
# taxa biomarker
test$cal_abund(select_cols = 4:9, rel = TRUE)
test$taxa_abund$Phylum %<>% .[!grepl("unclass", rownames(.)), ]
# p_adjust_method = NULL turns off the p value adjustment
test1 <- trans_diff$new(test, method = "lefse", group = "Group", p_adjust_method = NULL)
test1$plot_diff_bar(threshold = 2)
```
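
The row filter written with `%<>%` above can also be expressed without magrittr. A base R sketch on a small made-up abundance matrix (the row names are placeholders):

```{r, echo = TRUE, eval = FALSE}
# hypothetical abundance matrix with one unclassified row
abund <- matrix(1:6, nrow = 3,
    dimnames = list(c("Biosynthesis", "unclassified", "Degradation"), c("S1", "S2")))
# same idea as `%<>% .[!grepl("unclass", rownames(.)), ]`, written in base R
abund <- abund[!grepl("unclass", rownames(abund)), , drop = FALSE]
rownames(abund)
```

Using `drop = FALSE` keeps the result a matrix even when only one row remains.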


```{r, echo = TRUE, eval = FALSE}
# use KEGG pathway based HUMAnN result
abund_file_path <- system.file("extdata", "example_HUMAnN_KEGG_abund.tsv", package="file2meco")
test <- humann2meco(abund_file_path, db = "KEGG", sample_table = sample_file_path, match_table = match_file_path)
test$tax_table %<>% subset(Level.1 != "unclassified")
test$tidy_dataset()
```


## MetaPhlAn
MetaPhlAn is a software tool for metagenomic taxonomic profiling [@Truong_MeTApHLaN2_2015].
The format of MetaPhlAn classification results is usually called the 'mpa' format.
The mpa2meco function converts this format to a microtable object.
See the example in the following Kraken2 part.


## Kraken2
Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences.
Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. 
That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.
Kraken2 [@Wood_Improved_2019] is the newest version.
The merged Kraken2 results can be obtained by merge_metaphlan_tables.py from MetaPhlAn or combine_mpa.py from KrakenTools (https://ccb.jhu.edu/software/krakentools/).

```{r, echo = TRUE, eval = FALSE}
# the example is metagenomic classification result
# use the raw data files stored inside the package
abund_file_path <- system.file("extdata", "example_kraken2_merge.txt", package="file2meco")
sample_file_path <- system.file("extdata", "example_metagenome_sample_info.tsv", package="file2meco")
match_file_path <- system.file("extdata", "example_metagenome_match_table.tsv", package="file2meco")
mpa2meco(abund_file_path)
test <- mpa2meco(abund_file_path, sample_table = sample_file_path, match_table = match_file_path)
test$tidy_dataset()
```
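
In mpa-style tables, each feature is a lineage string whose ranks are separated by `|` and marked by a one-letter prefix. The base R sketch below splits one such string; the lineage is a made-up example, and the code is not taken from `mpa2meco`.

```{r, echo = TRUE, eval = FALSE}
# hypothetical mpa-style lineage string
lineage <- "k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales"
ranks <- strsplit(lineage, "|", fixed = TRUE)[[1]]
# split each "prefix__name" element into its rank prefix and taxon name
prefix <- sub("__.*$", "", ranks)
taxon <- sub("^[a-z]__", "", ranks)
setNames(taxon, prefix)
```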


## NCycDB

NCycDB database [@Tu_NCycDB_2018] is a curated integrative database for fast and accurate metagenomic profiling of nitrogen cycling genes.
The ncyc2meco() function is designed to construct the microtable object using gene abundance files from NCycDB (https://github.com/qichao1984/NCyc).


```{r, echo = TRUE, eval = FALSE}
library(file2meco)
library(microeco)
library(magrittr)
?ncyc2meco
# use the raw data files stored inside the package
abund_file_path <- system.file("extdata", "example_Ncyc_table.tsv", package="file2meco")
sample_file_path <- system.file("extdata", "example_metagenome_sample_info.tsv", package="file2meco")
match_file_path <- system.file("extdata", "example_metagenome_match_table.tsv", package="file2meco")
ncyc2meco(abund_file_path)
ncyc2meco(abund_file_path, sample_table = sample_file_path, match_table = match_file_path)
```

```{r, echo = TRUE, eval = FALSE}
# Let's try more interesting usages with microeco
test <- ncyc2meco(abund_file_path, sample_table = sample_file_path, match_table = match_file_path)
test$tidy_dataset()
# use split_group = TRUE to calculate the pathway abundance with multiple map correspondence
test$cal_abund(select_cols = 1:2, rel = TRUE, split_group = TRUE, split_column = "Pathway")
test$taxa_abund$Pathway %<>% .[!grepl("unclass", rownames(.)), ]
test1 <- trans_abund$new(test, taxrank = "Pathway")
test1$plot_bar(bar_type = "notfull")
# for gene abundance, no splitting on the pathways
test$cal_abund(select_cols = 1:2, rel = TRUE, split_group = FALSE)
test$taxa_abund$Gene %<>% .[!grepl("unclass", rownames(.)), ]
test1 <- trans_abund$new(test, taxrank = "Gene")
test1$plot_bar(bar_type = "notfull")
```
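
When one gene maps to several pathways, splitting means that its abundance is counted in each of those pathways. The sketch below illustrates this in base R. The gene names, the `&&` separator and the values are assumptions for the example and may not match the NCycDB files or the defaults of `cal_abund`.

```{r, echo = TRUE, eval = FALSE}
# hypothetical gene table; geneB maps to two pathways
gene_abund <- data.frame(
    Gene = c("geneA", "geneB", "geneC"),
    Pathway = c("Nitrification", "Denitrification&&Anammox", "Denitrification"),
    S1 = c(10, 4, 6),
    stringsAsFactors = FALSE
)
# one row per gene-pathway pair; a multi-mapped gene is counted in each pathway
pathways <- strsplit(gene_abund$Pathway, "&&", fixed = TRUE)
long <- data.frame(
    Pathway = unlist(pathways),
    S1 = rep(gene_abund$S1, times = lengths(pathways))
)
# sum the abundance within each pathway
pathway_abund <- tapply(long$S1, long$Pathway, sum)
pathway_abund
```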


## phyloseq
Two functions, meco2phyloseq() and phyloseq2meco(), are provided for the conversion between the microtable object and the phyloseq object of the phyloseq package [@Mcmurdie_phyloseq_2013].

```{r, echo = TRUE, eval = FALSE}
# Please first install phyloseq
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("phyloseq")
library(phyloseq)
```

```{r, echo = TRUE, eval = FALSE}
# from microtable to phyloseq object
data("dataset")
physeq <- meco2phyloseq(dataset)
physeq
```

```{r, echo = TRUE, eval = FALSE}
# from phyloseq to microtable object
data("GlobalPatterns")
meco_dataset <- phyloseq2meco(GlobalPatterns)
meco_dataset
```

## PICRUSt2

PICRUSt2 [@Douglas_PICRUSt2_2020] contains an updated and larger database of gene families and reference genomes compared to the original version of PICRUSt. 
We do not provide a special file conversion function for PICRUSt2, 
as it is very easy to convert the output pathway files of PICRUSt2 to a microtable object. 
Two example files of PICRUSt2 output in the file2meco package are used to show the operation.

```{r, echo = TRUE, eval = FALSE}
# MetaCyc pathway output
tmp_file_path <- system.file("extdata", "example_PICRUSt2_MetaCyc_path_abun_unstrat.tsv", package="file2meco")
pathway_table <- read.delim(tmp_file_path, row.names = 1)
data("MetaCyc_pathway_map")
tmp <- microtable$new(otu_table = pathway_table, tax_table = MetaCyc_pathway_map)
tmp$tidy_dataset()
tmp
```

```{r, echo = TRUE, eval = FALSE}
# KEGG pathway output
tmp_file_path <- system.file("extdata", "example_PICRUSt2_KEGG_path_abun_unstrat.tsv", package="file2meco")
pathway_table <- read.delim(tmp_file_path, row.names = 1)
data("Tax4Fun2_KEGG")
tmp <- microtable$new(otu_table = pathway_table, tax_table = Tax4Fun2_KEGG$ptw_desc)
tmp$tidy_dataset()
tmp
```
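
The key step in both examples is reading a tab-separated table with the feature IDs as row names. A self-contained base R sketch with a temporary file (the pathway IDs and values are made up):

```{r, echo = TRUE, eval = FALSE}
# write a tiny tab-separated table to a temporary file
tmp_file <- tempfile(fileext = ".tsv")
writeLines(c("pathway\tS1\tS2", "PWY-101\t1.5\t2.5", "PWY-5100\t0\t3"), tmp_file)
# the first column becomes the row names
pathway_table <- read.delim(tmp_file, row.names = 1)
pathway_table
# features in rows and samples in columns
dim(pathway_table)
```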


<!--chapter:end:07-file2meco.Rmd-->

# Other dataset

## Fungi data

Another ITS sequencing dataset [@Gao_Strong_2019] is also stored in the example data of the package.
Here, we use it as an example to show the use of the FUNGuild database [@Nguyen_FUNGuild_2016].
The FungalTraits [@Polme_FungalTraits_2020] database is also available for identifying fungal traits.


```{r, echo = TRUE, eval = FALSE}
library(microeco)
library(magrittr)
# load ITS data
data(sample_info_ITS)
data(otu_table_ITS)
data(taxonomy_table_ITS)
# create microtable object
meco_fungi <- microtable$new(sample_table = sample_info_ITS, otu_table = otu_table_ITS, tax_table = taxonomy_table_ITS)
# remove the taxa not assigned in the Kingdom "k__Fungi"
meco_fungi$tax_table %<>% base::subset(Kingdom == "k__Fungi")
# use tidy_dataset() to make OTUs and samples information consistent across files
meco_fungi$tidy_dataset()
# create trans_network object
t1 <- trans_network$new(dataset = meco_fungi, cal_cor = "WGCNA", taxa_level = "OTU", filter_thres = 0.000001, cor_method = "spearman")
# create correlation network 
t1$cal_network(COR_p_thres = 0.05, COR_cut = 0.6)
# add modules
t1$cal_module()
# convert module info to microtable object
meco_module <- t1$trans_comm(use_col = "module")
# create trans_func object
t2 <- trans_func$new(meco_module)
# identify species traits, automatically select database for prokaryotes or fungi
# fungi_database = "FungalTraits" for the FungalTraits database
t2$cal_spe_func(fungi_database = "FUNGuild")
# calculate abundance-unweighted functional redundancy of each trait for each network module
t2$cal_spe_func_perc(abundance_weighted = FALSE)
# plot the functional redundancy of network modules
t2$plot_spe_func_perc(order_x = paste0("M", 1:10))
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_func_perc_module_fungi.png")
```

## Metagenomic data

In the file2meco package part,
we provided several examples of transforming the output files of some widely used metagenomic tools (e.g. HUMAnN and Kraken2) to 
the microtable object directly.
Here, we show some detailed examples using KEGG pathway results.

```{r, echo = TRUE, eval = FALSE}
library(microeco)
library(file2meco)
library(magrittr)
?humann2meco
sample_file_path <- system.file("extdata", "example_metagenome_sample_info.tsv", package="file2meco")
match_file_path <- system.file("extdata", "example_metagenome_match_table.tsv", package="file2meco")
# use KEGG pathway based HUMAnN result
abund_file_path <- system.file("extdata", "example_HUMAnN_KEGG_abund.tsv", package="file2meco")
# match_table parameter can be used to replace sample names
test <- humann2meco(abund_file_path, db = "KEGG", sample_table = sample_file_path, match_table = match_file_path)
# remove the unclassified pathway in Level.1
test$tax_table %<>% subset(Level.1 != "unclassified")
test$tidy_dataset()
# rel = FALSE: do not use relative abundance; use the raw RPK
test$cal_abund(select_cols = 1:3, rel = FALSE)
test1 <- trans_abund$new(test, taxrank = "Level.2", ntaxa = 10)
test1$plot_bar(facet = "Group", ylab_title = "Abundance (RPK)", color_values = RColorBrewer::brewer.pal(12, "Set3"))
```

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_HUMANN_KEGG_bar.png")
```

Then, we select both function and taxa to see which taxa those highly abundant pathways come from.

```{r, echo = TRUE, eval = FALSE}
# This operation is more flexible
test$cal_abund(select_cols = c("Level.1", "Phylum", "Genus"), rel = FALSE)
test$taxa_abund$Level.1 %<>% .[!grepl("unclass", rownames(.)), ]
test$taxa_abund$Phylum %<>% .[!grepl("unclass", rownames(.)), ]
test1 <- trans_abund$new(test, taxrank = "Phylum", ntaxa = 10, delete_part_prefix = TRUE)
test1$plot_bar(facet = "Group", ylab_title = "Abundance (RPK)", color_values = RColorBrewer::brewer.pal(12, "Set3"))
```

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_HUMANN_KEGG_bar_taxafunc.png")
```

Let's run LEfSe to find some functional biomarkers to differentiate two groups.

```{r, echo = TRUE, eval = FALSE}
# functional biomarker
test$cal_abund(select_cols = 1:3, rel = TRUE)
test1 <- trans_diff$new(test, method = "lefse", group = "Group")
test1$plot_diff_bar(threshold = 3)
```

```{r, out.width = "700px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_HUMANN_KEGG_lefse_bar.png")
```


## Gut microbiome

We use the mouse gut data stored in the file2meco package to show the input of QIIME2 files and the use of the metabolic trait database NJC19 [@Lim_Large_2020].

```{r, echo = TRUE, eval = FALSE}
library(microeco)
library(file2meco)
library(ggplot2)

# use data files inside the file2meco package.
abund_file_path <- system.file("extdata", "dada2_table.qza", package="file2meco")
sample_file_path <- system.file("extdata", "sample-metadata.tsv", package="file2meco")
taxonomy_file_path <- system.file("extdata", "taxonomy.qza", package="file2meco")
# construct microtable object
data1 <- qiime2meco(abund_file_path, sample_table = sample_file_path, taxonomy_table = taxonomy_file_path)
data1$tidy_dataset()
# revise the species names in tax_table as the information in the example file is not standard
select_rows <- data1$tax_table$Species != "s__"
data1$tax_table$Species[select_rows] <- paste0("s__", gsub("g__", "", data1$tax_table$Genus[select_rows]), " ", gsub("s__", "", data1$tax_table$Species[select_rows]))
# taxonomic abundance
data1$cal_abund()

# create object of trans_func
data2 <- trans_func$new(data1)
# Select NJC19 database
data2$cal_spe_func(prok_database = "NJC19")
# get the trait percentage data
data2$cal_spe_func_perc(abundance_weighted = FALSE)
# insert the trait percentage result into taxa_abund of the microtable object
data1$taxa_abund$Trait <- as.data.frame(t(data2$res_spe_func_perc))
# use trans_abund to plot
t1 <- trans_abund$new(dataset = data1, taxrank = "Trait", ntaxa = 10, use_percentage = FALSE)
t1$plot_box(group = "donor_status") + ylab("Relative population abundance (%)") + theme(axis.text.x = element_text(size = 13))
```

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_mouse_NJC19.png")
```


```{r, echo = TRUE, eval = FALSE}
# differential abundance test of the traits percentage and use random forest to find biomarkers
t1 <- trans_diff$new(dataset = data1, method = "rf", group = "donor_status", taxa_level = "Trait")
t1$plot_diff_bar(use_number = 1:30)
```

```{r, out.width = "800px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_mouse_NJC19_rf.png")
```


<!--chapter:end:08-Otherdata.Rmd-->

# Notes
We show some other important points here.

## `clone` function
R6 classes use reference semantics, so copying an object works differently from S3 and S4.
**If you want to copy an object completely, you should use the function `clone` instead of direct assignment.**

```{r, echo = TRUE}
# use clone to copy completely
t1 <- clone(dataset)
t2 <- clone(t1)
t2$sample_table <- NULL
identical(t2, t1)
```

```{r, echo = TRUE}
# this operation is usually not useful, because changing t2 will also affect t1
t2 <- t1
t2$sample_table <- NULL
identical(t2, t1)
```
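
The difference between the two chunks above comes from reference semantics: an R6 object is built on an environment, so plain assignment only creates a second name for the same object.
The following is a minimal base R sketch that uses an ordinary environment as a stand-in for a microtable object.
The helper `shallow_copy` is hypothetical and is not part of microeco; it only illustrates the idea behind `clone`.

```{r, echo = TRUE, eval = FALSE}
# an environment standing in for an R6 object
obj <- new.env()
obj$sample_table <- data.frame(SampleID = c("S1", "S2"), Group = c("IW", "CW"))

# plain assignment: 'alias' and 'obj' point to the same environment
alias <- obj
alias$sample_table <- NULL
# the original is modified as well
changed_by_alias <- is.null(obj$sample_table)

# hypothetical helper: copy all bindings into a fresh environment
shallow_copy <- function(env) list2env(as.list(env, all.names = TRUE), envir = new.env())
obj$sample_table <- data.frame(SampleID = c("S1", "S2"), Group = c("IW", "CW"))
copy <- shallow_copy(obj)
copy$sample_table <- NULL
# the original is untouched this time
changed_by_copy <- is.null(obj$sample_table)
c(alias = changed_by_alias, copy = changed_by_copy)
```

In this sketch only the aliased object changes the original, which is why `clone` is needed whenever an independent copy is wanted.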


## `save` function

As an R6 object is an encapsulated environment, there is another useful advantage, i.e. data reproducibility.
By applying R6 classes, the analyzed data in the 'R Environment' can be kept with a minimum number of intermediate files.
It is also very convenient to save the analyzed data (i.e. the object) to the local computer with the `save` function for back-up and communication with others.
This is also the recommended way to send data for testing in 'Github Issues'.

```{r, echo = TRUE, eval = FALSE}
# save 'dataset' to local computer
save(dataset, file = "dataset.RData")
# load the data by others in their R project
load("dataset.RData")
```
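
One point to keep in mind is that `load` restores objects under their original names and overwrites any object with the same name in the workspace.
The sketch below, which uses a small list called `dataset_demo` as a hypothetical stand-in for a microtable object, loads the file into a separate environment so that nothing is overwritten.

```{r, echo = TRUE, eval = FALSE}
# a small list standing in for a microtable object
dataset_demo <- list(otu_table = data.frame(S1 = c(10, 0), S2 = c(3, 7), row.names = c("OTU1", "OTU2")))
rdata_file <- file.path(tempdir(), "dataset_demo.RData")
save(dataset_demo, file = rdata_file)

# load into a separate environment instead of the global workspace
loaded <- new.env()
load(rdata_file, envir = loaded)
ls(loaded)
# check that the restored object has the same content
same_content <- identical(loaded$dataset_demo, dataset_demo)
same_content
```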

The `dataset.RData` file can be compressed into zip format for uploading to 'Github Issues' (https://github.com/ChiLiubio/microeco/issues).
The saved object depends only on the package version used when it was saved and is not affected by a later package update when it is loaded again.
So it is reproducible.


## Github Issues

Please write the question/bug/suggestion in 'Github Issues' (https://github.com/ChiLiubio/microeco/issues).
Several points are recommended.

  + Before creating a 'New issue', please first skim through the 'Issues' including 'Open' and 'Closed'.
Then go ahead if no similar issue is found or the similar issue has not been addressed.

  + Please remember to provide your package version when writing a 'New issue'.

  + For a specific issue related to data, please provide the data and scripts needed to reproduce it.
If the data has not been published, please provide a toy example or a subset of the data as far as possible.

  + To provide the data, please use the `save` function as shown in the previous part (https://chiliubio.github.io/microeco_tutorial/notes.html#save-function), 
and upload the data compressed in zip format.

  + When pasting the script directly, please use \`\`\` at the beginning and end of the script to mark the code block for readability, 
as \# can cause format chaos without the code tag.


## change object
All the classes are set public, meaning that you can **change, add or remove the objects stored in them as you want**.

```{r, echo = TRUE}
# add a matrix you think useful
dataset$my_matrix <- matrix(1, nrow = 4, ncol = 4)
# change the information
dataset$sample_table %<>% .[, -2]
```

## group order
If the user wants to reorder the groups, **assigning factors may be the simplest way.**

```{r, echo = TRUE}
data(dataset)
t1 <- trans_beta$new(dataset = dataset, measure = "bray")
t1$cal_ordination(ordination = "PCoA")
t1$plot_ordination(plot_color = "Group")
```

Then we assign factors to the 'Group' in sample_table of dataset.
We can find the changed group order in the legend and colors in the plot.

```{r, echo = TRUE}
dataset$sample_table$Group %<>% factor(., levels = c("IW", "TW", "CW"))
str(dataset$sample_table)
t1 <- trans_beta$new(dataset = dataset, measure = "bray")
t1$cal_ordination(ordination = "PCoA")
t1$plot_ordination(plot_color = "Group")
```
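
The reason this works is that ggplot2 follows the order of factor levels for discrete scales, while a plain character vector is converted to a factor with alphabetically sorted levels.
A minimal base R sketch with a made-up `group` vector shows the difference:

```{r, echo = TRUE, eval = FALSE}
group <- c("CW", "IW", "TW", "IW", "CW", "TW")
# default: levels are sorted alphabetically
default_levels <- levels(factor(group))
default_levels
# explicit levels fix the order used by legends, axes and tables
group_ordered <- factor(group, levels = c("IW", "TW", "CW"))
levels(group_ordered)
table(group_ordered)
```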


## add layers to plot
Most of the plots are generated by applying the ggplot2 package.
The important parameters in the plotting functions are configured according to our experience.
If the built-in parameters are not enough, the user can add layers to the plot as in the following operation or
make the plot using the data (generally of data.frame class) stored in the object.


```{r, echo = TRUE, eval = FALSE}
# The groupmean parameter can be used to obtain the group-mean barplot.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10, groupmean = "Group")
g1 <- t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)
g1 + theme_classic() + theme(axis.title.y = element_text(size = 18))
```

```{r, out.width = "400px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_bar_mean_classic.png")
```

## colors for many groups

In general, the color palette in the plotting function is enough for the settings of groups or samples,
such as RColorBrewer::brewer.pal(12, "Paired").
If the user has too many groups (e.g. 20), the color_palette_20 can be used. It has 20 colors with paired combinations.
If this is still not enough, the randomcoloR package can be used for generating more colors.

```{r, echo = TRUE, eval = TRUE}
color_palette_20
```

```{r, echo = TRUE, eval = FALSE}
barplot(1:20, col = color_palette_20)
```
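
If installing an extra package is not desirable, base R can also generate a larger set of colors.
The sketch below is an alternative to randomcoloR, not what the package uses internally; it assumes R 3.6.0 or later, where `hcl.colors` is available in grDevices, and the group number 40 is arbitrary.

```{r, echo = TRUE, eval = FALSE}
n_groups <- 40
# qualitative HCL palette with evenly spaced hues
many_colors <- grDevices::hcl.colors(n_groups, palette = "Dark 3")
head(many_colors)
barplot(rep(1, n_groups), col = many_colors, border = NA, axes = FALSE)
```

The resulting vector can be passed to the color-related parameters of the plotting functions, e.g. `color_values`.
Neighbouring hues become hard to distinguish as the number of groups grows, so it is worth checking the plot by eye.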

## customized data input

From v0.7.0, trans_alpha, trans_env and trans_venn classes can accept NULL dataset input for some customized usage of some functions in those classes.

```{r, echo = TRUE, eval = FALSE}
t1 <- trans_env$new()
t1 <- trans_alpha$new(dataset = NULL, group = NULL)
```


<!--chapter:end:09-Notes.Rmd-->

# mecodev package
The mecodev package (https://github.com/ChiLiubio/mecodev/) is designed for more extended analysis approaches based on the microeco package.

## trans_rarefy
The class trans_rarefy in mecodev package can be used for the rarefaction and the following plotting to see whether 
the sequencing depth is enough to cover all the so-called species in the microbial community.

```{r, echo = TRUE, eval = FALSE}
library(microeco)
library(mecodev)
data(sample_info_16S)
data(otu_table_16S)
# set.seed is used to fix the random number generation to make the results repeatable
set.seed(123)
dataset <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S)
dataset$tidy_dataset()
# trans_rarefy class
t1 <- trans_rarefy$new(dataset, alphadiv = "Shannon", depth = c(0, 10, 50, 500, 2000, 4000, 6000, 8000))
t1$plot_rarefy(color = "Group", show_point = FALSE, add_fitting = TRUE)
t1$plot_rarefy(color_values = rep("grey", 100), show_point = TRUE, add_fitting = FALSE, show_legend = FALSE)
```

```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_trans_rarefy.png")
```
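
To make the idea of rarefaction concrete, the following base R sketch subsamples the reads of one hypothetical sample to several depths and counts the observed OTUs.
It is only an illustration of the principle, not the implementation of trans_rarefy; the counts and the helper `richness_at` are made up, and observed richness is used instead of the Shannon index for simplicity.

```{r, echo = TRUE, eval = FALSE}
set.seed(123)
# read counts of five hypothetical OTUs in one sample (1000 reads in total)
counts <- c(OTU1 = 500, OTU2 = 300, OTU3 = 150, OTU4 = 45, OTU5 = 5)
# expand to one entry per read
reads <- rep(names(counts), times = counts)
# hypothetical helper: observed richness after subsampling 'depth' reads without replacement
richness_at <- function(depth) length(unique(sample(reads, size = depth)))
depths <- c(10, 50, 200, 500, 1000)
richness <- vapply(depths, richness_at, integer(1))
data.frame(depth = depths, observed_richness = richness)
```

When the curve of such values levels off before the full depth is reached, deeper sequencing is unlikely to reveal many more taxa.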


## trans_norm
The class trans_norm provides several data normalization approaches for the microtable object.
The output is also a microtable object.

```{r, echo = TRUE, eval = FALSE}
library(microeco)
library(mecodev)
data(dataset)
test1 <- trans_norm$new(dataset = dataset)
# Centered log-ratio normalization
test2 <- test1$norm(method = "CLR")
# returned test2 is another microtable object
# Cumulative sum scaling normalization. Require metagenomeSeq package to be installed.
test2 <- test1$norm(method = "CCS")
# log transformation
test2 <- test1$norm(method = "log")
```
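
For readers unfamiliar with the centered log-ratio transformation, the following base R sketch shows the calculation on a toy count table.
It is not the code used by trans_norm; in particular, the pseudocount of 1 used to handle zeros is an assumption made here, and the package may treat zeros differently.

```{r, echo = TRUE, eval = FALSE}
# toy count table: rows are samples, columns are taxa
counts <- matrix(c(10, 0, 5,
                   20, 4, 1), nrow = 2, byrow = TRUE,
                 dimnames = list(c("S1", "S2"), c("OTU1", "OTU2", "OTU3")))
# zeros have no logarithm; adding a pseudocount of 1 is one common workaround (assumption)
log_counts <- log(counts + 1)
# subtract the per-sample mean of the log values, i.e. the log of the geometric mean
clr <- sweep(log_counts, 1, rowMeans(log_counts), FUN = "-")
clr
# the CLR values of each sample sum to zero (up to floating point error)
rowSums(clr)
```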


## trans_ts

The class trans_ts is designed for the time series data analysis.
A commonly used approach for modeling time series data in microbial ecology is the generalized Lotka-Volterra (gLV) model, an extension of the classical predator-prey system.
gLV models are based on ordinary differential equations that model the logistic growth of species; 
naturally capture predator-prey, amensalistic, and competitive interactions; and have been applied to study dynamics of microbial ecosystems.
More importantly, from a practical perspective, gLV models have been used for a range of applications including identifying potential probiotics
against pathogens, forecasting changes in microbial density, characterizing important community members (e.g., keystone species), 
and analyzing community stability (see [@Li_expectation_2019] and the references therein).
Currently, the biomass estimation and biological interaction prediction approaches are implemented based on the beem package [@Li_expectation_2019].
The example data 'gut_microb_ts' comes from the article [@Gibbons_twodynamic_2017].
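
As a reminder of what a gLV model describes, the abundance $x_i$ of taxon $i$ is commonly written as

$$\frac{dx_i}{dt} = x_i \left( \mu_i + \sum_j \beta_{ij} x_j \right)$$

where $\mu_i$ is the growth rate and $\beta_{ij}$ is the effect of taxon $j$ on taxon $i$.
The following base R sketch simulates this equation forward in time for two competing taxa with a simple Euler scheme.
All parameter values are hypothetical, and the sketch only simulates a known model; it does not estimate parameters from data as the beem package does.

```{r, echo = TRUE, eval = FALSE}
# hypothetical parameters for two competing taxa
mu <- c(1.0, 0.8)                                      # growth rates
beta <- matrix(c(-1.0, -0.5,
                 -0.3, -1.0), nrow = 2, byrow = TRUE)  # interaction matrix
x <- c(0.1, 0.1)                                       # initial abundances
dt <- 0.01
n_steps <- 5000
trajectory <- matrix(NA_real_, nrow = n_steps + 1, ncol = 2)
trajectory[1, ] <- x
for (i in seq_len(n_steps)) {
  # forward Euler step of dx/dt = x * (mu + beta %*% x)
  x <- x + dt * x * (mu + as.vector(beta %*% x))
  trajectory[i + 1, ] <- x
}
final_abund <- trajectory[n_steps + 1, ]
# coexistence equilibrium obtained from mu + beta %*% x = 0
equilibrium <- solve(-beta, mu)
rbind(simulated = final_abund, equilibrium = equilibrium)
matplot(trajectory, type = "l", xlab = "Step", ylab = "Abundance")
```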

```{r, echo = TRUE, eval = FALSE}
# install the necessary packages
# For windows system:
install.packages("doMC", repos = "http://R-Forge.R-project.org")
# For linux or mac
install.packages("doMC")
# Then install the following packages
install.packages("lokern")
install.packages("monomvn")
install.packages("pspline")
devtools::install_github('csb5/beem')
```

```{r, echo = TRUE, eval = FALSE}
library(mecodev)
# load the example data in mecodev package; the input must be a microtable object
# There are several strict requirements on the sample_table; see the document of the class.
data("gut_microb_ts")
# generally, using filter_thres to filter the taxa with low abundance is crucial
# there are only 22 taxa in the example data, so we use 0
t1 <- trans_ts$new(dataset = gut_microb_ts, filter_thres = 0)
# use at least 30 iterations
t1$cal_biomass(min_iter = 30)
# return t1$res_biomass and t1$res_param
# t2 is a trans_network object
t2 <- t1$cal_network()
# use cluster_optimal; as the default cluster_fast_greedy can not be used for the directed network
t2$cal_module(method = "cluster_optimal")
```


## trans_gamma

The class trans_gamma is developed to explore the relationship between gamma diversity and beta diversity 
based on the methods from biogeographic studies [@Zhang_Local_2020].
Currently, the contents include the observed beta-gamma diversity relationship, the simulated beta-gamma diversity relationship and the following plotting.
If the observed gamma diversity and beta diversity are significantly correlated,
the species pool at the regional scale (or maybe your defined scale, e.g., different treatments in the lab) can have a large effect on the beta diversity.
Thus, the species pool should be considered first to explain beta diversity patterns.
This class also provides a simulation function to explore the relationship between gamma diversity and beta diversity in the absence of any process
other than random sampling based on the species log-normal distribution.
We use the wetland data to show the observed beta-gamma diversity relationship.


```{r, echo = TRUE, eval = FALSE}
library(microeco)
library(mecodev)
# load the example data
data(sample_info_16S)
data(otu_table_16S)
test <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S)
test$tidy_dataset()
test$rarefy_samples(sample.size = 10000)
# then create trans_gamma object
test1 <- trans_gamma$new(dataset = test, group = "Type", method = "bray")
test1$cal_observed(sample_size = NULL)
test1$res_observed
# use Spearman correlation
test1$plot_observed(cor_method = "spearman")
```

```{r, out.width = "550px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_gamma_obs.png")
```

Let's simulate the relation between gamma diversity and beta diversity in the absence of any process
other than random sampling based on the species log-normal distribution.

```{r, echo = TRUE, eval = FALSE}
# if you only run the simulation, dataset parameter is not necessary
test1 <- trans_gamma$new(method = "bray")
# use individual numbers of 200, 1000 and 2000, and assume each species pool has 20 samples.
test1$cal_simulation(ncom = 20, ind_vect = c(200, 1000, 2000))
test1$plot_simulation(add_fitting = FALSE)
```

```{r, out.width = "600px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_gamma_simu.png")
```
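
The logic of such a null simulation can be sketched in base R as follows.
This is only an illustration under our own assumptions, not the algorithm of `cal_simulation`: gamma diversity is taken as the number of species observed across the sampled communities, beta diversity as the mean pairwise Bray-Curtis dissimilarity, and the helpers `bray` and `simulate_pool` as well as the pool sizes are hypothetical.

```{r, echo = TRUE, eval = FALSE}
set.seed(1)
# hypothetical helper: Bray-Curtis dissimilarity between two count vectors
bray <- function(a, b) sum(abs(a - b)) / sum(a + b)
# hypothetical helper: draw 'ncom' communities of 'n_ind' individuals from a
# log-normal species pool and return gamma diversity and mean beta diversity
simulate_pool <- function(pool_size, ncom = 20, n_ind = 200) {
  pool_abund <- rlnorm(pool_size)
  draw_one <- function() {
    ind <- sample.int(pool_size, size = n_ind, replace = TRUE, prob = pool_abund)
    tabulate(ind, nbins = pool_size)
  }
  comm <- t(replicate(ncom, draw_one()))   # communities in rows
  pairs <- combn(ncom, 2)
  beta <- mean(apply(pairs, 2, function(p) bray(comm[p[1], ], comm[p[2], ])))
  c(gamma = sum(colSums(comm) > 0), beta = beta)
}
pool_sizes <- c(50, 200, 800)
res <- t(sapply(pool_sizes, simulate_pool))
res
```

Because random sampling is the only process in this sketch, any relationship between the two columns of `res` arises from the size of the species pool alone.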


<!--chapter:end:10-mecodev.Rmd-->

# meconetcomp package

To facilitate microbial co-occurrence network comparison, 
R package meconetcomp (https://github.com/ChiLiubio/meconetcomp) was developed.
The basic strategy is to put all the trans_network objects into an R list and then perform the following analyses on the list.
The flow chart and following contents are adapted from the published article [@Liu_meconetcomp_2023].
For more details, please see the online paper (https://doi.org/10.1002/imt2.71).


```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/meconetcomp.png")
```

```{r, echo = TRUE, eval = FALSE}
# install the required packages
# aplot: one dependency of the trans_venn class of microeco package
# agricolae: for Duncan's new multiple range test
packages <- c("meconetcomp", "rgexf", "pheatmap", "aplot", "agricolae")
# Now check or install
for(x in packages){
	if(!require(x, character.only = TRUE)) {
		install.packages(x, dependencies = TRUE)
	}
}
```

```{r, echo = TRUE, eval = FALSE}
library(microeco)
library(meconetcomp)
# use pipe operator in magrittr package
library(magrittr)
library(igraph)
library(ggplot2)
# load soil amplicon sequencing dataset
data(soil_amp)
```

First, we reconstruct three correlation networks for the three groups 'IW', 'TW' and 'CW' as an example.

```{r, echo = TRUE, eval = FALSE}
# first create a list
soil_amp_network <- list()
# select samples of "IW" group
# use clone to get a deep copy of soil_amp (R6 object)
tmp <- clone(soil_amp)
# change sample_table directly
tmp$sample_table %<>% subset(Group == "IW")
# trim all files in the object
tmp$tidy_dataset()
# use filter_thres parameter to filter the feature with low relative abundance
tmp <- trans_network$new(dataset = tmp, cor_method = "spearman", filter_thres = 0.0005)
# COR_p_thres represents the p value threshold
# COR_cut denotes the correlation coefficient threshold
tmp$cal_network(COR_p_thres = 0.01, COR_cut = 0.6)
# put the network into the list
soil_amp_network$IW <- tmp
# select samples of "TW" group
tmp <- clone(soil_amp)
tmp$sample_table %<>% subset(Group == "TW")
tmp$tidy_dataset()
tmp <- trans_network$new(dataset = tmp, cor_method = "spearman", filter_thres = 0.0005)
tmp$cal_network(COR_p_thres = 0.01, COR_cut = 0.6)
soil_amp_network$TW <- tmp
# select samples of "CW" group
tmp <- clone(soil_amp)
tmp$sample_table %<>% subset(Group == "CW")
tmp$tidy_dataset()
tmp <- trans_network$new(dataset = tmp, cor_method = "spearman", filter_thres = 0.0005)
tmp$cal_network(COR_p_thres = 0.01, COR_cut = 0.6)
soil_amp_network$CW <- tmp
# Now we have the list soil_amp_network
```
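
To clarify the roles of the two thresholds, the following base R sketch builds an edge table from a toy abundance matrix by keeping only feature pairs with a strong and significant Spearman correlation.
It is a simplified illustration and not the code of `cal_network`; the simulated data are made up, and any multiple-testing correction that the package may apply to the p values is omitted here.

```{r, echo = TRUE, eval = FALSE}
set.seed(42)
# toy abundance table: 30 samples (rows) x 6 features (columns)
abund <- matrix(rpois(30 * 6, lambda = 20), nrow = 30, dimnames = list(NULL, paste0("OTU", 1:6)))
# make OTU2 closely follow OTU1 so that at least one edge is expected
abund[, "OTU2"] <- abund[, "OTU1"] * 2 + rpois(30, lambda = 1)
COR_cut <- 0.6
COR_p_thres <- 0.01
# all feature pairs
pairs <- t(combn(colnames(abund), 2))
edges <- data.frame(node1 = pairs[, 1], node2 = pairs[, 2], stringsAsFactors = FALSE)
edges$cor <- apply(pairs, 1, function(p) cor(abund[, p[1]], abund[, p[2]], method = "spearman"))
edges$p <- apply(pairs, 1, function(p) {
  # exact = FALSE avoids warnings caused by ties in count data
  cor.test(abund[, p[1]], abund[, p[2]], method = "spearman", exact = FALSE)$p.value
})
# keep only strong and significant correlations
edges <- edges[abs(edges$cor) >= COR_cut & edges$p < COR_p_thres, ]
edges
```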


## Network modularity for all networks

The function cal_module in meconetcomp package is designed to partition modules for all the networks in the list.

```{r, echo = TRUE, eval = FALSE}
soil_amp_network %<>% cal_module(undirected_method = "cluster_fast_greedy")
```

## Network topological attributes for all networks

We extracted all the res_network_attr tables in the networks and merged them into one final table by using the cal_network_attr function in the meconetcomp package.

```{r, echo = TRUE, eval = FALSE}
tmp <- cal_network_attr(soil_amp_network)
# tmp is a data.frame object
```

## Node and edge properties extraction for all networks

The get_node_table and get_edge_table functions of meconetcomp package can be used to directly extract node and edge properties for all the networks.
The returned table is stored in each network object.

```{r, echo = TRUE, eval = FALSE}
soil_amp_network %<>% get_node_table(node_roles = TRUE) %>% get_edge_table
```

## Compare nodes across networks

The nodes in all the networks can be converted to a new microtable object by using the node_comp function of meconetcomp package.
Then, it is easy to analyse the overlap of nodes with the trans_venn class.

```{r, echo = TRUE, eval = FALSE}
# obtain the node distributions by searching the res_node_table in the object
tmp <- node_comp(soil_amp_network, property = "name")
# obtain nodes intersection
tmp1 <- trans_venn$new(tmp, ratio = "numratio")
g1 <- tmp1$plot_venn(fill_color = FALSE)
ggsave("soil_amp_node_overlap.pdf", g1, width = 7, height = 6)
# calculate jaccard distance to reflect the overall differences of networks
tmp$cal_betadiv(method = "jaccard")
tmp$beta_diversity$jaccard
```
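
The Jaccard distance used here only depends on which nodes are present in each network.
A minimal base R sketch with made-up node names shows the calculation; the helper `jaccard_dist` is hypothetical and is written out only to make the formula explicit.

```{r, echo = TRUE, eval = FALSE}
# hypothetical node names of three networks
nodes <- list(
  IW = c("OTU1", "OTU2", "OTU3", "OTU4"),
  TW = c("OTU3", "OTU4", "OTU5"),
  CW = c("OTU1", "OTU6")
)
# hypothetical helper: 1 - (shared nodes / all nodes found in either network)
jaccard_dist <- function(a, b) 1 - length(intersect(a, b)) / length(union(a, b))
net_names <- names(nodes)
dist_mat <- sapply(net_names, function(i) {
  sapply(net_names, function(j) jaccard_dist(nodes[[i]], nodes[[j]]))
})
dist_mat
```

In this toy example 'IW' and 'TW' share two of five nodes, giving a distance of 0.6, while 'TW' and 'CW' share none, giving a distance of 1.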

## Compare edges across networks

The pipeline for studying the overlap of edges is similar to the above operations of node comparison.
The edge_comp function of meconetcomp package is used to convert edges distribution to a new microtable object.

```{r, echo = TRUE, eval = FALSE}
# get the edge distributions across networks
tmp <- edge_comp(soil_amp_network)
# obtain edges intersection
tmp1 <- trans_venn$new(tmp, ratio = "numratio")
g1 <- tmp1$plot_venn(fill_color = FALSE)
ggsave("soil_amp_edge_overlap.pdf", g1, width = 7, height = 6)
# calculate jaccard distance
tmp$cal_betadiv(method = "jaccard")
tmp$beta_diversity$jaccard
```

## Extract overlapped edges of networks to a new network

Then we extracted the subset of edges according to the intersections of edges across networks,
which can be accomplished with the subset_network function in meconetcomp package.


```{r, echo = TRUE, eval = FALSE}
# first obtain edges distribution and intersection
tmp <- edge_comp(soil_amp_network)
tmp1 <- trans_venn$new(tmp)
# convert intersection result to a microtable object
tmp2 <- tmp1$trans_comm()
# extract the intersection of all the three networks ("IW", "TW" and "CW")
# please use colnames(tmp2$otu_table) to find the required name
Intersec_all <- subset_network(soil_amp_network, venn = tmp2, name = "IW&TW&CW")
# Intersec_all is a trans_network object
# for example, save Intersec_all as gexf format
Intersec_all$save_network("Intersec_all.gexf")
```

## Compare phylogenetic distances of paired nodes in edges

The edge_node_distance class (R6 class) in meconetcomp package is designed to compare the distribution of distance values of paired nodes in all the edges across networks.
Here, we indicated the phylogenetic distance distributions and performed the differential test among networks.
The input parameter dis_matrix can be any symmetric matrix with both the column names and row names (i.e. feature names).
So it is also feasible to compare other properties of features, such as Levins' niche overlap.


```{r, echo = TRUE, eval = FALSE}
# filter useless features to speed up the calculation
node_names <- unique(unlist(lapply(soil_amp_network, function(x){colnames(x$data_abund)})))
filter_soil_amp <- microeco::clone(soil_amp)
filter_soil_amp$otu_table <- filter_soil_amp$otu_table[node_names, ]
filter_soil_amp$tidy_dataset()
# obtain phylogenetic distance matrix
phylogenetic_distance_soil <- as.matrix(cophenetic(filter_soil_amp$phylo_tree))
# use both the positive and negative labels
tmp <- edge_node_distance$new(network_list = soil_amp_network, dis_matrix = phylogenetic_distance_soil, label = c("+", "-"))
tmp$cal_diff(method = "anova")
# visualization
g1 <- tmp$plot(boxplot_add = "none", add_sig = TRUE, add_sig_text_size = 5) + ylab("Phylogenetic distance")
ggsave("soil_amp_phylo_distance.pdf", g1, width = 7, height = 6)

# show different modules with at least 10 nodes and positive edges
tmp <- edge_node_distance$new(network_list = soil_amp_network, dis_matrix = phylogenetic_distance_soil, 
	label = "+", with_module = TRUE, module_thres = 10)
tmp$cal_diff(method = "anova")
g1 <- tmp$plot(boxplot_add = "none", add_sig = TRUE, add_sig_text_size = 5) + ylab("Phylogenetic distance")
ggsave("soil_amp_phylo_distance_modules.pdf", g1, width = 8, height = 6)
```


## Compare node sources of edges across networks

To know which taxa constitute the nodes in edges is important in understanding species co-occurrence patterns and answering ecological questions.
In this part, as an instance, we used edge_tax_comp function of meconetcomp package to get the sums of node sources (at Phylum level) in the positive edges.
In other words, we count how many linked nodes of positive edges come from different phyla or from the same phylum.
Then, to make the results comparable, the ratio was calculated with the positive edge number as denominator.

```{r, echo = TRUE, eval = FALSE}
soil_amp_network_edgetax <- edge_tax_comp(soil_amp_network, taxrank = "Phylum", label = "+", rel = TRUE)
# filter the features with small number
soil_amp_network_edgetax <- soil_amp_network_edgetax[apply(soil_amp_network_edgetax, 1, mean) > 0.01, ]
# visualization
g1 <- pheatmap::pheatmap(soil_amp_network_edgetax, display_numbers = TRUE)
ggsave("soil_amp_edge_tax_comp.pdf", g1, width = 7, height = 7)
```

## Compare topological properties of sub-networks

In this part, we extracted the sub-networks according to the OTUs present in each sample of the soil_amp dataset for each network in soil_amp_network.
Then, the global topological properties of sub-networks were calculated.
All the operations were encapsulated into the subnet_property function of meconetcomp package.

```{r, echo = TRUE, eval = FALSE}
# calculate global properties of all sub-networks
tmp <- subnet_property(soil_amp_network)
# then prepare the data for the correlation analysis
# use sample names (second column) as rownames
rownames(tmp) <- tmp[, 2]
# delete first two columns (network name and sample name)
tmp <- tmp[, -c(1:2)]
# load ready-made abiotic factor and diversity table
data(soil_measure_diversity)
tmp1 <- trans_env$new(dataset = soil_amp, add_data = soil_measure_diversity)
tmp1$cal_cor(use_data = "other", by_group = "Group", add_abund_table = tmp, cor_method = "spearman")
# generate correlation heatmap
g1 <- tmp1$plot_cor()
ggsave("soil_amp_subnet_property.pdf", g1, width = 11, height = 5)
```


<!--chapter:end:11-meconetcomp.Rmd-->

# mecoturn package

The mecoturn package (https://github.com/ChiLiubio/mecoturn) was originally designed to study microbial turnover along plant compartments.
Generally speaking, it can be used in any case associated with a gradient. 


```{r, echo = TRUE, eval = FALSE}
# install the required packages
packages <- c("mecoturn", "agricolae", "lmerTest", "betareg")
for(x in packages){
	if(!require(x, character.only = TRUE)) {
		install.packages(x, dependencies = TRUE)
	}
}
```

## betaturn class

The betaturn class is used to study the turnover of communities, i.e. beta diversity along a gradient [@Anderson_Navigating_2011].
The pipeline consists of the steps of dissimilarity matrix generation, matrix conversion, differential test and visualization.
Creating a betaturn object generates the dissimilarity matrix, which is stored in the object.
The example data `wheat_16S` is the amplicon sequencing data of 16S rDNA of wheat samples from different regions.
In `wheat_16S$sample_table`, the `Type` column represents the niches, i.e. plant compartments, including S (bulk soil), RS (rhizosphere soil) and R (root endophyte).
The `Plant_ID` column denotes each unique plant individual that the samples belong to.
The `Region` column means the sampling region of those plants. Each region has five plant individuals.

```{r, echo = TRUE, eval = FALSE}
library(mecoturn)
library(magrittr)
data(wheat_16S)
# see the help document of betaturn
?betaturn
# use 'bray', i.e. Bray-Curtis dissimilarity, as the example
b1 <- betaturn$new(wheat_16S, measure = "bray")
```

```{r, echo = FALSE, eval = TRUE}
suppressMessages(library(mecoturn))
library(magrittr)
data(wheat_16S)
b1 <- betaturn$new(wheat_16S, measure = "bray")
```

The matrix conversion is the core step in the betaturn class.
It has two cases: `within_group = TRUE` and `within_group = FALSE`.
The case `within_group = FALSE` means the distance conversion is carried out between any two elements of `Type` column (`group` parameter).
The parameter `by_group` is used to constrain the conversion, i.e. the conversion is performed for each plant.

```{r, echo = TRUE, eval = TRUE}
b1$cal_group_distance(group = "Type", within_group = FALSE, by_group = "Plant_ID")
head(b1$res_group_distance)
```

The combinations of `Type` may not be well organized. We use the `ordered_group` parameter to sort them and the `sep` parameter to assign the direction.

```{r, echo = TRUE, eval = TRUE}
b1$cal_group_distance(group = "Type", ordered_group = c("S", "RS", "R"), within_group = FALSE, by_group = "Plant_ID", sep = " -> ")
head(b1$res_group_distance)
```
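
To make the conversion more concrete, the following base R sketch extracts, for each plant, the dissimilarities between ordered pairs of compartments from a toy matrix.
It is a simplified illustration under our own assumptions and not the implementation of `cal_group_distance`; the sample table and the random dissimilarity values are made up.

```{r, echo = TRUE, eval = FALSE}
# toy sample table: two plants with three compartments each
sample_table <- data.frame(
  SampleID = paste0("s", 1:6),
  Type = rep(c("S", "RS", "R"), times = 2),
  Plant_ID = rep(c("P1", "P2"), each = 3),
  stringsAsFactors = FALSE
)
# toy symmetric dissimilarity matrix (random values, for illustration only)
set.seed(7)
d <- as.matrix(dist(matrix(runif(6 * 4), nrow = 6)))
dimnames(d) <- list(sample_table$SampleID, sample_table$SampleID)

ordered_group <- c("S", "RS", "R")
# columns are the ordered pairs: S-RS, S-R and RS-R
type_pairs <- combn(ordered_group, 2)
res <- do.call(rbind, lapply(unique(sample_table$Plant_ID), function(plant) {
  sub <- sample_table[sample_table$Plant_ID == plant, ]
  from <- sub$SampleID[match(type_pairs[1, ], sub$Type)]
  to <- sub$SampleID[match(type_pairs[2, ], sub$Type)]
  data.frame(
    Plant_ID = plant,
    Type = paste(type_pairs[1, ], type_pairs[2, ], sep = " -> "),
    # index the matrix with pairs of sample names
    Value = d[cbind(from, to)],
    stringsAsFactors = FALSE
  )
}))
res
```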

The `cal_group_distance_diff` function can be used to perform the statistical analysis.

```{r, echo = TRUE, eval = FALSE}
# Wilcoxon rank sum test
b1$cal_group_distance_diff(method = "wilcox")
# paired Wilcoxon test by adding by_ID
b1$cal_group_distance_diff(method = "wilcox", by_ID = "Plant_ID")
# box plot
b1$plot_group_distance()
```
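
The difference between the unpaired and the paired test can be seen with base R's `wilcox.test` on made-up values.
The numbers below are hypothetical distances for five plants and are not taken from `wheat_16S`.

```{r, echo = TRUE, eval = FALSE}
# hypothetical distances of five plants for two compartment pairs
s_rs <- c(P1 = 0.42, P2 = 0.55, P3 = 0.48, P4 = 0.61, P5 = 0.50)
s_r  <- c(P1 = 0.47, P2 = 0.66, P3 = 0.56, P4 = 0.75, P5 = 0.67)
# unpaired: the Wilcoxon rank sum test ignores which plant each value comes from
p_unpaired <- wilcox.test(s_rs, s_r)$p.value
# paired: the Wilcoxon signed rank test uses the within-plant differences
p_paired <- wilcox.test(s_rs, s_r, paired = TRUE)$p.value
c(unpaired = p_unpaired, paired = p_paired)
```

With only five pairs, the smallest possible two-sided p value of the exact signed rank test is 2/2^5 = 0.0625, which is reached here because all plants change in the same direction.
So a small number of individuals per group limits the power of the paired test.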

The parameter `add_cols` can be used to add other columns in `sample_table` to the final table when `within_group = FALSE`.
This is especially useful when the statistical analysis will be performed for different groups.

```{r, echo = TRUE, eval = FALSE}
b1$cal_group_distance(group = "Type", ordered_group = c("S", "RS", "R"), within_group = FALSE, by_group = "Plant_ID", sep = " -> ", add_cols = "Region")
head(b1$res_group_distance)
# Wilcoxon test for each region
b1$cal_group_distance_diff(method = "wilcox", by_group = "Region")
b1$plot_group_distance()
```

Then let's use a linear mixed-effects model to check the random effects of plants.
For the usages of the `formula` in linear mixed-effects model (lme4 package), see https://mspeekenbrink.github.io/sdam-r-companion/linear-mixed-effects-models.html

```{r, echo = TRUE, eval = FALSE}
b1$cal_group_distance(group = "Type", ordered_group = c("S", "RS", "R"), within_group = FALSE, by_group = "Plant_ID", sep = " -> ", add_cols = "Region")
# random intercept and slope of Region
b1$cal_group_distance_diff(method = "lme", formula = "Type + (1 + Type | Region)")
# random intercept of Plant_ID
b1$cal_group_distance_diff(method = "lme", formula = "Type + (1 | Plant_ID)")
# random intercept of Plant_ID nested in Region
b1$cal_group_distance_diff(method = "lme", formula = "Type + (1 | Region/Plant_ID)")
b1$res_group_distance_diff
```

The other case, `within_group = TRUE`, means the distance conversion is performed according to each element of the group.
For example, when `group = "Region"`, any two samples within an element of the `"Region"` column will be combined.
In this scenario, the `by_group` parameter is designed to generate the additional combinations related to other columns.

```{r, echo = TRUE, eval = FALSE}
b1$cal_group_distance(within_group = TRUE, group = "Region", by_group = c("Type", "Plant_ID"))
head(b1$res_group_distance)
# filter the combinations not coming from the same plant
all_plants <- unique(wheat_16S$sample_table$Plant_ID)
b1$res_group_distance %<>% .[.$Plant_ID %in% paste0(all_plants, " vs ", all_plants), ]
# compare values across regions
b1$cal_group_distance_diff(method = "anova")
b1$plot_group_distance(add_sig_text_size = 5.6, xtext_angle = 30)
```

## taxaturn class

The taxaturn class refers to microbial abundance change along a gradient.
Its workflow includes the taxonomic abundance calculation, abundance transformation, abundance change summary, statistical analysis and visualization.

```{r, echo = TRUE, eval = TRUE}
# study the abundance change of phyla by comparing the mean abundances without the consideration of plants
t1 <- taxaturn$new(wheat_16S, taxa_level = "Phylum", group = "Type", ordered_group = c("S", "RS", "R"), by_ID = NULL, by_group = NULL, filter_thres = 0.001)
head(t1$res_change)
t1$cal_diff(method = "wilcox", p_adjust_method = NULL)
head(t1$res_change)
# compare the abundance change by summarizing the results for each plant individual
t1 <- taxaturn$new(wheat_16S, taxa_level = "Phylum", group = "Type", ordered_group = c("S", "RS", "R"), by_ID = "Plant_ID", by_group = NULL, filter_thres = 0.001)
head(t1$res_change)
# paired Wilcoxon test
t1$cal_diff(method = "wilcox", p_adjust_method = NULL)
head(t1$res_change)
```

The above operations can also be applied to different groups.

```{r, echo = TRUE, eval = FALSE}
# for each region without the consideration of plants
t1 <- taxaturn$new(wheat_16S, taxa_level = "Phylum", group = "Type", ordered_group = c("S", "RS", "R"), by_ID = NULL, by_group = "Region", filter_thres = 0.001)
head(t1$res_change)
t1$cal_diff(method = "wilcox", p_adjust_method = NULL)
head(t1$res_change)
# for each region with the consideration of plants
t1 <- taxaturn$new(wheat_16S, taxa_level = "Phylum", group = "Type", ordered_group = c("S", "RS", "R"), by_ID = "Plant_ID", by_group = "Region", filter_thres = 0.001)
head(t1$res_change)
# paired Wilcoxon test
t1$cal_diff(method = "wilcox", p_adjust_method = NULL)
head(t1$res_change)
t1$plot(point_size = 2)
```

Next, let's fit linear mixed-effects models.

```{r, echo = TRUE, eval = FALSE}
t1 <- taxaturn$new(wheat_16S, taxa_level = "Phylum", group = "Type", ordered_group = c("S", "RS", "R"), by_ID = "Plant_ID", by_group = "Region")
t1$cal_diff(method = "lme", formula = "Type + (1 + Type | Region)", measure = "p__Proteobacteria")
t1$res_diff
```

For fitting relative abundance, common parametric tests such as ANOVA are not a good choice, since the dependent variable is proportion data,
which is generally neither normally distributed nor homoscedastic.
Beta regression is suitable for proportion data (bounded between 0 and 1).
It can be conducted with the betareg function in the betareg package [@Betaregression_2010] and has been implemented in the mecoturn package.
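
To make the heteroscedasticity of proportion data concrete, here is a small base R sketch. It assumes the mean-precision parameterization of the beta distribution (mean `mu`, precision `phi`) that is commonly used in beta regression. The helper `beta_var` and all values are made up for illustration, and the betareg package is not needed to run it.

```{r, echo = TRUE, eval = FALSE}
# mean-precision parameterization: shape1 = mu * phi, shape2 = (1 - mu) * phi
# hypothetical helper returning the theoretical variance of such a beta distribution
beta_var <- function(mu, phi) mu * (1 - mu) / (1 + phi)

mu <- c(0.01, 0.10, 0.50)	# hypothetical mean relative abundances
phi <- 20	# hypothetical precision
# the variance changes with the mean, i.e. the data are heteroscedastic
data.frame(mu = mu, variance = beta_var(mu, phi))

# compare with the variance of simulated proportions
set.seed(1)
sim_var <- sapply(mu, function(m) var(rbeta(1e5, m * phi, (1 - m) * phi)))
# ratios close to 1 indicate agreement with the theoretical variance
round(sim_var / beta_var(mu, phi), 2)
```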


```{r, echo = TRUE, eval = FALSE}
# for all samples
t1 <- taxaturn$new(wheat_16S, taxa_level = "Phylum", group = "Type", ordered_group = c("S", "RS", "R"), by_ID = "Plant_ID", by_group = NULL, filter_thres = 0.001)
t1$cal_diff(method = "betareg", formula = "Type")
t1$res_diff
```

```{r, echo = TRUE, eval = FALSE}
# for different regions
t1 <- taxaturn$new(wheat_16S, taxa_level = "Phylum", group = "Type", ordered_group = c("S", "RS", "R"), by_ID = "Plant_ID", by_group = "Region", filter_thres = 0.001)
# measure can designate a taxon
t1$cal_diff(method = "betareg", formula = "Type", measure = "p__Proteobacteria")
t1$res_diff
t1$plot(point_size = 2, select_taxon = "p__Proteobacteria", plot_type = c("point", "smooth"), rect_fill = FALSE)
```


<!--chapter:end:12-mecoturn.Rmd-->

# Other examples

We've gathered a few tips that make some common issues easier to solve.
We encourage users to contribute unique, special or helpful examples inspired by the packages.


## Custom taxa order in bar plot

The flexible design of the package shows in many aspects.
In some cases, it allows particular challenges to be handled neatly.
For example, Dr. Jarrod contributed a cool answer to the question of how to use custom taxa and a custom order in the bar plot by modifying the data inside the object.
This topic is discussed in the Discussions section of microeco. Here is the link (https://github.com/ChiLiubio/microeco/discussions/45).


## The importance of tidy_taxonomy function

Taxonomic classification with standard prefixes is very important for some analyses,
e.g. taxonomic abundance plotting and biomarker finding.
The tidy_taxonomy function in the microeco package is designed to give the taxa standard prefixes.
See these Issues for detailed examples: (https://github.com/ChiLiubio/microeco/issues/32) and (https://github.com/ChiLiubio/microeco/issues/22).
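
To show what a standard prefix means in practice, the following base R sketch tidies a toy taxonomy table. It is not the implementation of tidy_taxonomy. The helper `add_prefix` and the example table are hypothetical, and only two ranks are shown.

```{r, echo = TRUE, eval = FALSE}
# toy taxonomy table with inconsistent labels (hypothetical values)
tax <- data.frame(
	Kingdom = c("Bacteria", "k__Bacteria", "Archaea"),
	Phylum = c("Proteobacteria", "p__Firmicutes", NA),
	stringsAsFactors = FALSE
)
prefixes <- c(Kingdom = "k__", Phylum = "p__")

# hypothetical helper: strip any existing one-letter prefix, then add the standard one
# missing ranks become the bare prefix, e.g. "p__"
add_prefix <- function(x, prefix) {
	x[is.na(x)] <- ""
	paste0(prefix, sub("^[a-z]__", "", x))
}
tax[] <- Map(add_prefix, tax, prefixes[colnames(tax)])
tax
```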


## Show the abundance of unknown taxa

In the trans_abund class, the default operation filters the unclassified taxa, such as those labeled only with 'g__'.
Sometimes, these unknown taxa are useful to represent the uncultured species.
In that case, the abundance of the unknown taxa can be used to show the ratio of uncultured taxa in the system.
Please see the solution in this Issue (https://github.com/ChiLiubio/microeco/issues/165).
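
The idea can be sketched with base R on a toy genus-level abundance table. This is only an illustration with made-up numbers and hypothetical object names, not the solution from the Issue above.

```{r, echo = TRUE, eval = FALSE}
# toy genus-level abundance table: rows are taxa, columns are samples (hypothetical values)
genus_abund <- data.frame(
	S1 = c(40, 35, 25),
	S2 = c(10, 30, 60),
	row.names = c("g__Bacillus", "g__Pseudomonas", "g__")
)
# a bare prefix such as "g__" marks taxa without a classification at this rank
is_unknown <- grepl("^[a-z]__$", rownames(genus_abund))
# percentage of the abundance assigned to unclassified genera in each sample
unknown_ratio <- 100 * colSums(genus_abund[is_unknown, , drop = FALSE]) / colSums(genus_abund)
unknown_ratio
```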


## Question of prefix in the taxa

The prefixes of taxa in the taxonomic table may affect the subsequent plotting, e.g. the text in the legend.
Please see these Issues (https://github.com/ChiLiubio/microeco/issues/32), (https://github.com/ChiLiubio/microeco/issues/7) 
and (https://github.com/ChiLiubio/microeco/issues/15).


## The use of phylogenetic tree

One of the Issues covers the basic use of the phylogenetic tree in the microeco package (https://github.com/ChiLiubio/microeco/issues/33).

## Mantel test + Correlation heatmap

Sometimes, Mantel test results and correlations among environmental factors are visualized together to deliver more information.
In this example, we show how to extend the Mantel test results from the `trans_env` class with the ggcor package.
Please install the `ggcor` package according to its documentation (https://github.com/mj163163/ggcor-1).
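
For readers unfamiliar with the method, the following sketch shows what a simple (not partial) Mantel test computes, using base R and simulated data. It is independent of the `cal_mantel` function in `trans_env`, uses Euclidean instead of Bray-Curtis distance for brevity, and all object names are made up for illustration.

```{r, echo = TRUE, eval = FALSE}
set.seed(42)
# toy data: 10 samples with 5 taxa and 1 environmental factor (simulated)
comm <- matrix(rpois(50, lambda = 10), nrow = 10)
env <- matrix(rnorm(10), nrow = 10)

# pairwise distances among samples
d_comm <- as.matrix(dist(comm))
d_env <- as.matrix(dist(env))
lower <- lower.tri(d_comm)

# Mantel statistic: correlation between the two sets of pairwise distances
r_obs <- cor(d_comm[lower], d_env[lower])

# permutation test: shuffle the samples (rows and columns together) of one matrix
n_perm <- 999
r_perm <- replicate(n_perm, {
	idx <- sample(nrow(d_comm))
	cor(d_comm[idx, idx][lower], d_env[lower])
})
# one-sided p-value: proportion of permuted statistics at least as large as the observed one
p_value <- (sum(r_perm >= r_obs) + 1) / (n_perm + 1)
c(r = r_obs, p = p_value)
```

The workflow with `trans_env` and ggcor is as follows.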

```{r, echo = TRUE, eval = FALSE}
# prepare data
library(microeco)
library(magrittr)
data(dataset)
data(env_data_16S)
dataset$sample_table <- data.frame(dataset$sample_table, env_data_16S[rownames(dataset$sample_table), ])
# extract two phyla to show the steps
d1 <- clone(dataset)
d1$tax_table <- d1$tax_table[d1$tax_table$Phylum == "p__Proteobacteria", ]
d1$tidy_dataset()
d1$cal_betadiv()
d2 <- clone(dataset)
d2$tax_table <- d2$tax_table[d2$tax_table$Phylum == "p__Bacteroidetes", ]
d2$tidy_dataset()
d2$cal_betadiv()
# first perform mantel test
t1 <- trans_env$new(dataset = d1, env_cols = 8:15)
t1$cal_mantel(use_measure = "bray", partial_mantel = TRUE)
t2 <- trans_env$new(dataset = d2, env_cols = 8:15)
t2$cal_mantel(use_measure = "bray", partial_mantel = TRUE)
# extract a part of the results 
x1 <- data.frame(spec = "Proteobacteria", t1$res_mantel) %>% .[, c(1, 3, 6, 8)]
x2 <- data.frame(spec = "Bacteroidetes", t2$res_mantel) %>% .[, c(1, 3, 6, 8)]
# rename columns
colnames(x1) <- colnames(x2) <- c("spec", "env", "r", "p.value")
# generate interval data
x1 %<>% dplyr::mutate(rd = cut(r, breaks = c(-Inf, 0.3, 0.6, Inf), labels = c("< 0.3", "0.3 - 0.6", ">= 0.6"), right = FALSE),
	pd = cut(p.value, breaks = c(-Inf, 0.01, 0.05, Inf), labels = c("< 0.01", "0.01 - 0.05", ">= 0.05"), right = FALSE))
x2 %<>% dplyr::mutate(rd = cut(r, breaks = c(-Inf, 0.3, 0.6, Inf), labels = c("< 0.3", "0.3 - 0.6", ">= 0.6"), right = FALSE),
	pd = cut(p.value, breaks = c(-Inf, 0.01, 0.05, Inf), labels = c("< 0.01", "0.01 - 0.05", ">= 0.05"), right = FALSE))

# combine two tables
plot_table <- rbind(x1, x2)
# require ggcor package (https://github.com/mj163163/ggcor-1)
library(ggplot2)
library(ggcor)
set_scale()

g1 <- quickcor(t1$data_env, type = "upper") +
	geom_square() +
	anno_link(aes(colour = pd, size = rd), data = plot_table) +
	scale_size_manual(values = c(0.5, 1.5, 3)) +
	scale_colour_manual(values = c("#D95F02", "#1B9E77", "#A2A2A288")) +
	guides(size = guide_legend(title = "Mantel's r", override.aes = list(colour = "grey35"), order = 2),
		colour = guide_legend(title = "Mantel's p", override.aes = list(size = 3), order = 1),
		fill = guide_colorbar(title = "Pearson's r", order = 3))

g1
```

```{r, out.width = "650px", fig.align="center", echo = FALSE}
knitr::include_graphics("Images/plot_env_mantel_ggcor.png")
```
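
The interval labels used for the Mantel results read as left-closed bins (for example ">= 0.05"). Note that `cut()` closes intervals on the right by default, so the `right` argument decides where a value lying exactly on a break ends up. A small sketch with made-up p-values:

```{r, echo = TRUE, eval = FALSE}
# hypothetical p-values, two of them lying exactly on the breaks
p <- c(0.005, 0.01, 0.03, 0.05, 0.2)
breaks <- c(-Inf, 0.01, 0.05, Inf)
labs <- c("< 0.01", "0.01 - 0.05", ">= 0.05")
res <- data.frame(
	p = p,
	# default: intervals of the form (a, b]
	right_closed = cut(p, breaks = breaks, labels = labs),
	# right = FALSE: intervals of the form [a, b)
	left_closed = cut(p, breaks = breaks, labels = labs, right = FALSE)
)
res
```

With the default setting, 0.05 is placed in the "0.01 - 0.05" bin, while `right = FALSE` places it in ">= 0.05", in line with the label.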


<!--chapter:end:13-Other_examples.Rmd-->

`r if (knitr::is_html_output()) '
# References {-}
'`

<!--chapter:end:14-references.Rmd-->