02-basic_class.Rmd

# Basic class
The microtable class is the basic class.
All the other classes depend on the microtable class.

## microtable class

　Many tools can be used for the bioinformatic analysis of amplicon sequencing data, such as QIIME [@Caporaso_QIIME_2010], QIIME2 [@Bolyen_Reproducible_2019], 
usearch (https://www.drive5.com/usearch/), mothur [@Schloss_Introducing_2009], 
SILVAngs (https://ngs.arb-silva.de/silvangs/), 
and RDP (http://rdp.cme.msu.edu/).
Although the formats of results may be distinctive across various tools, the main files can be generally classified into the following parts:
(1) OTU/ASV table, i.e. the species-sample abundance table; 
(2) taxonomy table, the taxonomic assignment table;
(3) representative sequences;
(4) phylogenetic tree;
(5) metadata. It is generally useful to create a detailed sample information table to store all the sample information, 
including the environmental data.

　The microtable class is the basic class and designed to store the basic data for all the downstream analysis in the microeco package.
At least, the OTU table (i.e. species-sample abundance table) should be provided for creating microtable object.
Thus, the microtable class can recognize the sample information table is missing and create a default sample table according to
the sample names in otu_table.
To make the file reading more convenient,
we also build another R package file2meco (https://github.com/ChiLiubio/file2meco) to read the output files of some tools into microtable object.
Currently, those tools/softwares include not only commonly-used QIIME [@Caporaso_QIIME_2010] and QIIME2[@Bolyen_Reproducible_2019], 
but also some metagenomic tools, such as HUMAnN [@Franzosa_Species_2018] and kraken2 [@Wood_Improved_2019].
In this tutorial, we use the data inside the package microeco to show some basic operations.

### Example

　The 16S rRNA gene sequencing results in the example data of the package is used to show the main part of the tutorial.
This dataset is the 16S rRNA gene Miseq sequencing results of wetland soils in China published by An et al. [@An_Soil_2019], 
who surveyed soil bacterial communities in Chinese inland wetlands (IW), 
coastal wetland (CW) and Tibet plateau wetlands (TW) using amplicon sequencing.
These wetlands include both saline and non-saline samples.
The sample information table have 4 columns: "SampleID", "Group", "Type" and "Saline".
The column "SampleID" is same with the rownames.
The column "Group" represents the IW, CW and TW.
The column "Type" represents the sampling region: northeastern region (NE), northwest region (NW), North China area (NC),
middle-lower reaches of the Yangtze River (YML), southern coastal area (SC), upper reaches of the Yangtze River (YU), Qinghai-Tibet Plateau (QTP). 
The column "Saline" represents the saline soils and non-saline soils.
In this dataset, the environmental factor table is separated from the sample information table.
It is of course doable to put all the environmental data into sample information table.


```{r, echo = TRUE}
library(microeco)
# load the example data; 16S rRNA gene amplicon sequencing dataset
data(sample_info_16S)
data(otu_table_16S)
data(taxonomy_table_16S)
# use phylogenetic tree to calculate phylogeny-based alpha and beta metrics
data(phylo_tree_16S)
# load the environmental data which is detached from sample table
data(env_data_16S)
# use pipe operator in magrittr package
library(magrittr)
# set.seed is used to fix the random number generation to make the results repeatable
set.seed(123)
# make the plotting background same with the tutorial
library(ggplot2)
theme_set(theme_bw())
```

Make sure that the data types of sample_table, otu_table and tax_table are all data.frame as the following part shows.

```{r, echo = TRUE}
class(otu_table_16S)
```

```{r, echo = TRUE, eval = FALSE}
otu_table_16S[1:5, 1:5]
```

```{r, echo = FALSE}
pander::pander(otu_table_16S[1:5, 1:5])
```

```{r, echo = TRUE}
class(taxonomy_table_16S)
```

```{r, echo = TRUE, eval = FALSE}
taxonomy_table_16S[1:5, 1:3]
```

```{r, echo = FALSE}
pander::pander(taxonomy_table_16S[1:5, 1:3])
```

Generally, users' taxonomic table may have some chaotic information, such as NA, unidentified and unknown.
These information can potentially influence the following taxonomic abundance calculation and other taxonomy-based analysis.
So it is usually necessary to clean this data using the **tidy_taxonomy** function.
Another very important result of this operation is to **unify the taxonomic prefix** automatically,
e.g. transforming D_1__ to p__ for phylum level or adding p__ to phylum directly.


```{r, echo = TRUE, eval = FALSE}
# make the taxonomic information unified, very important
taxonomy_table_16S %<>% tidy_taxonomy
```

The rownames of sample_table in microtable object (i.e. sample names) are used for selecting samples/groups in all the related operations in the package.
**Before creating microtable object, make sure that the rownames of sample information table are sample names**.


```{r, echo = TRUE}
class(sample_info_16S)
```

```{r, echo = TRUE, eval = FALSE}
sample_info_16S[1:5, ]
```

```{r, echo = FALSE}
pander::pander(sample_info_16S[1:5, ])
```

In this example, the environmental data is stored in the env_data_16S alone. 
The user can also directly integrate those data into the sample information table.

```{r, echo = TRUE}
class(env_data_16S)
```

```{r, echo = FALSE}
pander::pander(env_data_16S[1:5, 1:5])
```

```{r, echo = TRUE}
class(phylo_tree_16S)
```

Then, we create an object of microtable class.
This operation is very similar with the package phyloseq[@Mcmurdie_phyloseq_2013], but in microeco it is more brief.
The otu_table in the microtable class must be the species-sample format: rownames - OTU/ASV/other names; colnames - sample names.
**The colnames in otu_table must have overlap with rownames of sample_table**.
Otherwise, the following check can filter all the samples of otu_table because of no same sample names between otu_table and sample_table.


```{r, echo = TRUE}
# In R6 class, '$new' is the original method used to create a new object of class
# If you only provide abundance table, the class can help you create a sample info table
dataset <- microtable$new(otu_table = otu_table_16S)
class(dataset)
# generally add the metadata
dataset <- microtable$new(otu_table = otu_table_16S, sample_table = sample_info_16S)
dataset
# Let's create a microtable object with more information
dataset <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S)
dataset
```

If the users want to know more details on microtable class, 
please see the help document of the class using the following help command.
For example, see the phylo_tree parameter of microtable$new() for reading phylogenetic tree.

```{r, echo = TRUE, eval = FALSE}
# search the class name, not the function name
?microtable
```


Then, we remove OTUs which are not assigned in the Kingdom "k__Archaea" or "k__Bacteria". 
```{r, echo = TRUE}
dataset$tax_table %<>% base::subset(Kingdom == "k__Archaea" | Kingdom == "k__Bacteria")
dataset
```

We also remove OTUs with the taxonomic assignments "mitochondria" or "chloroplast".

```{r, echo = TRUE}
# This will remove the lines containing the taxa word regardless of taxonomic ranks and ignoring word case in the tax_table.
# So if you want to filter some taxa not considerd pollutions, please use subset like the previous operation to filter tax_table.
dataset$filter_pollution(taxa = c("mitochondria", "chloroplast"))
print(dataset)
```


To make the OTU and sample information consistent across all files in the dataset object, we use function **tidy_dataset** to trim the dataset.


```{r, echo = TRUE}
dataset$tidy_dataset()
print(dataset)
```


Then let's use sample_sums() to check the sequence numbers in each sample.

```{r, echo = TRUE}
dataset$sample_sums() %>% range
```


Sometimes, in order to reduce the effects of sequencing depth on the diversity measurements, 
it is optional perform the resampling to make the sequence number equal for each sample.
The function rarefy_samples can invoke the function tidy_dataset automatically before and after the rarefying.

```{r, echo = TRUE}
# As an example, use 10000 sequences in each sample
dataset$rarefy_samples(sample.size = 10000)
dataset$sample_sums() %>% range
```

Then, let's calculate the taxa abundance at each taxonomic rank using cal_abund().
This function **return a list called taxa_abund stored in the microtable object**. 
This list contain several data frame of the abundance information at each taxonomic rank.
It's worth noting that the cal_abund() function can be used to **solve more complicated cases with special parameters**,
such as supporting both the relative and absolute abundance calculation and selecting the partial 'taxonomic' columns.
Those have been shown in file2meco package part (https://chiliubio.github.io/microeco_tutorial/file2meco-package.html#humann-metagenomic-results) with complex metagenomic dataset.

```{r, echo = TRUE}
# use default parameters
dataset$cal_abund()
# return dataset$taxa_abund
class(dataset$taxa_abund)
```

```{r, echo = TRUE, eval = FALSE}
# show part of the relative abundance at Phylum level
dataset$taxa_abund$Phylum[1:5, 1:5]
```

```{r, echo = FALSE}
pander::pander(dataset$taxa_abund$Phylum[1:5, 1:5])
```


The function save_abund() can be used to save the taxa abundance file to a local place easily.

```{r, echo = TRUE, eval = FALSE}
dataset$save_abund(dirpath = "taxa_abund")
```

Then, let's calculate the alpha diversity.
The result is also stored in the object microtable automatically.
For the definition of each alpha diversity index, please see http://scikit-bio.org/docs/latest/generated/skbio.diversity.alpha.html

```{r, echo = TRUE}
# If you want to add Faith's phylogenetic diversity, use PD = TRUE, this will be a little slow
dataset$cal_alphadiv(PD = FALSE)
# return dataset$alpha_diversity
class(dataset$alpha_diversity)
```

```{r, echo = TRUE, eval = FALSE}
# save dataset$alpha_diversity to a directory
dataset$save_alphadiv(dirpath = "alpha_diversity")
```

Let's go on to beta diversity with function cal_betadiv().
We provide four most frequently used indexes: Bray-curtis, Jaccard, weighted Unifrac and unweighted unifrac.

```{r, echo = FALSE, eval = TRUE, message = FALSE}
invisible(dataset$cal_betadiv(unifrac = FALSE))
```

```{r, echo = TRUE, eval = FALSE}
# If you do not want to calculate unifrac metrics, use unifrac = FALSE
# require GUniFrac package installed
dataset$cal_betadiv(unifrac = TRUE)
# return dataset$beta_diversity
class(dataset$beta_diversity)
# save dataset$beta_diversity to a directory
dataset$save_betadiv(dirpath = "beta_diversity")
```

### Other examples

From v0.7.0, the microtable$new has a new parameter auto_tidy. if auto_tidy = TRUE, the function can automatically use tidy_dataset to make all files uniform.
Then, all other functions in microtable will also do this. But if the user changes the file in microtable object,
the class can not recognize this modification, the user should use tidy_dataset function to manually trim the microtable object.

```{r, echo = TRUE, eval = TRUE}
test <- microtable$new(sample_table = sample_info_16S[1:40, ], otu_table = otu_table_16S, auto_tidy = FALSE)
test
test1 <- microtable$new(sample_table = sample_info_16S[1:40, ], otu_table = otu_table_16S, auto_tidy = TRUE)
test1
test1$sample_table %<>% .[1:10, ]
test1
test1$tidy_dataset()
test1
```

The phylogenetic tree can be read with read.tree function in ape package.

```{r, echo = TRUE, eval = FALSE}
# use the example data rep_phylo.tre in file2meco package https://chiliubio.github.io/microeco_tutorial/file2meco-package.html#qiime
phylo_file_path <- system.file("extdata", "rep_phylo.tre", package="file2meco")
tree <- ape::read.tree(phylo_file_path)
```

### Key points

  + sample_table: rownames of sample_table must be sample names used
  + otu_table: rownames must be feature names; colnames must be sample names
  + microtable class: creating microtable object requires at least one file input (otu_table)
  + tidy_taxonomy(): necessary to make taxonomic table have unified format
  + tidy_dataset(): necessary to clean files in microtable object
  + cal_abund(): powerful and flexible to cope with complex cases in tax_table, see the parameters
  + taxa_abund: taxa_abund is a list stored in microtable object and have several data frame
  + beta_diversity: beta_diversity is also a list stored in microtable object and have several distance matrix


### Other functions

  + add_rownames2taxonomy(): add the rownames of tax_table as the last column of tax_table directly; very useful in some analysis, e.g. OTU/ASV biomarker finding.
  + taxa_sums(): sum the abundance for each taxa; very useful for taxa abundance checking and filtering
  + merge_samples(): merge samples according to a specific group of sample_table to generate a new microtable object
  + merge_taxa(): merge taxa according to a specific taxonomic rank level of tax_table to generate a new microtable object
  + rename_taxa(): rename the OTU/ASV names in all the files of microtable object
  + sample_names(): output sample names in microtable object
  + taxa_names(): output taxa names in microtable object