CompositionAmplicondata2.Rmd

---
title: "Microbiome composition for Amplicon data"
author: "Leo Lahti, Sudarshan Shetty et al. `r Sys.Date()`"
bibliography: 
- bibliography.bib
output:
  BiocStyle::html_document:
    number_sections: no
    toc: yes
    toc_depth: 4
    toc_float: true
    self_contained: true
    thumbnails: true
    lightbox: true
    gallery: true
    use_bookdown: false
    highlight: haddock    
---
  <!--
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{microbiome tutorial - composition}
  %\usepackage[utf8]{inputenc}
  %\VignetteEncoding{UTF-8}  
-->
  
  
Also see [phyloseq barplot examples](http://joey711.github.io/phyloseq/plot_bar-examples.html).
  
Check the [core microbiome page](http://microbiome.github.io/microbiome/CoremicrobiotaAmplicon.html) which shows how to read the your files into R and make a phyloseq object.  

Here, an example data is used which is not available in the package currently due to size limitations. We will directly strat from a ready phyloseq object. Please test your own data and give your feedback on the [issues page](https://github.com/microbiome/microbiome/issues). 

  
```{r composition-amplicon-example1, warning=FALSE, message=FALSE, eval=FALSE}
# Example data
library(microbiome)
# Try another theme
# from https://github.com/hrbrmstr/hrbrthemes
# you can install these if you don't have it already.
# devtools::install_github("hrbrmstr/hrbrthemes")
library(hrbrthemes)
library(gcookbook)
library(tidyverse)
library(dplyr)

data("DynamicsIBD") #This data is currently unavialable due to size limitations
ps1 <- DynamicsIBD
colnames(tax_table(ps1))
```

As you can see the taxonomic classification is just lablled as "Rank1" ... "Rank7". We need to change this to proper designation and also do some formatting of the data. This can be a useful example for understanding simple file processing in R.  

```{r, composition-amplicon-example1a, eval=FALSE}
# First change the column names of the taxonomy table in phyloseq to following:

colnames(tax_table(ps1)) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species" )

tax_table(ps1)[tax_table(ps1)[,"Kingdom"]== "NA", "Kingdom" ] <- "Unidentified_Kingdom"

tax_table(ps1)[tax_table(ps1)[,"Phylum"]== "p__", "Phylum" ] <- "p__Unidentified_Phylum"

# make a dataframe for taxonomy information.

taxic <- as.data.frame(ps1@tax_table) 
otu.df <- abundances(ps1)

# make a dataframe for OTU information.
otu.df <- as.data.frame(otu.df)

# check the rows and columns
# head(otu.df) 

# Add the OTU ids from OTU table into the taxa table at the end.
taxic$OTU <- row.names.data.frame(otu.df) 

# You can see that we now have extra taxonomy levels.
colnames(taxic)

# convert it into a matrix.
taxmat <- as.matrix(taxic)

# convert into phyloseq compaitble file.
new.tax <- tax_table(taxmat)  

# incroporate into phyloseq Object
tax_table(ps1) <- new.tax 

```

# Composition barplots

The compositon plots can be shown either as barplots or heatmaps. Both examples are show below.  

## Plot relative abundance  

Now we can improve the plot further.  
Let's try at Family level. 

```{r composition-amplicon-example3, fig.width=12, fig.height=5, out.width="400px", fig.show="hold", warning=FALSE, message=FALSE, eval=FALSE}
library(phyloseq)
# for example purpose we will remove samples with less than 10000
ps2 = prune_samples(sample_sums(ps1)>=2000, ps1)

# To speed up the example we will use only those OTUs that are detected 100 times and present in 50% of the samples.

pseq2 <- microbiome::core(ps2, detection = 100, prevalence = .5)

# Improve the plotting
tax_table(pseq2)[tax_table(pseq2)[,"Family"]== "f__", "Family" ] <- "f__Unclassified Family"

# We will also remove the "f__" patterns for cleaner labels
tax_table(pseq2)[,colnames(tax_table(pseq2))] <- gsub(tax_table(pseq2)[,colnames(tax_table(pseq2))],pattern="[a-z]__",replacement="")

# merge at family level.

pseq.fam <- aggregate_taxa(pseq2, "Family")

p.fam <- plot_composition(pseq.fam, sample.sort = NULL, otu.sort = NULL,
  x.label = "ibd_subtype", plot.type = "barplot", verbose = FALSE)

print(p.fam)

```

## Plot relative abundance

```{r composition-amplicon-example4, fig.width=12, fig.height=5, out.width="400px", fig.show="hold", warning=FALSE, message=FALSE, eval=FALSE}

pseq.famrel <- transform(pseq.fam, "compositional")

p.famrel <- plot_composition(pseq.famrel, sample.sort = NULL, otu.sort = NULL,
  x.label = "ibd_subtype", plot.type = "barplot", verbose = FALSE)

print(p.famrel)

# further improvements can be done as follows  

p.famrel <- plot_composition(pseq.famrel, 
                             sample.sort = NULL, 
                             otu.sort = NULL, 
                             x.label = "ibd_subtype", 
                             plot.type = "barplot", 
                             verbose = FALSE) + 
  guides(fill = guide_legend(ncol = 1)) + 
  scale_y_percent() + 
  labs(x = "Samples", 
       y = "Relative abundance (%)",
       title = "Relative abundance data", 
       subtitle = "Subtitle",
       caption = "Caption text.") + 
     theme_ipsum(grid="Y")

print(p.famrel)
```

Average by group

```{r composition-amplicon-example5, fig.width=12, fig.height=5, out.width="400px", fig.show="hold", warning=FALSE, message=FALSE, eval=FALSE}
# Averaged by group
p <- plot_composition(pseq.famrel,
                      average_by = "ibd_subtype") + 
  guides(fill = guide_legend(ncol = 1)) + 
  scale_y_percent() + 
  labs(x = "Samples", 
       y = "Relative abundance (%)",
       title = "Relative abundance data", 
       subtitle = "Subtitle",
       caption = "Caption text.") + 
     theme_ipsum(grid="Y")

print(p)

```

# Heatmap composition  

```{r composition-amplicon-example6, fig.width=12, fig.height=5, out.width="400px", fig.show="hold", warning=FALSE, message=FALSE, eval=FALSE}

pseq.famlog <- transform(pseq.fam, "log10")

p.famrel.heatmap <- plot_composition(pseq.famlog, 
                             sample.sort = NULL, 
                             otu.sort = NULL, 
                             x.label = "ibd_subtype", 
                             plot.type = "heatmap", 
                             verbose = FALSE)

print(p.famrel.heatmap)
```


# Plot taxa prevalence

We use the Dynamics IBD data set from [Halfvarson J., et al. Nature
Microbiology, 2017](http://www.nature.com/articles/nmicrobiol20174) as
downloaded from [Qiita ID
1629](https://qiita.ucsd.edu/study/description/1629). This function
allows you to have an overview of OTU prevalences alongwith their
taxonomic affiliations. This will aid in checking if you filter OTUs
based on prevalence, then what taxonomic affliations will be lost.

```{r plot_prev1, fig.height=6, fig.width=8, eval=FALSE}

# We will use the ps1 object we created previously.
print(ps1)
# Use sample and taxa subset to speed up example
p0 <- subset_samples(ps1, sex == "male" & timepoint == 1)

# For the available taxonomic levels
plot_taxa_prevalence(p0, "Phylum")
```