---
title: "Core microbiome"
author: "Leo Lahti, Sudarshan Shetty et al."
bibliography: 
- bibliography.bib
output:
  BiocStyle::html_document:
    number_sections: no
    toc: yes
    toc_depth: 4
    toc_float: true
    self_contained: true
    thumbnails: true
    lightbox: true
    gallery: true
    use_bookdown: false
    highlight: haddock
   
---
<!--
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{microbiome tutorial - core}
  %\usepackage[utf8]{inputenc}
  %\VignetteEncoding{UTF-8}  
-->

See also the related functions for the analysis of rare and variable taxa (rare_members; rare_abundance; low_abundance).  


```{r setup, message = FALSE, warning = FALSE, results = 'hide'}
library("devtools")
#install_github("microbiome/microbiome")
```


# HITChip Data

Load example data:


```{r core-prevalence, warning=FALSE, message=FALSE}
# Load data
library(microbiome)
data(peerj32)

# Rename the data
pseq <- peerj32$phyloseq

# Calculate compositional version of the data
# (relative abundances)
pseq.rel <- microbiome::transform(pseq, "compositional")
```


## Prevalence of taxonomic groups

Relative population frequencies (fraction of samples in which each taxon is detected), at a 1% compositional abundance threshold:

```{r core-prevalence2}
head(prevalence(pseq.rel, detection = 1/100, sort = TRUE))
```


Absolute population frequencies (sample count):

```{r core-prevalence2b}
head(prevalence(pseq.rel, detection = 1/100, sort = TRUE, count = TRUE))
```
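
What the two calls above compute can be sketched in base R on a toy matrix. This is a minimal illustration, not the package implementation: the matrix values and the `toy.*` names are hypothetical, and the strict "greater than" comparison against the detection threshold is an assumption.

```r
# Toy relative abundances: 3 taxa (rows) x 4 samples (columns); hypothetical values
toy.abund <- matrix(c(0.500, 0.60, 0.458, 0.700,
                      0.498, 0.00, 0.540, 0.295,
                      0.002, 0.40, 0.002, 0.005),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                    paste0("S", 1:4)))

toy.detection <- 1/100

# A taxon counts as detected in a sample when its abundance
# exceeds the detection threshold (assumed strict comparison)
toy.detected <- toy.abund > toy.detection

# Absolute prevalence: number of samples in which the taxon is detected
toy.prev.count <- rowSums(toy.detected)   # TaxonA 4, TaxonB 3, TaxonC 1

# Relative prevalence: fraction of samples in which the taxon is detected
toy.prev.rel <- rowMeans(toy.detected)    # TaxonA 1, TaxonB 0.75, TaxonC 0.25
```

The relative version is simply the sample count divided by the number of samples.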


## Core microbiota analysis

If you only need the names of the core taxa, do as follows. This returns the taxa that exceed the given prevalence and detection thresholds. 

```{r core-members, message=FALSE, warning=FALSE, eval = FALSE}
core.taxa.standard <- core_members(pseq.rel, detection = 0, prevalence = 50/100)
```
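
The selection logic can be sketched in base R as follows. The prevalence values and `toy.*` names are hypothetical, and the strict comparison (a taxon at exactly the threshold is excluded) is an assumption about the default behaviour, not a statement of the package implementation.

```r
# Hypothetical prevalence values (fraction of samples) for four taxa
toy.prev <- c(TaxonA = 1.00, TaxonB = 0.75, TaxonC = 0.50, TaxonD = 0.10)

# Core members: taxa whose prevalence exceeds the threshold.
# Assumption: the comparison is strict, so TaxonC (exactly 50%) is excluded.
toy.core.members <- names(toy.prev)[toy.prev > 50/100]
toy.core.members   # "TaxonA" "TaxonB"
```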


A full phyloseq object of the core microbiota is obtained as follows:

```{r core-data, message=FALSE, warning=FALSE}
pseq.core <- core(pseq.rel, detection = 0, prevalence = .5)
```

We can also collapse the rare taxa into an "Other" category:

```{r core_aggregate_rare, message=FALSE, warning=FALSE}
pseq.core2 <- aggregate_rare(pseq.rel, "Genus", detection = 0, prevalence = .5)
```
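
Collapsing the non-core taxa amounts to summing their rows into a single row. A minimal base R sketch on hypothetical data follows. The `toy.*` names and the 0.7 prevalence threshold are chosen only for illustration, and this is not the package implementation.

```r
# Toy relative abundances: 4 taxa (rows) x 3 samples (columns); hypothetical values
toy.abund <- matrix(c(0.5, 0.6, 0.4,
                      0.3, 0.3, 0.5,
                      0.2, 0.0, 0.0,
                      0.0, 0.1, 0.1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("Taxon", LETTERS[1:4]),
                                    paste0("S", 1:3)))

# Core taxa: present (abundance > 0) in more than 70% of the samples
toy.is.core <- rowMeans(toy.abund > 0) > 0.7

# Keep the core rows and sum all remaining rows into "Other"
toy.collapsed <- rbind(
  toy.abund[toy.is.core, , drop = FALSE],
  Other = colSums(toy.abund[!toy.is.core, , drop = FALSE])
)
toy.collapsed
```

Because the rare taxa are summed rather than dropped, each sample still sums to 1 after collapsing.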

Retrieving the core taxa names from the phyloseq object:

```{r core-taxa, message=FALSE, warning=FALSE}
core.taxa <- taxa(pseq.core)
```


## Core abundance and diversity

Total core abundance in each sample (sum of abundances of the core members):

```{r core-ab, message=FALSE, warning=FALSE}
core.abundance <- sample_sums(core(pseq.rel, detection = .01, prevalence = .95))
```
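
As a sketch of what this sum represents, the same quantity can be computed by hand on a toy matrix. The values and `toy.*` names are hypothetical, and the strict comparisons are an assumption.

```r
# Toy relative abundances: 3 taxa (rows) x 3 samples (columns); hypothetical values
toy.abund <- matrix(c(0.6, 0.5, 0.7,
                      0.3, 0.0, 0.2,
                      0.1, 0.5, 0.1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                    paste0("S", 1:3)))

# Core taxa: abundance above 1% in more than 95% of the samples
toy.is.core <- rowMeans(toy.abund > 0.01) > 0.95

# Core abundance per sample: column sums over the core rows only
toy.core.abundance <- colSums(toy.abund[toy.is.core, , drop = FALSE])
toy.core.abundance   # S1 0.7, S2 1.0, S3 0.8
```

TaxonB is missing from one sample, so it falls below the 95% prevalence threshold and does not contribute to the core abundance.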


## Core visualization

### Core line plots

Determine the core microbiota across a range of abundance (detection) and prevalence
thresholds with the blanket analysis [(Salonen et al. CMI, 2012)](http://onlinelibrary.wiley.com/doi/10.1111/j.1469-0691.2012.03855.x/abstract).

```{r core2, fig.width=9, fig.height=6, out.width="400px", warning=FALSE}
# With compositional (relative) abundances
det <- c(0, 0.1, 0.5, 2, 5, 20)/100
prevalences <- seq(.05, 1, .05)
plot_core(pseq.rel, 
          prevalences = prevalences, 
          detections = det, 
          plot.type = "lineplot") + 
  xlab("Relative Abundance (%)")
```
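
The idea behind the line plot, counting the core members at every combination of detection and prevalence threshold, can be sketched in base R. The toy matrix and the `toy.*` names are hypothetical, and this is an illustration rather than the `plot_core` implementation.

```r
# Toy relative abundances: 3 taxa (rows) x 4 samples (columns); hypothetical values
toy.abund <- matrix(c(0.60, 0.70, 0.50, 0.800,
                      0.35, 0.30, 0.00, 0.195,
                      0.05, 0.00, 0.50, 0.005),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                    paste0("S", 1:4)))

toy.det   <- c(0, 0.01, 0.1)    # detection thresholds
toy.prevs <- c(0.2, 0.5, 0.8)   # prevalence thresholds

# Core size for each (prevalence, detection) pair:
# rows follow toy.prevs, columns follow toy.det
toy.core.size <- sapply(toy.det, function(d) {
  prev <- rowMeans(toy.abund > d)
  sapply(toy.prevs, function(p) sum(prev > p))
})
dimnames(toy.core.size) <- list(prevalence = toy.prevs, detection = toy.det)
toy.core.size
```

The core can only shrink or stay the same as either threshold becomes stricter, which is why the lines in such a plot never rise.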


### Core heatmaps

This visualization method has been used for instance in [Intestinal microbiome landscaping: Insight in community assemblage and implications for microbial modulation strategies](https://academic.oup.com/femsre/article/doi/10.1093/femsre/fuw045/2979411/Intestinal-microbiome-landscaping-insight-in#58802539). Shetty et al. _FEMS Microbiology Reviews_ fuw045, 2017.

Note that you can order the taxa on the heatmap with the taxa.order argument.

```{r core-example3a, fig.width=6, fig.height=8, out.width="400px", warning=FALSE}

# Core with compositionals:
library(RColorBrewer)
library(reshape)

prevalences <- seq(.05, 1, .05)

detections <- round(10^seq(log10(0.01), log10(.2), length = 9), 3)

# Also define gray color palette
gray <- gray(seq(0,1,length=5))

# pseq.rel is the compositional object created in the first chunk:
# pseq.rel <- microbiome::transform(pseq, 'compositional')
# min.prevalence is set to the 100th highest prevalence value
p <- plot_core(pseq.rel,
               plot.type = "heatmap", 
               colours = gray,
               prevalences = prevalences, 
               detections = detections, 
               min.prevalence = prevalence(pseq.rel, sort = TRUE)[100]) +
  labs(x = "Detection Threshold\n(Relative Abundance (%))") +
    
  #Adjusts axis text size and legend bar height
  theme(axis.text.y= element_text(size=8, face="italic"),
        axis.text.x.bottom=element_text(size=8),
        axis.title = element_text(size=10),
        legend.text = element_text(size=8),
        legend.title = element_text(size=10))

print(p)
```
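
The detection thresholds in the chunk above are spaced evenly on a logarithmic scale. A small base R check of that construction follows. The `toy.*` names are hypothetical.

```r
# Nine thresholds between 1% and 20%, evenly spaced on a log10 scale
toy.thresholds <- 10^seq(log10(0.01), log10(0.2), length.out = 9)

# On a log scale, consecutive thresholds differ by a constant ratio
toy.ratios <- toy.thresholds[-1] / toy.thresholds[-length(toy.thresholds)]

# Rounded values, as used for the heatmap axis
round(toy.thresholds, 3)
```

Log spacing puts more thresholds at the low-abundance end, where prevalence typically changes fastest.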


```{r core-example3b, fig.width=9, fig.height=6, out.width="400px",warning=FALSE}

# Core with absolute counts, a horizontal view,
# and a minimum population prevalence (min.prevalence = .2, i.e. 20%)
detections <- seq(from = 50, to = round(max(abundances(pseq))/10, -1), by = 100)

p <- plot_core(pseq, plot.type = "heatmap",
               prevalences = prevalences,
               detections = detections,
               colours = rev(brewer.pal(5, "Spectral")),
               min.prevalence = .2, horizontal = TRUE) +
  theme(axis.text.x= element_text(size=8, face="italic", hjust=1),
        axis.text.y= element_text(size=8),
        axis.title = element_text(size=10),
        legend.text = element_text(size=8),
        legend.title = element_text(size=10))

print(p)
```
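
The count-based threshold grid above relies on `round()` with a negative `digits` argument, which rounds to the nearest ten. A short sketch with a hypothetical maximum count of 12345:

```r
# Hypothetical maximum abundance (read count) in the data
toy.max <- 12345

# One tenth of the maximum, rounded to the nearest ten
toy.upper <- round(toy.max / 10, -1)   # 1230

# Thresholds from 50 up to that limit, in steps of 100
toy.detections <- seq(from = 50, to = toy.upper, by = 100)
toy.detections   # 50, 150, ..., 1150
```

The sequence stops at the last value that does not exceed the upper limit, so the limit itself is usually not one of the thresholds.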

# Core Microbiota using Amplicon data

## Make phyloseq object

This tutorial is useful for analysing output files from [Mothur](https://www.mothur.org/), [QIIME or QIIME2](https://qiime2.org/), or any tool that gives a biom file as output. There is also a simple way to read comma-separated (`*.csv`) files.  

Simple comma-separated files:  

```{r, read-simple-csv-otu-tables, warning=FALSE, message=FALSE, eval=FALSE}
library(microbiome)


otu.file <-
    system.file("extdata/qiita1629_otu_table.csv",
        package='microbiome')

tax.file <- system.file("extdata/qiita1629_taxonomy_table.csv",
        package='microbiome')

meta.file <- system.file("extdata/qiita1629_mapping_subset.csv",
        package='microbiome')

pseq.csv <- read_phyloseq(
          otu.file=otu.file, 
          taxonomy.file=tax.file, 
          metadata.file=meta.file, type = "simple")
```

Biom file:  

```{r, read-otu-biom, eval=FALSE}

# Read the biom file
biom.file <- 
  system.file("extdata/qiita1629.biom", 
              package = "microbiome")

# Read the mapping/metadata file
 meta.file <- 
  system.file("extdata/qiita1629_mapping.csv", 
              package = "microbiome")
# Make phyloseq object
pseq.biom <- read_phyloseq(otu.file = biom.file, 
                         metadata.file = meta.file, 
                         taxonomy.file = NULL, type = "biom")
```


Mothur shared OTUs and Consensus Taxonomy:  

```{r, read-otu-mothur, eval=FALSE}
otu.file <- system.file(
 "extdata/Baxter_FITs_Microbiome_2016_fit.final.tx.1.subsample.shared",
    package='microbiome')

tax.file <- system.file(
 "extdata/Baxter_FITs_Microbiome_2016_fit.final.tx.1.cons.taxonomy",
    package='microbiome')

meta.file <- system.file(
 "extdata/Baxter_FITs_Microbiome_2016_mapping.csv",
    package='microbiome')
 
pseq.mothur <- read_phyloseq(otu.file=otu.file,
        taxonomy.file =tax.file,
        metadata.file=meta.file, type = "mothur")
print(pseq.mothur)
```
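
All of the import examples above locate their input with `system.file()`, which returns an empty string when the file cannot be found. A quick check before reading can avoid a confusing error later. In this sketch the `stats` package stands in for any installed package, and `extdata/no_such_file.csv` is a hypothetical path used only for illustration.

```r
# A path that does not exist yields an empty string
missing.path <- system.file("extdata/no_such_file.csv", package = "stats")

# A file that ships with every installed package
existing.path <- system.file("DESCRIPTION", package = "stats")

nzchar(missing.path)         # FALSE
file.exists(existing.path)   # TRUE
```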

Now, we proceed to core microbiota analysis.

## Core microbiota analysis  

Here we use the data from [Caporaso, J. Gregory, et al. "Moving pictures of the human microbiome." Genome biology 12.5 (2011): R50.](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2011-12-5-r50?report=reader), which is stored as an example data set in [jeevanuDB](https://github.com/microsud/jeevanuDB).

```{r, core-microbiota-amplicon-data, eval=TRUE}
# install
# install.packages("devtools")
# devtools::install_github("microsud/jeevanuDB")

# check the data 
library(jeevanuDB)
ps <- moving_pictures
table(meta(ps)$sample_type, meta(ps)$host_subject_id)
# Filter the data to include only gut samples from M3 subject
ps.m3 <- subset_samples(ps, sample_type == "stool" & host_subject_id == "M3") 
print(ps.m3)
# keep only taxa with positive sums
ps.m3 <- prune_taxa(taxa_sums(ps.m3) > 0, ps.m3)
print(ps.m3)

# Calculate compositional version of the data
# (relative abundances)
ps.m3.rel <- microbiome::transform(ps.m3, "compositional")
```
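
The two preprocessing steps above, dropping taxa with zero total counts and converting to relative abundances, can be sketched in base R. The count matrix and `toy.*` names are hypothetical.

```r
# Toy count table: 3 taxa (rows) x 3 samples (columns); hypothetical values
toy.counts <- matrix(c(10,  0, 30,
                        0,  0,  0,
                       90, 50, 70),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("ASV1", "ASV2", "ASV3"),
                                     c("S1", "S2", "S3")))

# Keep only taxa with positive sums
toy.counts <- toy.counts[rowSums(toy.counts) > 0, , drop = FALSE]

# Compositional transform: divide each sample (column) by its total
toy.rel <- sweep(toy.counts, 2, colSums(toy.counts), "/")
colSums(toy.rel)   # every sample sums to 1
```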

Output of deblur/dada2 will most likely have sequences as row names instead of OTU IDs or taxa names:

```{r core-tax-names}

taxa_names(ps.m3.rel)[1:2]

```

We can change these to ASV IDs:

```{r dna-seq-extraction, message=FALSE, eval=TRUE}

ps.m3.rel <- microbiome::add_refseq(ps.m3.rel)
# Check if ref_seq slot is added to phyloseq object
print(ps.m3.rel)
# now check taxa names are ASVids
taxa_names(ps.m3.rel)[1:3]

```

### Core microbiota analysis

If you only need the names of the core taxa, do as follows. This returns the taxa that exceed the given prevalence and detection thresholds. 


```{r core-members2, message=FALSE, warning=FALSE, eval = TRUE}
core.taxa.standard <- core_members(ps.m3.rel, detection = 0.0001, prevalence = 50/100)

core.taxa.standard
```

We notice that ASV IDs by themselves are not informative in this case. In this phyloseq object, the unclassified taxonomic values follow a pattern like `k__` (for the kingdom level), `p__` (for the phylum level), and so on. We need to change these to NAs.
The pattern can vary depending on your data.

```{r}

# First set the unclassified placeholders (bare rank prefixes) to NA.
tax_table(ps.m3.rel)[tax_table(ps.m3.rel) == "k__"] <- NA
tax_table(ps.m3.rel)[tax_table(ps.m3.rel) == "p__"] <- NA
tax_table(ps.m3.rel)[tax_table(ps.m3.rel) == "c__"] <- NA
tax_table(ps.m3.rel)[tax_table(ps.m3.rel) == "o__"] <- NA
tax_table(ps.m3.rel)[tax_table(ps.m3.rel) == "f__"] <- NA
tax_table(ps.m3.rel)[tax_table(ps.m3.rel) == "g__"] <- NA
tax_table(ps.m3.rel)[tax_table(ps.m3.rel) == "s__"] <- NA

# Then strip the rank prefixes (e.g. "g__") from the classified entries.
tax_table(ps.m3.rel)[, colnames(tax_table(ps.m3.rel))] <- gsub(tax_table(ps.m3.rel)[, colnames(tax_table(ps.m3.rel))],  pattern = "[a-z]__", replacement = "")

# Use the microbiome function add_besthit to get taxonomic identities of ASVs.
ps.m3.rel.f <- microbiome::add_besthit(ps.m3.rel)

# Check 
taxa_names(ps.m3.rel.f)[1:10]

```
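
The same clean-up can be written more compactly with a regular expression. The following base R sketch runs on a toy taxonomy matrix. The values and `toy.*` names are hypothetical, and the last step is only a simplified stand-in for the idea of a "best hit", not the `add_besthit` implementation.

```r
# Toy taxonomy table with rank prefixes; bare prefixes mark unclassified entries
toy.tax <- matrix(c("k__Bacteria", "p__Firmicutes",    "g__Blautia",
                    "k__Bacteria", "p__Bacteroidetes", "g__",
                    "k__Bacteria", "p__",              "g__"),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("ASV1", "ASV2", "ASV3"),
                                  c("Kingdom", "Phylum", "Genus")))

# Entries consisting only of a rank prefix (e.g. "g__") become NA
toy.tax[grepl("^[a-z]__$", toy.tax)] <- NA

# Strip the rank prefix from the remaining entries (NA stays NA)
toy.tax[] <- sub("^[a-z]__", "", toy.tax)

# Lowest classified rank per ASV
toy.best <- apply(toy.tax, 1, function(x) tail(x[!is.na(x)], 1))
toy.best   # ASV1 "Blautia", ASV2 "Bacteroidetes", ASV3 "Bacteria"
```

Anchoring the pattern with `^` and `$` handles all ranks in one pass and avoids touching names that merely contain a similar substring.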

With the best available taxonomic classification added to the taxa names, the core members become more informative:  

```{r}

core.taxa.standard <- core_members(ps.m3.rel.f, detection = 0.0001, prevalence = 50/100)

core.taxa.standard
```


A full phyloseq object of the core microbiota is obtained as follows:

```{r core-data2, message=FALSE, warning=FALSE, eval=TRUE}
pseq.core <- core(ps.m3.rel.f, detection = 0.0001, prevalence = .5)
```


Retrieving the associated taxonomy from the phyloseq object:

```{r core-taxa2, message=FALSE, warning=FALSE, eval=TRUE}
core.taxa <- taxa(pseq.core)
class(core.taxa)
# get the taxonomy data
tax.mat <- tax_table(pseq.core)
tax.df <- as.data.frame(tax.mat)

# add the OTU names as the last column
tax.df$OTU <- rownames(tax.df)

# select the taxonomy of only those OTUs
# that are core members based on the thresholds that were used.
core.taxa.class <- dplyr::filter(tax.df, rownames(tax.df) %in% core.taxa)
knitr::kable(head(core.taxa.class))
```
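
The filtering step can also be done with base R indexing alone. A minimal sketch on a hypothetical taxonomy data frame (the values and `toy.*` names are made up for illustration):

```r
# Toy taxonomy data frame with taxa as row names; hypothetical values
toy.tax.df <- data.frame(Phylum = c("Firmicutes", "Bacteroidetes", "Firmicutes"),
                         Genus  = c("Blautia", "Bacteroides", "Dialister"),
                         row.names = c("ASV1", "ASV2", "ASV3"))

# Hypothetical core member names
toy.core.names <- c("ASV1", "ASV3")

# Keep only the rows whose names are among the core members
toy.core.tax <- toy.tax.df[rownames(toy.tax.df) %in% toy.core.names, , drop = FALSE]
toy.core.tax
```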


## Core visualization

### Core line plots

Determine the core microbiota across a range of abundance (detection) and prevalence
thresholds with the blanket analysis [(Salonen et al. CMI, 2012)](http://onlinelibrary.wiley.com/doi/10.1111/j.1469-0691.2012.03855.x/abstract).

```{r core2b, warning=FALSE, eval=TRUE}
# With compositional (relative) abundances
det <- c(0, 0.1, 0.5, 2, 5, 20)/100
prevalences <- seq(.05, 1, .05)

plot_core(ps.m3.rel.f, prevalences = prevalences, 
          detections = det, plot.type = "lineplot") + 
  xlab("Relative Abundance (%)") + 
  theme_bw()

```

### Core heatmaps

This visualization method has been used for instance in [Intestinal microbiome landscaping: Insight in community assemblage and implications for microbial modulation strategies](https://academic.oup.com/femsre/article/doi/10.1093/femsre/fuw045/2979411/Intestinal-microbiome-landscaping-insight-in#58802539). Shetty et al. _FEMS Microbiology Reviews_ fuw045, 2017.

Note that you can order the taxa on the heatmap with the taxa.order argument.

```{r core-example3, warning=FALSE, eval=TRUE}

# Core with compositionals:
prevalences <- seq(.05, 1, .05)
detections <- round(10^seq(log10(1e-2), log10(.2), length = 10), 3)

#Deletes "ASV" from taxa_names, e.g. ASV1 --> 1
#taxa_names(ps.m3.rel) = taxa_names(ps.m3.rel) %>% str_replace("ASV", "")
# Also define gray color palette
gray <- gray(seq(0,1,length=5))

p1 <- plot_core(ps.m3.rel.f,
  plot.type = "heatmap",
  colours = gray,
  prevalences = prevalences,
  detections = detections, min.prevalence = .5) +
  xlab("Detection Threshold (Relative Abundance (%))")

p1 <- p1 + theme_bw() + ylab("ASVs")
p1
```

Using the viridis color palette:  

```{r core-example3_plot, warning=FALSE, eval=TRUE, fig.width=8}

library(viridis)
print(p1 + scale_fill_viridis())

```

## Genus level 

```{r}
ps.m3.rel.gen <- aggregate_taxa(ps.m3.rel, "Genus")

# Check whether any taxa lack a genus classification;
# aggregate_taxa merges all unclassified taxa into "Unknown"
any(taxa_names(ps.m3.rel.gen) == "Unknown")
# Remove Unknown
ps.m3.rel.gen <- subset_taxa(ps.m3.rel.gen, Genus!="Unknown")
```
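
Aggregating to genus level amounts to summing the abundances of all ASVs that share a genus label. A base R sketch with `rowsum()` follows. The data and `toy.*` names are hypothetical, and this is not the `aggregate_taxa` implementation.

```r
# Toy relative abundances: 3 ASVs (rows) x 2 samples (columns); hypothetical values
toy.abund <- matrix(c(0.2, 0.1,
                      0.3, 0.4,
                      0.5, 0.5),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("ASV1", "ASV2", "ASV3"), c("S1", "S2")))

# Genus label of each ASV; the unclassified one is relabelled "Unknown"
toy.genus <- c("Blautia", "Blautia", NA)
toy.genus[is.na(toy.genus)] <- "Unknown"

# Sum the rows that share a genus label
toy.genus.abund <- rowsum(toy.abund, group = toy.genus)

# Remove the "Unknown" group
toy.genus.abund <- toy.genus.abund[rownames(toy.genus.abund) != "Unknown", ,
                                   drop = FALSE]
toy.genus.abund
```

After removing the "Unknown" group the samples no longer sum to 1, which is worth keeping in mind when reading the detection thresholds.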

```{r fig.width=10, warning=FALSE, eval=TRUE}
library(RColorBrewer)
prevalences <- seq(.05, 1, .05)
# signif() keeps the smallest thresholds distinct; round(x, 3) would collapse them to 0
detections <- signif(10^seq(log10(1e-5), log10(.2), length = 10), 2)

p1 <- plot_core(ps.m3.rel.gen, 
                plot.type = "heatmap", 
                colours = rev(brewer.pal(5, "RdBu")),
                prevalences = prevalences, 
                detections = detections, min.prevalence = .5) +
    xlab("Detection Threshold (Relative Abundance (%))")
p1 <- p1 + theme_bw() + ylab("Genera")
p1

```

Some taxa names are long. Shorten them as follows and plot again:

```{r fig.width=8, warning=FALSE, eval=TRUE}
taxa_names(ps.m3.rel.gen) <- gsub("Bacteria_Firmicutes_Clostridia_Clostridiales_",
                                  "", taxa_names(ps.m3.rel.gen))
p1 <- plot_core(ps.m3.rel.gen, 
                plot.type = "heatmap", 
                colours = rev(brewer.pal(5, "RdBu")),
                prevalences = prevalences, 
                detections = detections, min.prevalence = .5) +
    xlab("Detection Threshold (Relative Abundance (%))") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle=90),
        axis.text.y = element_text(face = "italic"))

p1
```
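
The replacement above removes one specific lineage prefix. A more general alternative, sketched here on hypothetical names, keeps only the text after the last underscore. This assumes that the final rank name itself contains no underscore and that the shortened names stay unique, so it is worth checking both on your own data.

```r
# Hypothetical full-lineage taxa names
toy.names <- c(
  "Bacteria_Firmicutes_Clostridia_Clostridiales_Lachnospiraceae_Blautia",
  "Bacteria_Bacteroidetes_Bacteroidia_Bacteroidales_Bacteroidaceae_Bacteroides"
)

# Greedy match: remove everything up to and including the last underscore
toy.short <- sub("^.*_", "", toy.names)
toy.short   # "Blautia" "Bacteroides"

# Shortened names must remain unique to be usable as taxa names
anyDuplicated(toy.short) == 0   # TRUE
```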