diff --git a/02-intro2R.Rmd b/02-intro2R.Rmd
index 24a015d..b623287 100644
--- a/02-intro2R.Rmd
+++ b/02-intro2R.Rmd
@@ -104,8 +104,8 @@ Visualization is an important part of all data analysis techniques including com
 - Visualization of quantitative assays for given locus in the genome
 
 ## Getting started with R
-Download and install R (http://cran.r-project.org/) and RStudio (http://www.rstudio.com/) if you do not have them already. Rstudio is optional but it is a great tool if you are just starting to learn R.
-You will need specific data sets to run the code snippets in this book; we have explained how to install and use the data in the [Data for the book] section in the [Preface]. If you have not used Rstudio before, we recommend running it and familiarizing yourself with it first. To put it simply, this interface combines multiple features you will need while analyzing data. You can see your code, how it is executed, the plots you make, and your data all in one interface.
+Download and install R (http://cran.r-project.org/) and RStudio (http://www.rstudio.com/) if you do not have them already. RStudio is optional but it is a great tool if you are just starting to learn R.
+You will need specific data sets to run the code snippets in this book; we have explained how to install and use the data in the [Data for the book] section in the [Preface]. If you have not used RStudio before, we recommend running it and familiarizing yourself with it first. To put it simply, this interface combines multiple features you will need while analyzing data. You can see your code, how it is executed, the plots you make, and your data all in one interface.
 
 ### Installing packages
@@ -116,12 +116,12 @@ You can install CRAN packages using `install.packages()` (# is the comment chara
 ```{r installpack,eval=FALSE}
 # install package named "randomForests" from CRAN
 install.packages("randomForests")
 ```
-You can install bioconductor packages with a specific installer script.
+You can install Bioconductor packages with a specific installer script.
 ```{r installpack2,eval=FALSE}
 # get the installer package if you don't have
 install.packages("BiocManager")
-# install bioconductor package "rtracklayer"
+# install Bioconductor package "rtracklayer"
 BiocManager::install("rtracklayer")
 ```
 You can install packages from GitHub using the `install_github()` function from `devtools` package.
@@ -146,7 +146,7 @@ You can also update CRAN and Bioconductor packages.
 # updating CRAN packages
 update.packages()
 
-# updating bioconductor packages
+# updating Bioconductor packages
 if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
 BiocManager::install()
@@ -229,8 +229,8 @@ A matrix refers to a numeric array of rows and columns. You can think of it as a
 x<-c(1,2,3,4)
 y<-c(4,5,6,7)
 m1<-cbind(x,y);m1
 t(m1) # transpose of m1
-dim(m1) # 2 by 5 matrix
+dim(m1) # 4 by 2 matrix
 ```
 You can also directly list the elements and specify the matrix:
 ```{r matrix2}
@@ -379,7 +379,7 @@ We can modify all the plots by providing certain arguments to the plotting func
 hist(x,main="Hello histogram!!!",col="red")
 ```
 
-Next, we will make a scatter plot. Scatter plots are one the most common plots you will encounter in data analysis. We will sample another set of 50 values and plot those against the ones we sampled earlier. The scatter plot shows values of two variables for a set of data points. It is useful to visualize relationships between two variables.
-It is frequently used in connection with correlation and linear regression. There are other variants of scatter plots which show density of the points with different colors. We will show examples of those scatter plots in later chapters. The scatter plot from our sampling experiment is shown in Figure \@ref(fig:makeScatter). Notice that, in addition to `main` argument we used `xlab` and `ylab` arguments to give labels to the plot. You can customize the plots even more than this. See `?plot` and `?par` for more arguments that can help you customize the plots.
+Next, we will make a scatter plot. Scatter plots are one of the most common plots you will encounter in data analysis. We will sample another set of 50 values and plot those against the ones we sampled earlier. The scatter plot shows values of two variables for a set of data points. It is useful to visualize relationships between two variables. It is frequently used in connection with correlation and linear regression. There are other variants of scatter plots which show density of the points with different colors. We will show examples of those scatter plots in later chapters. The scatter plot from our sampling experiment is shown in Figure \@ref(fig:makeScatter). Notice that, in addition to the `main` argument, we used the `xlab` and `ylab` arguments to give labels to the plot. You can customize the plots even more than this. See `?plot` and `?par` for more arguments that can help you customize the plots.
 
 ```{r makeScatter,out.width='50%',fig.width=5, fig.cap="Scatter plot example."}
 # randomly sample 50 points from normal distribution
@@ -595,7 +595,7 @@ The task above is a bit pointless. Normally in a loop, you would want to do some
 the data frame (by subtracting the end coordinate from the start coordinate).\index{R Programming Language!loops}
 
-**Note:**If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with few repetitions first and check the results. This will help you make sure the code in the loop works before executing it thousands of times.
+**Note:** If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with few repetitions first and check the results. This will help you make sure the code in the loop works before executing it thousands of times.
 
 ```{r forloop2}
 # this is where we will keep the lengths
@@ -968,7 +968,7 @@ hist(x1,main="title")
 
 18. Do the same as above but this time with `par(mfrow=c(1,2))`. [Difficulty: **Beginner/Intermediate**]
 
-19. Save your plot using the "Export" button in Rstudio. [Difficulty: **Beginner**]
+19. Save your plot using the "Export" button in RStudio. [Difficulty: **Beginner**]
 
 20. You can make a scatter plot showing the density of points rather than points themselves. If you use points it looks like this:
diff --git a/03-StatsForGenomics.Rmd b/03-StatsForGenomics.Rmd
index a7bfad2..7c86f8b 100644
--- a/03-StatsForGenomics.Rmd
+++ b/03-StatsForGenomics.Rmd
@@ -320,7 +320,7 @@ original sample. Then, we calculate the parameter of interest, in this case the
 repeat this process a large number of times, such as 1000. At this point, we
 would have a distribution of re-sampled means. We can then calculate the
 2.5th and 97.5th percentiles and these will be our so-called 95% confidence interval.
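+In code, the recipe just described might look like the following minimal
+sketch (the sample `x`, the seed, and the chunk label are purely illustrative):
+```{r bootCIsketch,eval=FALSE}
+set.seed(21)
+x <- rnorm(50, mean = 20, sd = 5) # a stand-in sample
+# resample with replacement 1000 times, recording the mean each time
+boot.means <- replicate(1000, mean(sample(x, replace = TRUE)))
+# the 2.5th and 97.5th percentiles form the 95% bootstrap interval
+quantile(boot.means, c(0.025, 0.975))
+```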
 This procedure, resampling with replacement to
-estimate the precision of population parameter estimates, is known as the __bootstrap resampling__ or __bootstraping__.\index{bootstrap resampling}
+estimate the precision of population parameter estimates, is known as __bootstrap resampling__ or __bootstrapping__.\index{bootstrap resampling}
 
 Let's see how we can do this in practice. We simulate a sample coming from
 a normal distribution (but we pretend we don't know the
@@ -761,20 +761,20 @@ dx=rowMeans(data[,group1])-rowMeans(data[,group2])
 
 require(matrixStats)
 
-# get the esimate of pooled variance
+# get the estimate of pooled variance
 stderr = sqrt( (rowVars(data[,group1])*(n1-1) + rowVars(data[,group2])*(n2-1)) / (n1+n2-2) * ( 1/n1 + 1/n2 ))
 
 # do the shrinking towards median
 mod.stderr = (stderr + median(stderr)) / 2 # moderation in variation
 
-# esimate t statistic with moderated variance
+# estimate t statistic with moderated variance
 t.mod <- dx / mod.stderr
 
 # calculate P-value of rejecting null
 p.mod = 2*pt( -abs(t.mod), n1+n2-2 )
 
-# esimate t statistic without moderated variance
+# estimate t statistic without moderated variance
 t = dx / stderr
 
 # calculate P-value of rejecting null
@@ -955,7 +955,7 @@ $$
 \epsilon_1 \\
 \epsilon_2 \\
 \epsilon_3 \\
-\epsilon_0
+\epsilon_4
 \end{array}\right]
 $$
@@ -1550,7 +1550,7 @@ the issue.
 ##### Correlation of explanatory variables
 
 If the explanatory variables are correlated that could lead to something
-known as multicolinearity. When this happens SE estimates of the coefficients will be too large. This is usually observed in time-course
+known as multicollinearity. When this happens, SE estimates of the coefficients will be too large. This is usually observed in time-course
 data.
 
 ##### Correlation of error terms
diff --git a/04-unsupervisedLearning.Rmd b/04-unsupervisedLearning.Rmd
index 0976b7e..55420fb 100644
--- a/04-unsupervisedLearning.Rmd
+++ b/04-unsupervisedLearning.Rmd
@@ -75,7 +75,7 @@ scale(df)
 ```
 
-### Hiearchical clustering
+### Hierarchical clustering
 This is one of the most ubiquitous clustering algorithms. Using this algorithm you can see the relationship of individual data points and relationships of clusters. This is achieved by successively joining small clusters to each other based on the inter-cluster distance. Eventually, you get a tree structure or a dendrogram that shows the relationship between the individual data points and clusters. The height of the dendrogram is the distance between clusters. Here we can show how to use this on our toy data set from four patients. The base function in R to do hierarchical clustering is `hclust()`. Below, we apply that function on Euclidean distances between patients. The resulting clustering tree or dendrogram is shown in Figure \@ref(fig:toyClust).\index{clustering!hierarchical clustering}
 ```{r toyClust,fig.cap="Dendrogram of distance matrix",out.width='50%'}
 d=dist(df)
@@ -175,7 +175,7 @@ type2kmclu = data.frame(
 table(type2kmclu)
 ```
 
-We cannot visualize the clustering from partitioning methods with a tree like we did for hierarchical clustering. Even if we can get the distances between patients the algorithm does not return the distances between clusters out of the box. However, if we had a way to visualize the distances between patients in 2 dimensions we could see the how patients and clusters relate to each other. It turns out that there is a way to compress between patient distances to a 2-dimensional plot. There are many ways to do this, and we introduce these dimension-reduction methods including the one we will use later in this chapter. For now, we are going to use a method called "multi-dimensional scaling" and plot the patients in a 2D plot color coded by their cluster assignments shown in Figure \@ref(fig:kmeansmds). We will explain this method in more detail in the [Multi-dimensional scaling] section below.
+We cannot visualize the clustering from partitioning methods with a tree like we did for hierarchical clustering. Even if we can get the distances between patients, the algorithm does not return the distances between clusters out of the box. However, if we had a way to visualize the distances between patients in 2 dimensions we could see how patients and clusters relate to each other. It turns out that there is a way to compress between-patient distances to a 2-dimensional plot. There are many ways to do this, and we introduce these dimension-reduction methods including the one we will use later in this chapter. For now, we are going to use a method called "multi-dimensional scaling" and plot the patients in a 2D plot color coded by their cluster assignments shown in Figure \@ref(fig:kmeansmds). We will explain this method in more detail in the [Multi-dimensional scaling] section below.
 
 ```{r, kmeansmds,out.width='50%',fig.cap="K-means cluster memberships are shown in a multi-dimensional scaling plot"}
 # Calculate distances
@@ -197,7 +197,7 @@ legend("bottomright",
 The plot we obtained shows the separation between clusters. However, it does not do a great job showing the separation between clusters 3 and 4, which represent CML and "no leukemia" patients. We might need another dimension to properly visualize that separation. In addition, those two clusters were closely related in the hierarchical clustering as well.
 
 ### How to choose "k", the number of clusters
-Up to this point, we have avoided the question of selecting optimal number clusters. How do we know where to cut our dendrogram or which k to choose ?
+Up to this point, we have avoided the question of selecting the optimal number of clusters. How do we know where to cut our dendrogram or which k to choose?
 First of all, this is a difficult question. Usually, clusters have different granularity. Some clusters are tight and compact and some are wide, and both these types of clusters can be in the same data set. When visualized, some large clusters may look like they may have sub-clusters. So should we consider the large cluster as one cluster or should we consider the sub-clusters as individual clusters? There are some metrics to help but there is no definite answer. We will show a couple of them below.
 
 #### Silhouette
@@ -534,7 +534,7 @@ As you might have noticed, we set again a random seed with the `set.seed()` func
 
 __Want to know more?__
 
-- How perplexity affects t-sne, interactive examples: https://distill.pub/2016/misread-tsne/
+- How perplexity affects t-SNE, interactive examples: https://distill.pub/2016/misread-tsne/
 - More on perplexity: https://blog.paperspace.com/dimension-reduction-with-t-sne/
 - Intro to t-SNE: https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
diff --git a/05-supervisedLearning.Rmd b/05-supervisedLearning.Rmd
index 34eaeed..9b4b2cd 100644
--- a/05-supervisedLearning.Rmd
+++ b/05-supervisedLearning.Rmd
@@ -56,7 +56,7 @@ There are many methods to use for supervised learning problems. However, there a
 Many of these steps are identical for different supervised learning methods. Therefore, we will use the [`caret`](http://topepo.github.io/caret/index.html) package to\index{R Packages!\texttt{caret}} perform these steps, which streamlines the steps and provides a similar interface for different supervised learning methods. There are other similar packages, such as [`mlr`](https://mlr.mlr-org.com/), \index{R Packages!\texttt{mlr}}that can provide similar functionality. For now, we will focus on classification models, which is a subset of supervised learning models. In these types of models, we try to predict a categorical response variable, such as if a patient has the disease or not, or what type of disease the patient has based on predictor variables.
 
 ## Use case: Disease subtype from genomics data
-We will start our illustration of machine learning using a real dataset from tumor biopsies. We will use the gene expression data of glioblastoma tumor samples from The Cancer Genome Atlas\index{The Cancer Genome Atlas (TCGA)} project. We will try to predict the subtype of this disease using molecular markers. \index{CpG island}This subtype is characterized by large-scale epigenetic alterations called the "CpG island methylator phenotype" or "CIMP" [@pmid20399149]; half of the patients in our data set have this subtype and the rest do not, and we will try to predict which ones have the CIMP subtype. There two data objects we need for this exercise, one for gene expression values per tumor sample and the other one is subtype annotation per patient. In the expression data set, every row is a patient and every column is a gene expression value\index{gene expression}. There are 184 tumor samples. This data set might be a bit small for real-world applications, however it is very relevant for the genomics focus of this book and the small datasets take less time to train, which is useful for reproducibility purposes. We will read these data sets from the **compGenomRData** package now with the `readRDS()` function.
+We will start our illustration of machine learning using a real dataset from tumor biopsies. We will use the gene expression data of glioblastoma tumor samples from The Cancer Genome Atlas\index{The Cancer Genome Atlas (TCGA)} project. We will try to predict the subtype of this disease using molecular markers. \index{CpG island}This subtype is characterized by large-scale epigenetic alterations called the "CpG island methylator phenotype" or "CIMP" [@pmid20399149]; half of the patients in our data set have this subtype and the rest do not, and we will try to predict which ones have the CIMP subtype. There are two data objects we need for this exercise, one for gene expression values per tumor sample and the other one is subtype annotation per patient. In the expression data set, every row is a patient and every column is a gene expression value\index{gene expression}. There are 184 tumor samples. This data set might be a bit small for real-world applications, however, it is very relevant for the genomics focus of this book and the small datasets take less time to train, which is useful for reproducibility purposes. We will read these data sets from the **compGenomRData** package now with the `readRDS()` function.
 ```{r,readMLdata}
 # get file paths
 fileLGGexp=system.file("extdata",
@@ -114,7 +114,7 @@ tgexp <- t(gexp)
 We can filter predictor variables which have low variation. They are not likely to have any predictive importance since there is not much variation and they will just slow our algorithms. The more variables, the slower the algorithms will be generally. The `caret::preProcess()` function can help filter the predictor variables with near zero variance.
 ```{r nzv,eval=FALSE}
 library(caret)
-# remove near zero variation for the columns at least
+# remove near zero variation for the columns where at least
 # 85% of the values are the same
 # this function creates the filter but doesn't apply it yet
 nzv=preProcess(tgexp,method="nzv",uniqueCut = 15)
@@ -142,7 +142,7 @@ tgexp=predict(processCenter,tgexp)
 
 We will next filter the predictor variables that are highly correlated. You may choose not to do this as some methods can handle correlation between predictor variables.\index{collinearity} However, the fewer predictor variables we have, the faster the model fitting can be done.
 
 ```{r filterCorr}
-# create a filter for removing higly correlated variables
+# create a filter for removing highly correlated variables
 # if two variables are highly correlated only one of them
 # is removed
 corrFilt=preProcess(tgexp, method = "corr",cutoff = 0.9)
@@ -192,7 +192,7 @@ tgexp=merge(patient,tgexp,by="row.names")
 
 rownames(tgexp)=tgexp[,1]
 tgexp=tgexp[,-1]
 ```
-Now that the response variable or the class label is merged with our dataset, we can split it into test and training sets with the `caret::createPartition()` function.
+Now that the response variable or the class label is merged with our dataset, we can split it into test and training sets with the `caret::createDataPartition()` function.
 ```{r datapart}
 set.seed(3031) # set the random number seed for reproducibility
@@ -200,7 +200,7 @@ set.seed(3031) # set the random number seed for reproducibility
 
 # get indices for 70% of the data set
 intrain <- createDataPartition(y = tgexp[,1], p= 0.7)[[1]]
 
-# seperate test and training sets
+# separate test and training sets
 training <- tgexp[intrain,]
 testing <- tgexp[-intrain,]
@@ -294,7 +294,7 @@ Test set accuracy is not as good as the training accuracy, which is usually the
 
 ### Receiver Operating Characteristic (ROC) curves
 
-One important and popular metric when evaluating performance is looking at receiver operating characteristic (ROC) curves.\index{receiver operating characteristic (ROC) curve} The ROC curve is created by evaluating the class probabilities for the model across a continuum of thresholds. Typically, in the case of two-class classification, the methods return a probability for one of the classes. If that probability is higher than $0.5$, you call the label, for example, class A. If less than $0.5$, we call the label class B. However, we can move that threshold and change what we call class A or B. For each candidate threshold, the resulting sensitivity and 1-specificity are plotted against each other. The best possible prediction would result in a point in the upper left corner, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). For the best model, the curve will be almost like a square. Since this is important information, area under the curve (AUC) is \index{area under the curve (AUC)}calculated. This is a quantity between 0 and 1, and the closer to 1, the better the performance of your classifier in terms of sensitivity and specificity. For an uninformative classification model, AUC will be $0.5$. Although, ROC curves are initially designed for two-class problems, later extensions made it possible to use ROC curves for multi-class problems.
+One important and popular metric when evaluating performance is looking at receiver operating characteristic (ROC) curves.\index{receiver operating characteristic (ROC) curve} The ROC curve is created by evaluating the class probabilities for the model across a continuum of thresholds. Typically, in the case of two-class classification, the methods return a probability for one of the classes. If that probability is higher than $0.5$, you call the label, for example, class A. If less than $0.5$, we call the label class B. However, we can move that threshold and change what we call class A or B. For each candidate threshold, the resulting sensitivity and 1-specificity are plotted against each other. The best possible prediction would result in a point in the upper left corner, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). For the best model, the curve will be almost like a square. Since this is important information, the area under the curve (AUC) is \index{area under the curve (AUC)}calculated. This is a quantity between 0 and 1, and the closer to 1, the better the performance of your classifier in terms of sensitivity and specificity. For an uninformative classification model, AUC will be $0.5$. Although ROC curves are initially designed for two-class problems, later extensions made it possible to use ROC curves for multi-class problems.
 
 ROC curves can also be used to determine alternate cutoffs for class probabilities for two-class problems. However, this will always result in a trade-off between sensitivity and specificity. Sometimes it might be desirable to limit the number of false positives because making such mistakes would be too costly for the individual cases. For example, if predicted with a certain disease, you might be recommended to have surgery. However, if your classifier has a relatively high false positive rate, low specificity, you might have surgery for no reason. Typically, you want your classification model to have high specificity and sensitivity, which may not always be possible in the real world. You might have to choose what is more important for a specific problem and try to increase that.
@@ -496,7 +496,7 @@ Although the variable drop-out strategy will still be slow if you have a lot of
 A common hurdle in many applications of machine learning on genomic data is the large class imbalance. The imbalance refers to relative difference in the sizes of the groups being classified. For example, if we had class imbalance in our example data set we could have much more CIMP samples in the training than noCIMP samples, or the other way around. Another example with severe class imbalance would be enhancer prediction [@enhancerImbalance]. Depending on which training data set you use, you can have a couple of hundred to thousands of positive examples for enhancer locations in the human genome. In either case, the negative set, "not enhancer", will overwhelm the training, because the human genome is 3 billion base-pairs long and most of that does not overlap with an enhancer annotation. In whatever strategy you pick to build a negative set, it will contain many more data points than the positive set. As we have mentioned in the model performance section above, if we have a severe imbalance in the class sizes, the training algorithm may get better accuracy just by calling everything one class. This will be evident in specificity and sensitivity metrics, and the related balanced accuracy metric. Below, we will discuss a couple of techniques that might help when the training set has class imbalance.
 
 ### Sampling for class balance
-If we think class imbalance is a problem based on looking at the relative sizes of the classes and relevant accuracy metrics of a model, there are a couple of things that might help. First, we can try sampling or "stratified" sampling when we are constructing our training set. This simply means that before training we can we build the classification model with samples of the data so we have the same size classes. This could be down-sampling the classes with too many data points. For this purpose, you can simply use the `sample()` or `caret::downSample()` function and create your training set prior to modeling. In addition, the minority class could be up-sampled for the missing number of data points using sampling with replacement similar to bootstrap sampling with the `caret::upSample()` function. There are more advanced up-sampling methods such as the synthetic up-sampling method SMOTE [@smote]. In this method, each data point from the minority class is up-sampled synthetically by adding variability to the predictor variable vector from one of the k-nearest neighbors of the data point. Specifically, one neighbor is randomly chosen and the difference between predictor variables of the neighbor and the original data point is added to the original predictor variables after multiplying the difference values with a random number between $0$ and $1$. This creates synthetic data points that are similar to original data points but not identical. This method and other similar methods of synthetic sampling are available at [`smotefamily`](https://cran.r-project.org/web/packages/smotefamily/index.html) package \index{R Packages!\texttt{smotefamily}} in CRAN.
+If we think class imbalance is a problem based on looking at the relative sizes of the classes and relevant accuracy metrics of a model, there are a couple of things that might help. First, we can try sampling or "stratified" sampling when we are constructing our training set. This simply means that before training we can build the classification model with samples of the data so we have the same size classes. This could be down-sampling the classes with too many data points. For this purpose, you can simply use the `sample()` or `caret::downSample()` function and create your training set prior to modeling. In addition, the minority class could be up-sampled for the missing number of data points using sampling with replacement similar to bootstrap sampling with the `caret::upSample()` function. There are more advanced up-sampling methods such as the synthetic up-sampling method SMOTE [@smote]. In this method, each data point from the minority class is up-sampled synthetically by adding variability to the predictor variable vector from one of the k-nearest neighbors of the data point. Specifically, one neighbor is randomly chosen and the difference between predictor variables of the neighbor and the original data point is added to the original predictor variables after multiplying the difference values with a random number between $0$ and $1$. This creates synthetic data points that are similar to original data points but not identical. This method and other similar methods of synthetic sampling are available in the [`smotefamily`](https://cran.r-project.org/web/packages/smotefamily/index.html) package \index{R Packages!\texttt{smotefamily}} on CRAN.
 
 In addition to the strategies above, some methods can do sampling during training to cope with the effects of class imbalance. For example, random forests has a sampling step during training, and this step can be altered to do stratified sampling. We will be introducing random forests later in the chapter.
@@ -711,7 +711,7 @@ trctrl <- trainControl(method = "cv",number=10)
 enetFit <- train(subtype~., data = training,
                  method = "glmnet",
                  trControl=trctrl,
-                 # alpha and lambda paramters to try
+                 # alpha and lambda parameters to try
                  tuneGrid = data.frame(alpha=0.5,
                                        lambda=seq(0.1,0.7,0.05)))
@@ -845,7 +845,7 @@ In a practical sense, the number of nodes in the hidden layer (size) and some re
 
 We will train a simple neural network on our cancer data set. In this simple example, the network architecture is somewhat fixed. We can only choose the number of nodes (denoted by "size") in the hidden layer and a regularization parameter (denoted by "decay"). Increasing the number of nodes in the hidden layer or, in other implementations, increasing the number of hidden layers will help model non-linear relationships but can overfit. One way to combat that is to limit the number of nodes in the hidden layer; another way is to regularize the weights. The decay parameter does just that, it penalizes the loss function by $decay(weights^2)$. In the example below, we try 1 or 2 nodes in the hidden layer in the interest of simplicity and run-time. In addition, we set `decay=0`, which will correspond to not doing any regularization.
 ```{r, nnet, eval=FALSE}
-#svm code here
+#nnet code here
 library(nnet)
 set.seed(17)
@@ -942,7 +942,7 @@ set.seed(18)
 
 par(mfrow=c(1,2))
 
-# we are not going to do any cross-validatin
+# we are not going to do any cross-validation
 # and rely on OOB error
 trctrl <- trainControl(method = "none")
diff --git a/06-genomicIntervals.Rmd b/06-genomicIntervals.Rmd
index 899a63e..56133bb 100644
--- a/06-genomicIntervals.Rmd
+++ b/06-genomicIntervals.Rmd
@@ -181,7 +181,7 @@ information including exon/intron structure of transcripts in a single line. We
 2) **GFF**: GFF format is a tabular text format for genomic features similar to BED. However, it is a more flexible format than BED, which makes it harder to parse at times. Many gene annotation files are in this format.
     - `genomation::gffToGranges()`
-    - `rtracklayer::impot.gff()`
+    - `rtracklayer::import.gff()`
 
 3) **BAM/SAM**: BAM format is a compressed and indexed tabular file format designed for aligned sequencing reads. SAM is the uncompressed version of the BAM file. We will touch upon BAM files in this chapter. The uncompressed SAM file is similar in spirit to a BED file where you have the basic chromosomal location information plus additional columns that are related to the quality of alignment or other relevant information. We will introduce this format in detail later in this chapter.\index{BAM file} \index{SAM file}
@@ -258,7 +258,7 @@ The reads from sequencing machines are usually pre-processed and aligned to the
 
 ### Counting mapped reads for a set of regions
 
-The `Rsamtools` package has functions to query BAM files\index{R Packages!\texttt{Rsamtools}}. The function we will use in the first example is the `countBam()` function, which takes input of the BAM file and param argument. The `param` argument takes a `ScanBamParam` object. The object is instantiated using `ScanBamParam()` and contains parameters for scanning the BAM file. The example below is a simple example where `ScanBamParam()` only includes regions of interest, promoters on chr21.
+The `Rsamtools` package has functions to query BAM files\index{R Packages!\texttt{Rsamtools}}. The function we will use in the first example is the `countBam()` function, which takes as input the BAM file and a `param` argument. The `param` argument takes a `ScanBamParam` object. The object is instantiated using `ScanBamParam()` and contains parameters for scanning the BAM file. The example below is a simple example where `ScanBamParam()` only includes regions of interest, promoters on chr21.
 
 ```{r,countBam}
 promoter.gr=tss.gr
@@ -447,7 +447,7 @@ assays(se)[[1]] # get the first table
 
 #### Subsetting
-Subsetting is easy using `[ ]` notation. This is similar to the way we
+Subsetting is easy using the `[ ]` notation. This is similar to the way we
 subset data frames or matrices.
 ```{r,subsetSe1}
 # subset the first five transcripts and first three samples
@@ -481,7 +481,7 @@ different genomic datasets over that locus. This is similar to looking at the da
 over one of the genome browsers. Below we will display genes, CpG islands and read
 \index{R Packages!\texttt{Gviz}} coverage from a ChIP-seq experiment using the `Gviz` package\index{ChIP-seq}. For the `Gviz` package,
 we first need to set the tracks to display. The tracks can be in various formats. They can be R
-objects such as `IRanges`,`GRanges` and `data.frame`, or they can be in flat file formats
+objects such as `IRanges`, `GRanges` and `data.frame`, or they can be in flat file formats
 such as bigWig, BED, and BAM. After the tracks are set, we can display them with the
 `plotTracks` function, the resulting plot is shown in Figure \@ref(fig:GvizExchp6).
diff --git a/07-Read_Processing.Rmd b/07-Read_Processing.Rmd
index c502202..1633dc5 100644
--- a/07-Read_Processing.Rmd
+++ b/07-Read_Processing.Rmd
@@ -95,7 +95,7 @@ fastqc_install()
 fastqc(fq.dir = folder,qc.dir = "fastqc_results")
 ```
 
-Now that we have run FastQC on our fastq files, we can read the results to R and construct plots or reports. The `gc_report()` function can create an Rmarkdown-based report from FastQC output.
+Now that we have run FastQC on our fastq files, we can read the results into R and construct plots or reports. The `qc_report()` function can create an Rmarkdown-based report from FastQC output.
 ```{r, fastqcr2,eval=FALSE}
 # view the report rendered by R functions
 qc_report(qc.path="fastqc_results",
@@ -112,7 +112,7 @@ qc_plot(qc, "Per base sequence quality")
 ```
 
-Apart from this, the bioconductor packages Rqc [@Rqc] (see `Rqc::rqcReport` function), QuasR [@gaidatzis_quasr:_2015] (see `QuasR::qQCReport` function), systemPipeR [@backman_systempiper:_2016] (see `systemPipeR::seeFastq` function), and ShortRead [@morgan_shortread:_2009] (see `ShortRead::report` function) can all generate quality reports in a similar fashion to FastQC with some differences in plot content and number.
+Apart from this, the Bioconductor packages Rqc [@Rqc] (see `Rqc::rqcReport` function), QuasR [@gaidatzis_quasr:_2015] (see `QuasR::qQCReport` function), systemPipeR [@backman_systempiper:_2016] (see `systemPipeR::seeFastq` function), and ShortRead [@morgan_shortread:_2009] (see `ShortRead::report` function) can all generate quality reports in a similar fashion to FastQC with some differences in plot content and number.
 
 ## Filtering and trimming reads
 Based on the results of the quality check, you may want to trim or filter the reads. The quality check might have shown the number of reads that have low quality scores. These reads will probably not align very well because of the potential mistakes in base calling, or they may align to wrong places in the genome. Therefore, you may want to remove these reads from your fastq file. Another potential scenario is that parts of your reads need to be trimmed in order to align the reads. In some cases, adapters will be present on either side of the read; in other cases technical errors will lead to decreasing base quality towards the ends of the reads. In both these cases, the portion of the read should be trimmed so the read can align, or align better, to the genome. We will show how to use the `QuasR` package to trim the reads. Other packages such as `ShortRead` also have capabilities to trim and filter reads. However, the `QuasR::preprocessReads()` function provides a single interface to multiple preprocessing possibilities. With this function, we match adapter sequences and remove them. We can remove low-complexity reads (reads containing repetitive sequences). We can trim the start or ends of the reads by a pre-defined length.
diff --git a/08-rna-seq-analysis.Rmd b/08-rna-seq-analysis.Rmd
index f863ecc..492dceb 100644
--- a/08-rna-seq-analysis.Rmd
+++ b/08-rna-seq-analysis.Rmd
@@ -189,7 +189,7 @@ selectedGenes <- names(V[order(V, decreasing = T)][1:100])
 
 Now we can quickly produce a heatmap where samples and genes are clustered (see Figure \@ref(fig:tpmhierClust1) ).
 
-```{r tpmhierClust1, fig.cap="Clustering and visualization of the topmost variable genes as a heatmap.",out.width = "50%"}
+```{r tpmhierClust1, fig.cap="Clustering and visualization of the top most variable genes as a heatmap.",out.width = "50%"}
 library(pheatmap)
 pheatmap(tpm[selectedGenes,], scale = 'row', show_rownames = FALSE)
 ```
@@ -395,7 +395,7 @@ ggplot(data = as.data.frame(DEresults), aes(x = pvalue)) +
 
 ##### PCA plot
 A final diagnostic is to check the biological reproducibility of the sample replicates in a PCA plot or a heatmap. To plot the PCA results, we need to extract the normalized counts from the DESeqDataSet object. It is possible to color the points in the scatter plot by the variable of interest, which helps to see if the replicates cluster well (Figure \@ref(fig:DEpca)).
-```{r DEpca, fig.cap='Principle component analysis plot based on top 500 most variable genes.'}
+```{r DEpca, fig.cap='Principal component analysis plot based on the top 500 most variable genes.'}
 library(DESeq2)
 # extract normalized counts from the DESeqDataSet object
 countsNormalized <- DESeq2::counts(dds, normalized = TRUE)
@@ -488,7 +488,7 @@ kable(goResults[order(goResults$p.value),
 
 A gene set is a collection of genes with some common property. This shared property among a set of genes could be a GO term, a common biological pathway, a shared interaction partner, or any biologically relevant commonality that is meaningful in the context of the pursued experiment. Gene set enrichment analysis (GSEA) is a valuable exploratory analysis tool that can associate systematic changes to a high-level function rather than individual genes. Analysis of coordinated changes of expression levels of gene sets can provide complementary benefits on top of per-gene-based differential expression analyses. For instance, consider a gene set belonging to a biological pathway where each member of the pathway displays a slight deregulation in a disease sample compared to a normal sample. In such a case, individual genes might not be picked up by the per-gene-based differential expression analysis. Thus, the GO/Pathway enrichment on the differentially expressed list of genes would not show an enrichment of this pathway. However, the additive effect of slight changes of the genes could amount to a large effect at the level of the gene set, thus the pathway could be detected as a significant pathway that could explain the mechanistic problems in the disease sample.
 
-We use the bioconductor package `gage` [@luo_gage:_2009] \index{R Packages!\texttt{gage}}to demonstrate how to do GSEA using normalized expression data of the samples as input. Here we are using only two gene sets: one from the top GO term discovered from the previous GO analysis, one that we compile by randomly selecting a list of genes. However, annotated gene sets can be used from databases such as MSIGDB [@subramanian_gene_2005], which compile gene sets from a variety of resources such as KEGG [@kanehisa_kegg_2016] and REACTOME [@fabregat_reactome_2018].
+We use the Bioconductor package `gage` [@luo_gage:_2009] \index{R Packages!\texttt{gage}}to demonstrate how to do GSEA using normalized expression data of the samples as input. Here we are using only two gene sets: one from the top GO term discovered from the previous GO analysis, and one that we compile by randomly selecting a list of genes. However, annotated gene sets can be used from databases such as MSIGDB [@subramanian_gene_2005], which compile gene sets from a variety of resources such as KEGG [@kanehisa_kegg_2016] and REACTOME [@fabregat_reactome_2018].
 
 ```{r gsea_setup}
@@ -778,7 +778,7 @@ for(k in 1:4) {
 ```
 
 Based on the separation of case and control samples in the PCA plots in Figure \@ref(fig:ruvsf1),
-we can see that the samples are better separated even at k = 2 when using `RUVs()`. Here, we re-run the `RUVs()` function using k = 2, in order to do more diagnostic plots. We try to pick a value of k that is good enough to distinguish the samples by condition of interest. While setting the value of k to higher values could improve the percentage of explained variation by the first principle component to up to 61%, we try to avoid setting the value unnecessarily high to avoid removing factors that might also correlate with important biological differences between conditions.
+we can see that the samples are better separated even at k = 2 when using `RUVs()`. Here, we re-run the `RUVs()` function using k = 2, in order to do more diagnostic plots. We try to pick a value of k that is good enough to distinguish the samples by condition of interest. While setting the value of k to higher values could improve the percentage of explained variation by the first principal component to up to 61%, we try to avoid setting the value unnecessarily high to avoid removing factors that might also correlate with important biological differences between conditions.
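+One quick check here is the fraction of variance explained by each principal component of the corrected counts; below is a minimal, hedged sketch (it assumes the RUVs-corrected object `set_s` created in the next chunk, the `normCounts()` accessor from `EDASeq`, and an illustrative chunk label):
+```{r ruvPCvarSketch,eval=FALSE}
+# variance explained (%) by each principal component
+pr = prcomp(t(log2(normCounts(set_s) + 1)))
+round(100 * pr$sdev^2 / sum(pr$sdev^2), 1)
+```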
 
 ```{r ruv_s2}
 set_s <- RUVs(set, unique(rownames(set)), k=2, differences) #
 ```
 
-Now let's do diagnostics again: compare the count matrices with or without RUVs processing, comparing RLE plots (Figure \@ref(fig:ruvsf2)) and PCA plots (Figure \@ref(fig:ruvsf3)) to see the effect of RUVg on the normalization and separation of case and control samples.
+Now let's do diagnostics again: compare the count matrices with or without RUVs processing, comparing RLE plots (Figure \@ref(fig:ruvsf2)) and PCA plots (Figure \@ref(fig:ruvsf3)) to see the effect of RUVs on the normalization and separation of case and control samples.
 
 ```{r ruvsf2,fig.width=8, fig.cap='RLE plots to observe the effect of RUVs.'}
 ## compare the initial and processed objects
diff --git a/09-chip-seq-analysis.Rmd b/09-chip-seq-analysis.Rmd
index a6ac25d..e67fadd 100644
--- a/09-chip-seq-analysis.Rmd
+++ b/09-chip-seq-analysis.Rmd
@@ -167,7 +167,7 @@ signal profile [@bonhoure_2014].
 
 The choice of normalization method depends
 on the type of analysis [@angelini_2015]; if we want to quantitatively
 compare the abundance of different histone marks in
-different cell types, we will need the different normalization procedure than if
+different cell types, we will need a different normalization procedure than if
 we want to compare TF binding in the same setting.
@@ -413,7 +413,7 @@ biological replicates; biological replicates should show greater similarity than
 ChIP of different proteins. The second function is to see whether our experiments
 conform to known prior knowledge. For example, we would expect to see greater
 similarity between proteins which belong to the same protein complex.
 
-To quantify the ChIP signal we will firstly construct 1-kilobase-wide tilling
+To quantify the ChIP signal we will firstly construct 1-kilobase-wide tiling
 windows over the genome, and subsequently count the number of reads in each
 window, for each experiment. We will then normalize the counts, to account
 for a different total number of reads in each experiment, and finally
@@ -457,20 +457,20 @@ seqlengths = with(hg_chrs, setNames(size, chrom))
 
 Then we construct the windows.
 
-```{r sample-clustering.tilling_window}
+```{r sample-clustering.tiling_window}
 # load the genomic ranges package
 library(GenomicRanges)
 
 # tileGenome function returns a list of GRanges of a given width,
 # spanning the whole chromosome
-tilling_window = tileGenome(seqlengths, tilewidth=1000)
+tiling_window = tileGenome(seqlengths, tilewidth=1000)
 
 # unlist converts the list to one GRanges object
-tilling_window = unlist(tilling_window)
+tiling_window = unlist(tiling_window)
 ```
 
-```{r sample-clustering.show_tilling_window, include=TRUE, echo=FALSE, eval=TRUE}
-tilling_window
+```{r sample-clustering.show_tiling_window, include=TRUE, echo=FALSE, eval=TRUE}
+tiling_window
 ```
 
 We will use the `summarizeOverlaps()` function from the `GenomicAlignments` package
@@ -492,7 +492,7 @@ bam_files = list.files(
 )
 
 # use summarizeOverlaps to count the reads
-so = summarizeOverlaps(tilling_window, bam_files)
+so = summarizeOverlaps(tiling_window, bam_files)
 
 # extract the counts from the SummarizedExperiment
 counts = assays(so)[[1]]
@@ -503,7 +503,7 @@ Different ChIP experiments were sequenced to different depths; each experiment
 contains a different number of reads. To remove the effect of the experimental
 depth on the quantification, the samples need to be normalized.
 The standard normalization procedure, for ChIP data, is to divide the
-counts in each tilling window by the total number of sequenced reads, and
+counts in each tiling window by the total number of sequenced reads, and
 multiply it by a constant factor (to avoid extremely small numbers). This
 normalization procedure is called the **cpm**\index{cpm} - counts per million.
@@ -724,7 +724,7 @@ export.bw(cov, 'output_file')
 
 #### Visualization of track data using Gviz
 
-We can create genome browserlike visualizations using the `Gviz` package,
+We can create genome browser-like visualizations using the `Gviz` package,
 which was introduced in Chapter \@ref(genomicIntervals). `Gviz` is a tool
 which enables exhaustive customized visualization of genomics experiments.
 The basic usage principle is to define tracks, where each track can represent
@@ -906,7 +906,7 @@ different sequence composition is to look at whether regions with differing GC
 composition were equally enriched in all experiments.
 
 We will do the following: Firstly we will calculate the GC content of each
-of the tilling windows, and then we will compare the GC content with the corresponding
+of the tiling windows, and then we will compare the GC content with the corresponding
 cpm\index{cpm} (count per million reads) value, for each tile.
@@ -922,7 +922,7 @@ seqlengths = with(hg_chrs, setNames(size, chrom))
 
 # tileGenome produces a list per chromosome
 # unlist combines the elements of the list
 # into one GRanges object
-tilling_window = unlist(tileGenome(
+tiling_window = unlist(tileGenome(
     seqlengths = seqlengths,
     tilewidth  = 1000
 ))
@@ -932,7 +932,7 @@ We will extract the sequence information from the `BSgenome.Hsapiens.UCSC.hg38`
 package. `BSgenome` objects are generic Bioconductor containers for genomic sequences.
 Sequences are extracted from the `BSgenome` container using the `getSeq()` function.
 The `getSeq()` function takes as input the genome object, and the ranges with the
-regions of interest; in our case, the tilling windows.
+regions of interest; in our case, the tiling windows.
 The function returns a `DNAString` object.
@@ -941,7 +941,7 @@ The function returns a `DNAString` object.
 
 library(BSgenome.Hsapiens.UCSC.hg38)
 
 # extracts the sequence from the human genome
-seq = getSeq(BSgenome.Hsapiens.UCSC.hg38, tilling_window)
+seq = getSeq(BSgenome.Hsapiens.UCSC.hg38, tiling_window)
 ```
@@ -949,8 +949,8 @@ To calculate the GC content, we will use the `oligonucleotideFrequency()` functi
 `DNAString` object. By setting the width parameter to 2 we will calculate the
 **dinucleotide** frequency.
 Each row in the resulting table will contain the number of all possible
-dinucleotides observed in each tilling window.
-Because we have tilling windows of the same length, we do not
+dinucleotides observed in each tiling window.
+Because we have tiling windows of the same length, we do not
 necessarily need to normalize the counts by the window length.
 If all of the windows have different lengths (i.e. when at the ChIP-seq peaks), then normalization is a prerequisite.
@@ -972,9 +972,9 @@ We will convert the cpm values to the log10 scale. To avoid taking the $log(0)$,
 we add a pseudo count of 1 to cpm.
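+Spelled out, that transformation is just the following (a sketch for a generic windows-by-samples count matrix `counts`; the chunk label is illustrative):
+```{r cpmSketch,eval=FALSE}
+# counts per million: scale each column by its library size
+cpm = t(t(counts) / colSums(counts)) * 10^6
+# the pseudo count of 1 avoids taking log10(0)
+log.cpm = log10(cpm + 1)
+```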
 
 ```{r gc.cpm, cache=TRUE, warning=FALSE}
-# counts the number of reads per tilling window
+# counts the number of reads per tiling window
 # for each experiment
-so = summarizeOverlaps(tilling_window, bam_files)
+so = summarizeOverlaps(tiling_window, bam_files)
 
 # converts the raw counts to cpm values
 counts = assays(so)[[1]]
@@ -1550,11 +1550,11 @@ hg_chrs = subset(hg_chrs, grepl('chr21$',chrom))
 seqlengths = with(hg_chrs, setNames(size, chrom))
 
 # define the windows
-tilling_window = unlist(tileGenome(seqlengths, tilewidth=1000))
+tiling_window = unlist(tileGenome(seqlengths, tilewidth=1000))
 
 # count the reads
 counts = summarizeOverlaps(
-    features = tilling_window,
+    features = tiling_window,
     reads = c(chip_file, control_file)
 )
@@ -1595,7 +1595,7 @@ higher enrichment in the ChIP samples, while the regions below the diagonal show
 higher enrichment in the Input samples.
 
-Let us now perform for peak calling. `normR` usage is deceivingly simple; we need to provide the location ChIP and Control read files, and the genome version to the `enrichR()` function. The function will automatically create tilling windows (250bp by default), count the number of reads in each window, and fit a mixture of binomial distributions.
+Let us now perform peak calling. `normR` usage is deceptively simple; we need to provide the locations of the ChIP and Control read files, and the genome version to the `enrichR()` function. The function will automatically create tiling windows (250bp by default), count the number of reads in each window, and fit a mixture of binomial distributions.
 
 ```{r, peak-calling.sharp.peak-calling, message=FALSE, warning=FALSE}
 library(normr)
@@ -1673,8 +1673,8 @@ write.table(ctcf_peaks, file.path(data_path, 'CTCF_peaks.txt'),
 
 We can now repeat the CTCF versus Input plot, and label significantly marked peaks. Using the count overlaps, we mark which of our 1-kb regions contained significant peaks.
 
 ```{r peak-calling.sharp.peak-calling.countOvlaps}
-# find enriched tilling windows
-enriched_regions = countOverlaps(tilling_window, ctcf_peaks) > 0
+# find enriched tiling windows
+enriched_regions = countOverlaps(tiling_window, ctcf_peaks) > 0
 ```
@@ -1830,8 +1830,8 @@ control_file = file.path(data_path, 'GM12878_hg38_Input_r5.chr21.bam')
 ```
 
 Because H3K36 regions span broad domains, it is necessary to increase the
-tilling window size which will be used for counting.
-Using the `countConfiguration()` function, we will set the tilling window size
+tiling window size which will be used for counting.
+Using the `countConfiguration()` function, we will set the tiling window size
 to 5000 base pairs.
 
 ```{r peak-calling.broad.config}
@@ -1962,14 +1962,14 @@ to our peak regions.
 
 #### Percentage of reads in peaks
 
 To calculate the reads in peaks, we will firstly extract the number of reads
-in each tilling window from the `normR` produced fit object.
+in each tiling window from the `normR`-produced fit object.
 This is done using the `getCounts()` function.
-We will then use the q-value to define which tilling windows correspond
+We will then use the q-value to define which tiling windows correspond
 to peaks, and count the number of reads within and outside peaks.
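+That final tally reduces to a one-liner; a sketch, assuming a numeric vector `window_counts` of per-window read counts and a logical vector `in_peak` marking the windows that passed the q-value cutoff (both names are illustrative):
+```{r fripSketch,eval=FALSE}
+# percentage of reads that fall inside peak windows
+100 * sum(window_counts[in_peak]) / sum(window_counts)
+```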
```{r peak-quality.counts} -# extract, per tilling window, counts from the fit object +# extract, per tiling window, counts from the fit object h3k36_counts = data.frame(getCounts(h3k36_fit)) # change the column names of the data.frame @@ -2026,7 +2026,7 @@ ggplot( axis.title = element_text(size=12,face="bold"), plot.title = element_text(hjust = 0.5)) + xlab('Experiment') + - ylab('Percetage of reads in region') + + ylab('Percentage of reads in region') + ggtitle('Percentage of reads in peaks for H3K36me3') + scale_fill_manual(values=c('gray','red')) ``` @@ -2249,7 +2249,7 @@ ggplot( axis.title = element_text(size=14,face="bold"), plot.title = element_text(hjust = 0.5)) + xlab('Peak rank') + - ylab('Percetage of peaks with motif') + + ylab('Percentage of peaks with motif') + ggtitle('Percentage of CTCF peaks with the CTCF motif') ``` @@ -2496,7 +2496,7 @@ ctcf_peaks = head(ctcf_peaks, n = 500) ctcf_peaks = reduce(ctcf_peaks) ``` -Create a region of $+/-$ 50 bp around the center of the peaks, +Create a region of $+/-$ 25 bp around the center of the peaks, ```{r motif-discovery.resize} # expand the CTCF peaks diff --git a/10-bs-seq-analysis.Rmd b/10-bs-seq-analysis.Rmd index f8b50f9..ba328aa 100644 --- a/10-bs-seq-analysis.Rmd +++ b/10-bs-seq-analysis.Rmd @@ -18,7 +18,7 @@ The epigenome consists of chemical modifications of DNA and histones. These modi ## What is DNA methylation? Cytosine methylation (5-methylcytosine, 5mC) is one of the main covalent base modifications in eukaryotic genomes, generally observed on CpG dinucleotides. Methylation can also rarely occur in a non-CpG context, but this was mainly observed in human embryonic stem and neuronal cells [@Lister2009-sd; @Lister2013-vs]. DNA methylation is a part of the epigenetic regulation mechanism of gene expression. It is cell-type-specific DNA modification. \index{DNA methylation}It is reversible but mostly remains stable through cell division. There are roughly 28 million CpGs in the human genome, 60–80% are generally methylated. Less than 10% of CpGs occur in CG-dense regions that are termed CpG islands in the human genome [@Smith2013-jh]. It has been demonstrated that DNA methylation is also not uniformly distributed over the genome, but rather is associated with CpG density. In vertebrate genomes, cytosine bases are usually unmethylated in CpG-rich regions such as CpG islands and tend to be methylated in CpG-deficient regions. Vertebrate genomes are largely CpG deficient except at CpG islands. Conversely, invertebrates such as _Drosophila melanogaster_ and _Caenorhabditis elegans_ do not exhibit cytosine methylation and consequently do not have CpG rich and poor regions but rather a steady CpG frequency over their genomes [@Deaton2011-pm]. -### How DNA methylation is set ? +### How is DNA methylation set ? DNA methylation is established by DNA methyltransferases DNMT3A and DNMT3B in combination with DNMT3L and maintained through cell division by the methyltransferase DNMT1 and associated proteins. DNMT3a and DNMT3b are in charge of the de novo methylation during early development. Loss of 5mC can be achieved passively by dilution during replication or exclusion of DNMT1 from the nucleus. Recent discoveries of the ten-eleven translocation (TET) family of proteins and their ability to convert 5-methylcytosine (5mC) into 5-hydroxymethylcytosine (5hmC) in vertebrates provide a path for catalyzed active DNA demethylation [@Tahiliani2009-ar]. 
Iterative oxidations of 5hmC catalyzed by TET result in 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC). 5caC mark is excised from DNA by G/T mismatch-specific thymine-DNA glycosylase (TDG), which as a result reverts cytosine residue to its unmodified state [@He2011-pw]. Apart from these, mainly bacteria, but possibly higher eukaryotes, contain base modifications on bases other than cytosine, such as methylated adenine or guanine [@Clark2011-sc]. ### How to measure DNA methylation with bisulfite sequencing @@ -46,12 +46,12 @@ For the remainder of this chapter, we will explain how to do DNA methylation ana ## Processing raw data and getting data into R The rawest form of data that most users get is probably in the form of fastq files obtained from the sequencing experiments. We will describe the necessary steps and the tools that can be used for raw data processing and if they exist, we will mention their R equivalents. However, the data processing is usually done outside of the R framework, and for the following sections we will assume that the data processing is done and our analysis is starting from methylation call files. -The typical data processing step starts with a data quality check. The fastq files are first run through quality check software that shows the quality of the sequencing run. We would typically use [fastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for this. However, there are several bioconductor packages that could be of use, such as [`Rqc`](https://bioconductor.org/packages/release/bioc/html/Rqc.html) and [`QuasR`](https://bioconductor.org/packages/release/bioc/html/QuasR.html). We have introduced how to use some of these tools for sequencing quality check in Chapter \@ref(processingReads). Following the quality check, provided everything is OK, the reads can be aligned to the genome. Before the alignment, adapters or low-quality ends of the reads can be trimmed to increase number of alignments. Low-quality ends mostly likely have poor basecalls, which will lead to many mismatches. Reads with non-trimmed adapters will also not align to the genome. We would use adapter trimming tools such as [cutadapt](https://cutadapt.readthedocs.io/en/stable/) or [flexbar](https://github.com/seqan/flexbar) for this purpose, although there are a bunch of them to choose from. Following this, reads are aligned to the genome with a bisulfite-treatment-aware aligner. For our own purposes, we use Bismark[@Krueger2011-vv], however there are other equally accurate aligners, and some are reviewed [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378906/). In addition, the Bioconductor package [`QuasR`](https://bioconductor.org/packages/release/bioc/html/QuasR.html) can align BS-seq reads within the R framework. +The typical data processing step starts with a data quality check. The fastq files are first run through quality check software that shows the quality of the sequencing run. We would typically use [fastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for this. However, there are several Bioconductor packages that could be of use, such as [`Rqc`](https://bioconductor.org/packages/release/bioc/html/Rqc.html) and [`QuasR`](https://bioconductor.org/packages/release/bioc/html/QuasR.html). We have introduced how to use some of these tools for sequencing quality check in Chapter \@ref(processingReads). Following the quality check, provided everything is OK, the reads can be aligned to the genome. 
After alignment, we need to call C->T conversions and calculate the fraction/percentage of methylation. Most of the time, aligners come with auxiliary tools that calculate per-base methylation values. Normally, they output a tabular format containing the location of the Cs, the methylation value, and the strand. Within R, `QuasR`\index{R Packages!\texttt{QuasR}} and `methylKit`\index{R Packages!\texttt{methylKit}} can call methylation values from BAM files, albeit with some limitations. In essence, these methylation call files can be easily read into R, and downstream analysis within R starts from that point. An important quality measure at this stage is to look at the conversion rate. This simply means how many unmethylated Cs are converted to Ts. Since we expect non-CpG methylation to be rare, we can simply count the number of C->T conversions in the non-CpG context and calculate the conversion rate. The best way to do this would be via spike-in sequences where we expect no methylation at all. Since non-CpG methylation is tissue-specific, calculating the conversion rate using non-CpG Cs might be misleading in some cases.

## Data filtering and exploratory analysis
-We assume that we start the analysis in R with the methylation call files. We will read those files in and carry out exploratory analysis, and we will show how to filter bases or regions from the data and in what circumstances we might need to do so. We will use the [methylKit](https://bioconductor.org/packages/release/bioc/html/methylKit.html)[@Akalin2012-af] package for the bulk of the analysis. \index{R Packages!\texttt{methylKit}}
+We assume that we start the analysis in R with the methylation call files. We will read those files in and carry out exploratory analysis, and we will show how to filter bases or regions from the data and in what circumstances we might need to do so. We will use the [methylKit](https://bioconductor.org/packages/release/bioc/html/methylKit.html) [@Akalin2012-af] package for the bulk of the analysis. \index{R Packages!\texttt{methylKit}}

### Reading methylation call files
A typical methylation call file looks like this:
@@ -91,7 +91,7 @@ myobj=methRead(file.list,
)
```

-Tab-separated bedgraph like formats from Bismark methylation caller can also be read in by methylkit. In those cases, we have to provide either `pipeline="bismarkCoverage"` or `pipeline="bismarkCytosineReport"` to the `methRead()` function. In addition to the options we mentioned above,
+Tab-separated bedGraph-like formats from the Bismark methylation caller can also be read in by methylKit. In those cases, we have to provide either `pipeline="bismarkCoverage"` or `pipeline="bismarkCytosineReport"` to the `methRead()` function. In addition to the options we mentioned above,
any tab-separated text file with a generic format can be read in using methylKit,
such as methylation ratio files from [BSMAP](http://code.google.com/p/bsmap/).
See [here](http://zvfak.blogspot.com/2012/10/how-to-read-bsmap-methylation-ratio.html) for an example.
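+For instance, a minimal sketch for a single Bismark coverage file; the file name `test1.bismark.cov` is a placeholder, and the assembly should match your own data:
+
+```{r readBismarkCov, eval=FALSE}
+# pipeline="bismarkCoverage" tells methRead() to expect Bismark's
+# chr/start/end/%methylation/count-methylated/count-unmethylated columns
+my.cov = methRead("test1.bismark.cov",
+                  sample.id = "test1",
+                  assembly = "hg18",
+                  pipeline = "bismarkCoverage")
+```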
@@ -112,7 +112,7 @@ distributions. Strong deviations from the bimodality may be due to poor experime
getMethylationStats(myobj[[2]],plot=TRUE,both.strands=FALSE)
```

-In addition, we might want to see coverage values. By default, methylkit handles bases with at least 10X coverage but that can be changed. The bases with unusually high coverage are usually alarming. It might indicate a PCR bias issue in the experimental procedure. The general coverage statistics can be checked with the
+In addition, we might want to see coverage values. By default, methylKit handles bases with at least 10X coverage, but that can be changed. Bases with unusually high coverage are usually alarming, as they might indicate a PCR bias issue in the experimental procedure. The general coverage statistics can be checked with the
`getCoverageStats()` function shown below. The resulting plot is shown in Figure \@ref(fig:coverageStats).

```{r coverageStats,fig.cap="Histogram for log10 read counts per CpG."}
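+# a minimal follow-up sketch with illustrative cutoffs, assuming the
+# myobj object from above: filterByCoverage() keeps bases covered by
+# at least 10 reads and drops the top 0.1% most-covered bases, which
+# guards against PCR bias
+filtered.myobj = filterByCoverage(myobj,
+                                  lo.count = 10, lo.perc = NULL,
+                                  hi.count = NULL, hi.perc = 99.9)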
@@ -219,7 +219,7 @@ When analyzing DNA methylation data, we usually look for regions that are differ
Once methylation proportions per base are obtained, the differences between methylation profiles are generally considered next. When there are multiple sample groups where each group defines a separate biological entity or treatment, it is usually of interest to locate bases or regions with different methylation proportions across the sample groups. The bases or regions with different methylation proportions across samples are called differentially methylated CpG sites (DMCs) and differentially methylated regions (DMRs). They have been shown to play a role in many different diseases due to their association with epigenetic control of gene regulation. In addition, DNA methylation profiles can be highly tissue-specific due to their role in gene regulation [@Schubeler2015-ai]. DNA methylation is highly informative when studying normal and diseased cells, because it can also act as a biomarker. For example, the presence of large-scale abnormally methylated genomic regions is a hallmark feature of many types of cancers [@Ehrlich2002-hv]. Because of the aforementioned reasons, investigating differential methylation is usually one of the primary goals of doing bisulfite sequencing.

#### Fisher's exact test
-Differential DNA methylation is usually calculated by comparing the proportion of methylated Cs in a test sample relative to a control. In simple comparisons between such pairs of samples (i.e. test and control), methods such as Fisher’s exact test can be used. If there are replicates, replicates can be pooled within groups to a single sample per group. This strategy, however, does not take into account biological variability between replicates. We will now show how to compare pairs of samples via the `calculateDiffMeth()` function in `methylKit`. When there is only one sample per sample group, `calculateDiffMeth()` automatically applies Fisher's exact test. We will now extract one sample from each group and run `calculateDiffMeth()`, which will automatically run Fisher's exact test.
+Differential DNA methylation is usually calculated by comparing the proportion of methylated Cs in a test sample relative to a control. In simple comparisons between such pairs of samples (i.e., test and control), methods such as Fisher’s exact test can be used. If there are replicates, they can be pooled within groups to a single sample per group. This strategy, however, does not take into account biological variability between replicates. We will now show how to compare pairs of samples via the `calculateDiffMeth()` function in `methylKit`. When there is only one sample per group, `calculateDiffMeth()` automatically applies Fisher's exact test. Below, we extract one sample from each group and run `calculateDiffMeth()`, which will therefore apply Fisher's exact test.

```{r fishers,eval=FALSE}
getSampleID(meth)
new.meth=reorganize(meth,sample.ids=c("test1","ctrl1"),treatment=c(1,0))
@@ -232,7 +232,7 @@ dm.pooledf=calculateDiffMeth(pooled.meth)
```

-The `calculateDiffMeth()` function returns the P-values for all bases or regions in the input methylBase object. We need to filter to get differentially methylated CpGs. This can be done via the `getMethlyDiff()` function or simple filtering via `[ ]` notation. Below we show how to filter the `methylDiff` object output by the `calculateDiffMeth()` function in order to get differentially methylated CpGs. The function arguments define cutoff values for the methylation difference between groups and q-value. In these cases, we require a methylation difference of 25% and a q-value of at least $0.01$.
+The `calculateDiffMeth()` function returns the P-values for all bases or regions in the input `methylBase` object. We need to filter to get differentially methylated CpGs. This can be done via the `getMethylDiff()` function or simple filtering via `[ ]` notation. Below we show how to filter the `methylDiff` object output by the `calculateDiffMeth()` function in order to get differentially methylated CpGs. The function arguments define cutoff values for the methylation difference between groups and the q-value. In this case, we require a methylation difference of at least 25% and a q-value of at most $0.01$.

```{r filter}
# get differentially methylated bases/regions with specific cutoffs