Merge pull request #57 from federicomarini/master
Some typos fixed, plus dim of matrix + adding a missing word.
al2na authored Sep 9, 2024
2 parents 08d3028 + fcfeea6 commit 2121198
Showing 9 changed files with 79 additions and 79 deletions.
20 changes: 10 additions & 10 deletions 02-intro2R.Rmd
@@ -104,8 +104,8 @@ Visualization is an important part of all data analysis techniques including com
- Visualization of quantitative assays for a given locus in the genome

## Getting started with R
-Download and install R (http://cran.r-project.org/) and RStudio (http://www.rstudio.com/) if you do not have them already. Rstudio is optional but it is a great tool if you are just starting to learn R.
-You will need specific data sets to run the code snippets in this book; we have explained how to install and use the data in the [Data for the book] section in the [Preface]. If you have not used Rstudio before, we recommend running it and familiarizing yourself with it first. To put it simply, this interface combines multiple features you will need while analyzing data. You can see your code, how it is executed, the plots you make, and your data all in one interface.
+Download and install R (http://cran.r-project.org/) and RStudio (http://www.rstudio.com/) if you do not have them already. RStudio is optional but it is a great tool if you are just starting to learn R.
+You will need specific data sets to run the code snippets in this book; we have explained how to install and use the data in the [Data for the book] section in the [Preface]. If you have not used RStudio before, we recommend running it and familiarizing yourself with it first. To put it simply, this interface combines multiple features you will need while analyzing data. You can see your code, how it is executed, the plots you make, and your data all in one interface.


### Installing packages
@@ -116,12 +116,12 @@ You can install CRAN packages using `install.packages()` (# is the comment chara
# install the package named "randomForest" from CRAN
install.packages("randomForest")
```
-You can install bioconductor packages with a specific installer script.
+You can install Bioconductor packages with a specific installer script.
```{r installpack2,eval=FALSE}
# get the installer package if you don't have it
install.packages("BiocManager")
-# install bioconductor package "rtracklayer"
+# install Bioconductor package "rtracklayer"
BiocManager::install("rtracklayer")
```
You can install packages from GitHub using the `install_github()` function from the `devtools` package.
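
In its simplest form the call is a one-liner; here is a minimal sketch, where `"user/repo"` is a placeholder slug rather than a real repository:

```{r installGithubSketch,eval=FALSE}
# get devtools first if it is not installed
install.packages("devtools")
# "user/repo" is a placeholder: GitHub owner name, then repository name
devtools::install_github("user/repo")
```
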
@@ -146,7 +146,7 @@ You can also update CRAN and Bioconductor packages.
# updating CRAN packages
update.packages()
-# updating bioconductor packages
+# updating Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
@@ -229,8 +229,8 @@ A matrix refers to a numeric array of rows and columns. You can think of it as a
x<-c(1,2,3,4)
y<-c(4,5,6,7)
m1<-cbind(x,y);m1
-t(m1) # transposed of m1
-dim(m1) # 2 by 5 matrix
+t(m1) # transpose of m1
+dim(m1) # 4 by 2 matrix
```
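
The corrected `dim()` comment is easy to verify in a fresh session; a minimal check mirroring the chunk above:

```{r dimCheckSketch,eval=FALSE}
m1 <- cbind(c(1,2,3,4), c(4,5,6,7))
dim(m1)    # c(4, 2): four rows, two columns
dim(t(m1)) # c(2, 4): transposing swaps rows and columns
```
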
You can also directly list the elements and specify the matrix:
```{r matrix2}
@@ -379,7 +379,7 @@ We can modify all the plots by providing certain arguments to the plotting func
hist(x,main="Hello histogram!!!",col="red")
```

-Next, we will make a scatter plot. Scatter plots are one the most common plots you will encounter in data analysis. We will sample another set of 50 values and plot those against the ones we sampled earlier. The scatter plot shows values of two variables for a set of data points. It is useful to visualize relationships between two variables. It is frequently used in connection with correlation and linear regression. There are other variants of scatter plots which show density of the points with different colors. We will show examples of those scatter plots in later chapters. The scatter plot from our sampling experiment is shown in Figure \@ref(fig:makeScatter). Notice that, in addition to `main` argument we used `xlab` and `ylab` arguments to give labels to the plot. You can customize the plots even more than this. See `?plot` and `?par` for more arguments that can help you customize the plots.
+Next, we will make a scatter plot. Scatter plots are one of the most common plots you will encounter in data analysis. We will sample another set of 50 values and plot those against the ones we sampled earlier. The scatter plot shows values of two variables for a set of data points. It is useful to visualize relationships between two variables. It is frequently used in connection with correlation and linear regression. There are other variants of scatter plots which show density of the points with different colors. We will show examples of those scatter plots in later chapters. The scatter plot from our sampling experiment is shown in Figure \@ref(fig:makeScatter). Notice that, in addition to `main` argument we used `xlab` and `ylab` arguments to give labels to the plot. You can customize the plots even more than this. See `?plot` and `?par` for more arguments that can help you customize the plots.

```{r makeScatter,out.width='50%',fig.width=5, fig.cap="Scatter plot example."}
# randomly sample 50 points from normal distribution
@@ -595,7 +595,7 @@ The task above is a bit pointless. Normally in a loop, you would want to do some
the data frame (by subtracting the start coordinate from the end coordinate).\index{R Programming Language!loops}


-**Note:**If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with few repetitions first and check the results. This will help you make sure the code in the loop works before executing it thousands of times.
+**Note:** If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with few repetitions first and check the results. This will help you make sure the code in the loop works before executing it thousands of times.
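
Following that advice, a dry run with a handful of toy intervals might look like the sketch below; the data frame and its `start`/`end` column names are assumed for illustration:

```{r forloopDryRunSketch,eval=FALSE}
# three toy intervals, small enough to check the result by eye
toy <- data.frame(start = c(100, 400, 900), end = c(250, 700, 1000))
lens <- numeric(nrow(toy))             # pre-allocate the result vector
for (i in 1:nrow(toy)) {
  lens[i] <- toy$end[i] - toy$start[i] # interval length: end minus start
}
lens # expect 150, 300, 100
```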

```{r forloop2}
# this is where we will keep the lengths
@@ -968,7 +968,7 @@ hist(x1,main="title")
18. Do the same as above but this time with `par(mfrow=c(1,2))`. [Difficulty: **Beginner/Intermediate**]


-19. Save your plot using the "Export" button in Rstudio. [Difficulty: **Beginner**]
+19. Save your plot using the "Export" button in RStudio. [Difficulty: **Beginner**]

20. You can make a scatter plot showing the density
of points rather than points themselves. If you use points it looks like this:
12 changes: 6 additions & 6 deletions 03-StatsForGenomics.Rmd
@@ -320,7 +320,7 @@ original sample. Then, we calculate the parameter of interest, in this case the
repeat this process a large number of times, such as 1000. At this point, we would have a distribution of re-sampled
means. We can then calculate the 2.5th and 97.5th percentiles and these will
be our so-called 95% confidence interval. This procedure, resampling with replacement to
-estimate the precision of population parameter estimates, is known as the __bootstrap resampling__ or __bootstraping__.\index{bootstrap resampling}
+estimate the precision of population parameter estimates, is known as the __bootstrap resampling__ or __bootstrapping__.\index{bootstrap resampling}
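
The recipe is mechanical enough to fit in a few lines. A minimal sketch, where the sample, its distribution parameters, and the number of bootstrap replicates are all assumed for illustration:

```{r bootstrapSketch,eval=FALSE}
set.seed(21)
x <- rnorm(50, mean = 20, sd = 5) # the "original sample" (parameters assumed)
# resample with replacement 1000 times, recording the mean each time
boot.means <- replicate(1000, mean(sample(x, replace = TRUE)))
# the 2.5th and 97.5th percentiles give a 95% bootstrap confidence interval
quantile(boot.means, probs = c(0.025, 0.975))
```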

Let's see how we can do this in practice. We simulate a sample
coming from a normal distribution (but we pretend we don't know the
@@ -761,20 +761,20 @@ dx=rowMeans(data[,group1])-rowMeans(data[,group2])
require(matrixStats)
-# get the esimate of pooled variance
+# get the estimate of pooled variance
stderr = sqrt( (rowVars(data[,group1])*(n1-1) +
rowVars(data[,group2])*(n2-1)) / (n1+n2-2) * ( 1/n1 + 1/n2 ))
# do the shrinking towards median
mod.stderr = (stderr + median(stderr)) / 2 # moderation in variation
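# (shrinking toward the median pulls extreme per-gene variance estimates
#  toward the bulk, which stabilizes t-statistics when per-group sample
#  sizes are small)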
-# esimate t statistic with moderated variance
+# estimate t statistic with moderated variance
t.mod <- dx / mod.stderr
# calculate P-value of rejecting null
p.mod = 2*pt( -abs(t.mod), n1+n2-2 )
-# esimate t statistic without moderated variance
+# estimate t statistic without moderated variance
t = dx / stderr
# calculate P-value of rejecting null
@@ -955,7 +955,7 @@ $$
\epsilon_1 \\
\epsilon_2 \\
\epsilon_3 \\
-\epsilon_0
+\epsilon_4
\end{array}\right]
$$

@@ -1550,7 +1550,7 @@ the issue.

##### Correlation of explanatory variables
If the explanatory variables are correlated, that could lead to something
-known as multicolinearity. When this happens SE estimates of the coefficients will be too large. This is usually observed in time-course
+known as multicollinearity. When this happens, SE estimates of the coefficients will be too large. This is usually observed in time-course
data.
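
A quick simulated illustration of those inflated standard errors (all values below are assumed, not taken from the text): adding a near-duplicate predictor blows up the coefficient SEs.

```{r multicollinearitySketch,eval=FALSE}
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01) # nearly identical to x1 on purpose
y  <- 2 * x1 + rnorm(100)
# compare the "Std. Error" column between the two fits
summary(lm(y ~ x1))$coefficients      # modest SE for x1
summary(lm(y ~ x1 + x2))$coefficients # much larger SEs for x1 and x2
```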

##### Correlation of error terms
8 changes: 4 additions & 4 deletions 04-unsupervisedLearning.Rmd
@@ -75,7 +75,7 @@ scale(df)
```


-### Hiearchical clustering
+### Hierarchical clustering
This is one of the most ubiquitous clustering algorithms. Using this algorithm you can see the relationship of individual data points and relationships of clusters. This is achieved by successively joining small clusters to each other based on the inter-cluster distance. Eventually, you get a tree structure or a dendrogram that shows the relationship between the individual data points and clusters. The height of the dendrogram is the distance between clusters. Here we can show how to use this on our toy data set from four patients. The base function in R to do hierarchical clustering is `hclust()`. Below, we apply that function on Euclidean distances between patients. The resulting clustering tree or dendrogram is shown in Figure \@ref(fig:expPlot).\index{clustering!hierarchical clustering}
```{r toyClust,fig.cap="Dendrogram of distance matrix",out.width='50%'}
d=dist(df)
@@ -175,7 +175,7 @@ type2kmclu = data.frame(
table(type2kmclu)
```

-We cannot visualize the clustering from partitioning methods with a tree like we did for hierarchical clustering. Even if we can get the distances between patients the algorithm does not return the distances between clusters out of the box. However, if we had a way to visualize the distances between patients in 2 dimensions we could see the how patients and clusters relate to each other. It turns out that there is a way to compress between patient distances to a 2-dimensional plot. There are many ways to do this, and we introduce these dimension-reduction methods including the one we will use later in this chapter. For now, we are going to use a method called "multi-dimensional scaling" and plot the patients in a 2D plot color coded by their cluster assignments shown in Figure \@ref(fig:kmeansmds). We will explain this method in more detail in the [Multi-dimensional scaling] section below.
+We cannot visualize the clustering from partitioning methods with a tree like we did for hierarchical clustering. Even if we can get the distances between patients the algorithm does not return the distances between clusters out of the box. However, if we had a way to visualize the distances between patients in 2 dimensions we could see how patients and clusters relate to each other. It turns out that there is a way to compress between patient distances to a 2-dimensional plot. There are many ways to do this, and we introduce these dimension-reduction methods including the one we will use later in this chapter. For now, we are going to use a method called "multi-dimensional scaling" and plot the patients in a 2D plot color coded by their cluster assignments shown in Figure \@ref(fig:kmeansmds). We will explain this method in more detail in the [Multi-dimensional scaling] section below.

```{r, kmeansmds,out.width='50%',fig.cap="K-means cluster memberships are shown in a multi-dimensional scaling plot"}
# Calculate distances
@@ -197,7 +197,7 @@ legend("bottomright",
The plot we obtained shows the separation between clusters. However, it does not do a great job showing the separation between clusters 3 and 4, which represent CML and "no leukemia" patients. We might need another dimension to properly visualize that separation. In addition, those two clusters were closely related in the hierarchical clustering as well.

### How to choose "k", the number of clusters
-Up to this point, we have avoided the question of selecting optimal number clusters. How do we know where to cut our dendrogram or which k to choose ?
+Up to this point, we have avoided the question of selecting the optimal number of clusters. How do we know where to cut our dendrogram or which k to choose?
First of all, this is a difficult question. Usually, clusters have different granularity. Some clusters are tight and compact and some are wide, and both these types of clusters can be in the same data set. When visualized, some large clusters may look like they may have sub-clusters. So should we consider the large cluster as one cluster or should we consider the sub-clusters as individual clusters? There are some metrics to help but there is no definite answer. We will show a couple of them below.
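
The first of those metrics can be previewed in a few lines; a minimal sketch with assumed toy data (the [Silhouette] subsection below develops the idea properly):

```{r silhouettePreviewSketch,eval=FALSE}
library(cluster)
set.seed(101)
# two synthetic clusters of 25 points each, centered at 0 and 4
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 4), ncol = 2))
km  <- kmeans(toy, centers = 2)
sil <- silhouette(km$cluster, dist(toy))
mean(sil[, "sil_width"]) # average silhouette width; closer to 1 is better
```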

#### Silhouette
@@ -534,7 +534,7 @@ As you might have noticed, we set again a random seed with the `set.seed()` func
__Want to know more?__
-- How perplexity affects t-sne, interactive examples: https://distill.pub/2016/misread-tsne/
+- How perplexity affects t-SNE, interactive examples: https://distill.pub/2016/misread-tsne/
- More on perplexity: https://blog.paperspace.com/dimension-reduction-with-t-sne/
- Intro to t-SNE: https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
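
Tying those links back to code, a minimal sketch of where perplexity enters an `Rtsne::Rtsne()` call; the toy matrix and both perplexity values are assumed for illustration:

```{r tsnePerplexitySketch,eval=FALSE}
library(Rtsne)
set.seed(42) # t-SNE is stochastic; fix the seed for reproducibility
toy <- matrix(rnorm(200 * 10), ncol = 10) # 200 points in 10 dimensions
tsne.lo <- Rtsne(toy, perplexity = 5)  # small perplexity: local structure
tsne.hi <- Rtsne(toy, perplexity = 50) # large perplexity: global structure
par(mfrow = c(1, 2))
plot(tsne.lo$Y, main = "perplexity = 5",  xlab = "t-SNE 1", ylab = "t-SNE 2")
plot(tsne.hi$Y, main = "perplexity = 50", xlab = "t-SNE 1", ylab = "t-SNE 2")
```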
