Merge pull request #57 from federicomarini/master
Some typos fixed, plus dim of matrix + adding a missing word.
al2na authored Sep 9, 2024
2 parents 08d3028 + fcfeea6 commit 2121198
Showing 9 changed files with 79 additions and 79 deletions.
20 changes: 10 additions & 10 deletions 02-intro2R.Rmd
@@ -104,8 +104,8 @@ Visualization is an important part of all data analysis techniques including com
- Visualization of quantitative assays for a given locus in the genome

## Getting started with R
-Download and install R (http://cran.r-project.org/) and RStudio (http://www.rstudio.com/) if you do not have them already. Rstudio is optional but it is a great tool if you are just starting to learn R.
-You will need specific data sets to run the code snippets in this book; we have explained how to install and use the data in the [Data for the book] section in the [Preface]. If you have not used Rstudio before, we recommend running it and familiarizing yourself with it first. To put it simply, this interface combines multiple features you will need while analyzing data. You can see your code, how it is executed, the plots you make, and your data all in one interface.
+Download and install R (http://cran.r-project.org/) and RStudio (http://www.rstudio.com/) if you do not have them already. RStudio is optional but it is a great tool if you are just starting to learn R.
+You will need specific data sets to run the code snippets in this book; we have explained how to install and use the data in the [Data for the book] section in the [Preface]. If you have not used RStudio before, we recommend running it and familiarizing yourself with it first. To put it simply, this interface combines multiple features you will need while analyzing data. You can see your code, how it is executed, the plots you make, and your data all in one interface.


### Installing packages
@@ -116,12 +116,12 @@ You can install CRAN packages using `install.packages()` (# is the comment chara
# install the package named "randomForest" from CRAN
install.packages("randomForest")
```
-You can install bioconductor packages with a specific installer script.
+You can install Bioconductor packages with a specific installer script.
```{r installpack2,eval=FALSE}
# get the installer package if you don't have it
install.packages("BiocManager")
-# install bioconductor package "rtracklayer"
+# install Bioconductor package "rtracklayer"
BiocManager::install("rtracklayer")
```
You can install packages from GitHub using the `install_github()` function from the `devtools` package.
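
In its simplest form the call is a one-liner; here is a minimal sketch, where `"user/repo"` is a placeholder slug rather than a real repository:

```{r installGithubSketch,eval=FALSE}
# get devtools first if it is not installed
install.packages("devtools")
# "user/repo" is a placeholder: GitHub owner name, then repository name
devtools::install_github("user/repo")
```
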
@@ -146,7 +146,7 @@ You can also update CRAN and Bioconductor packages.
# updating CRAN packages
update.packages()
-# updating bioconductor packages
+# updating Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
@@ -229,8 +229,8 @@ A matrix refers to a numeric array of rows and columns. You can think of it as a
x<-c(1,2,3,4)
y<-c(4,5,6,7)
m1<-cbind(x,y);m1
-t(m1) # transposed of m1
-dim(m1) # 2 by 5 matrix
+t(m1) # transpose of m1
+dim(m1) # 4 by 2 matrix
```
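
The corrected `dim()` comment is easy to verify in a fresh session; a minimal check mirroring the chunk above:

```{r dimCheckSketch,eval=FALSE}
m1 <- cbind(c(1,2,3,4), c(4,5,6,7))
dim(m1)    # c(4, 2): four rows, two columns
dim(t(m1)) # c(2, 4): transposing swaps rows and columns
```
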
You can also directly list the elements and specify the matrix:
```{r matrix2}
@@ -379,7 +379,7 @@ We can modify all the plots by providing certain arguments to the plotting func
hist(x,main="Hello histogram!!!",col="red")
```

-Next, we will make a scatter plot. Scatter plots are one the most common plots you will encounter in data analysis. We will sample another set of 50 values and plot those against the ones we sampled earlier. The scatter plot shows values of two variables for a set of data points. It is useful to visualize relationships between two variables. It is frequently used in connection with correlation and linear regression. There are other variants of scatter plots which show density of the points with different colors. We will show examples of those scatter plots in later chapters. The scatter plot from our sampling experiment is shown in Figure \@ref(fig:makeScatter). Notice that, in addition to `main` argument we used `xlab` and `ylab` arguments to give labels to the plot. You can customize the plots even more than this. See `?plot` and `?par` for more arguments that can help you customize the plots.
+Next, we will make a scatter plot. Scatter plots are one of the most common plots you will encounter in data analysis. We will sample another set of 50 values and plot those against the ones we sampled earlier. The scatter plot shows values of two variables for a set of data points. It is useful to visualize relationships between two variables. It is frequently used in connection with correlation and linear regression. There are other variants of scatter plots which show density of the points with different colors. We will show examples of those scatter plots in later chapters. The scatter plot from our sampling experiment is shown in Figure \@ref(fig:makeScatter). Notice that, in addition to `main` argument we used `xlab` and `ylab` arguments to give labels to the plot. You can customize the plots even more than this. See `?plot` and `?par` for more arguments that can help you customize the plots.

```{r makeScatter,out.width='50%',fig.width=5, fig.cap="Scatter plot example."}
# randomly sample 50 points from normal distribution
@@ -595,7 +595,7 @@ The task above is a bit pointless. Normally in a loop, you would want to do some
the data frame (by subtracting the start coordinate from the end coordinate).\index{R Programming Language!loops}


-**Note:**If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with few repetitions first and check the results. This will help you make sure the code in the loop works before executing it thousands of times.
+**Note:** If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with few repetitions first and check the results. This will help you make sure the code in the loop works before executing it thousands of times.
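
Following that advice, a dry run with a handful of toy intervals might look like the sketch below; the data frame and its `start`/`end` column names are assumed for illustration:

```{r forloopDryRunSketch,eval=FALSE}
# three toy intervals, small enough to check the result by eye
toy <- data.frame(start = c(100, 400, 900), end = c(250, 700, 1000))
lens <- numeric(nrow(toy))             # pre-allocate the result vector
for (i in 1:nrow(toy)) {
  lens[i] <- toy$end[i] - toy$start[i] # interval length: end minus start
}
lens # expect 150, 300, 100
```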

```{r forloop2}
# this is where we will keep the lengths
@@ -968,7 +968,7 @@ hist(x1,main="title")
18. Do the same as above but this time with `par(mfrow=c(1,2))`. [Difficulty: **Beginner/Intermediate**]


-19. Save your plot using the "Export" button in Rstudio. [Difficulty: **Beginner**]
+19. Save your plot using the "Export" button in RStudio. [Difficulty: **Beginner**]

20. You can make a scatter plot showing the density
of points rather than points themselves. If you use points it looks like this:
12 changes: 6 additions & 6 deletions 03-StatsForGenomics.Rmd
@@ -320,7 +320,7 @@ original sample. Then, we calculate the parameter of interest, in this case the
repeat this process a large number of times, such as 1000. At this point, we would have a distribution of re-sampled
means. We can then calculate the 2.5th and 97.5th percentiles and these will
be our so-called 95% confidence interval. This procedure, resampling with replacement to
-estimate the precision of population parameter estimates, is known as the __bootstrap resampling__ or __bootstraping__.\index{bootstrap resampling}
+estimate the precision of population parameter estimates, is known as the __bootstrap resampling__ or __bootstrapping__.\index{bootstrap resampling}
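
The recipe is mechanical enough to fit in a few lines. A minimal sketch, where the sample, its distribution parameters, and the number of bootstrap replicates are all assumed for illustration:

```{r bootstrapSketch,eval=FALSE}
set.seed(21)
x <- rnorm(50, mean = 20, sd = 5) # the "original sample" (parameters assumed)
# resample with replacement 1000 times, recording the mean each time
boot.means <- replicate(1000, mean(sample(x, replace = TRUE)))
# the 2.5th and 97.5th percentiles give a 95% bootstrap confidence interval
quantile(boot.means, probs = c(0.025, 0.975))
```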

Let's see how we can do this in practice. We simulate a sample
coming from a normal distribution (but we pretend we don't know the
@@ -761,20 +761,20 @@ dx=rowMeans(data[,group1])-rowMeans(data[,group2])
require(matrixStats)
-# get the esimate of pooled variance
+# get the estimate of pooled variance
stderr = sqrt( (rowVars(data[,group1])*(n1-1) +
rowVars(data[,group2])*(n2-1)) / (n1+n2-2) * ( 1/n1 + 1/n2 ))
# do the shrinking towards median
mod.stderr = (stderr + median(stderr)) / 2 # moderation in variation
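# (shrinking toward the median pulls extreme per-gene variance estimates
#  toward the bulk, which stabilizes t-statistics when per-group sample
#  sizes are small)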
-# esimate t statistic with moderated variance
+# estimate t statistic with moderated variance
t.mod <- dx / mod.stderr
# calculate P-value of rejecting null
p.mod = 2*pt( -abs(t.mod), n1+n2-2 )
-# esimate t statistic without moderated variance
+# estimate t statistic without moderated variance
t = dx / stderr
# calculate P-value of rejecting null
@@ -955,7 +955,7 @@ $$
\epsilon_1 \\
\epsilon_2 \\
\epsilon_3 \\
-\epsilon_0
+\epsilon_4
\end{array}\right]
$$

@@ -1550,7 +1550,7 @@ the issue.

##### Correlation of explanatory variables
If the explanatory variables are correlated, that could lead to something
-known as multicolinearity. When this happens SE estimates of the coefficients will be too large. This is usually observed in time-course
+known as multicollinearity. When this happens, SE estimates of the coefficients will be too large. This is usually observed in time-course
data.
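
A quick simulated illustration of those inflated standard errors (all values below are assumed, not taken from the text): adding a near-duplicate predictor blows up the coefficient SEs.

```{r multicollinearitySketch,eval=FALSE}
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01) # nearly identical to x1 on purpose
y  <- 2 * x1 + rnorm(100)
# compare the "Std. Error" column between the two fits
summary(lm(y ~ x1))$coefficients      # modest SE for x1
summary(lm(y ~ x1 + x2))$coefficients # much larger SEs for x1 and x2
```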

##### Correlation of error terms
8 changes: 4 additions & 4 deletions 04-unsupervisedLearning.Rmd
@@ -75,7 +75,7 @@ scale(df)
```


-### Hiearchical clustering
+### Hierarchical clustering
This is one of the most ubiquitous clustering algorithms. Using this algorithm you can see the relationship of individual data points and relationships of clusters. This is achieved by successively joining small clusters to each other based on the inter-cluster distance. Eventually, you get a tree structure or a dendrogram that shows the relationship between the individual data points and clusters. The height of the dendrogram is the distance between clusters. Here we can show how to use this on our toy data set from four patients. The base function in R to do hierarchical clustering is `hclust()`. Below, we apply that function on Euclidean distances between patients. The resulting clustering tree or dendrogram is shown in Figure \@ref(fig:expPlot).\index{clustering!hierarchical clustering}
```{r toyClust,fig.cap="Dendrogram of distance matrix",out.width='50%'}
d=dist(df)
@@ -175,7 +175,7 @@ type2kmclu = data.frame(
table(type2kmclu)
```

-We cannot visualize the clustering from partitioning methods with a tree like we did for hierarchical clustering. Even if we can get the distances between patients the algorithm does not return the distances between clusters out of the box. However, if we had a way to visualize the distances between patients in 2 dimensions we could see the how patients and clusters relate to each other. It turns out that there is a way to compress between patient distances to a 2-dimensional plot. There are many ways to do this, and we introduce these dimension-reduction methods including the one we will use later in this chapter. For now, we are going to use a method called "multi-dimensional scaling" and plot the patients in a 2D plot color coded by their cluster assignments shown in Figure \@ref(fig:kmeansmds). We will explain this method in more detail in the [Multi-dimensional scaling] section below.
+We cannot visualize the clustering from partitioning methods with a tree like we did for hierarchical clustering. Even if we can get the distances between patients the algorithm does not return the distances between clusters out of the box. However, if we had a way to visualize the distances between patients in 2 dimensions we could see how patients and clusters relate to each other. It turns out that there is a way to compress between patient distances to a 2-dimensional plot. There are many ways to do this, and we introduce these dimension-reduction methods including the one we will use later in this chapter. For now, we are going to use a method called "multi-dimensional scaling" and plot the patients in a 2D plot color coded by their cluster assignments shown in Figure \@ref(fig:kmeansmds). We will explain this method in more detail in the [Multi-dimensional scaling] section below.

```{r, kmeansmds,out.width='50%',fig.cap="K-means cluster memberships are shown in a multi-dimensional scaling plot"}
# Calculate distances
@@ -197,7 +197,7 @@ legend("bottomright",
The plot we obtained shows the separation between clusters. However, it does not do a great job showing the separation between clusters 3 and 4, which represent CML and "no leukemia" patients. We might need another dimension to properly visualize that separation. In addition, those two clusters were closely related in the hierarchical clustering as well.

### How to choose "k", the number of clusters
-Up to this point, we have avoided the question of selecting optimal number clusters. How do we know where to cut our dendrogram or which k to choose ?
+Up to this point, we have avoided the question of selecting the optimal number of clusters. How do we know where to cut our dendrogram or which k to choose?
First of all, this is a difficult question. Usually, clusters have different granularity. Some clusters are tight and compact and some are wide, and both these types of clusters can be in the same data set. When visualized, some large clusters may look like they may have sub-clusters. So should we consider the large cluster as one cluster or should we consider the sub-clusters as individual clusters? There are some metrics to help but there is no definite answer. We will show a couple of them below.
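
The first of those metrics can be previewed in a few lines; a minimal sketch with assumed toy data (the [Silhouette] subsection below develops the idea properly):

```{r silhouettePreviewSketch,eval=FALSE}
library(cluster)
set.seed(101)
# two synthetic clusters of 25 points each, centered at 0 and 4
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 4), ncol = 2))
km  <- kmeans(toy, centers = 2)
sil <- silhouette(km$cluster, dist(toy))
mean(sil[, "sil_width"]) # average silhouette width; closer to 1 is better
```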

#### Silhouette
@@ -534,7 +534,7 @@ As you might have noticed, we set again a random seed with the `set.seed()` func
__Want to know more?__
-- How perplexity affects t-sne, interactive examples: https://distill.pub/2016/misread-tsne/
+- How perplexity affects t-SNE, interactive examples: https://distill.pub/2016/misread-tsne/
- More on perplexity: https://blog.paperspace.com/dimension-reduction-with-t-sne/
- Intro to t-SNE: https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
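
Tying those links back to code, a minimal sketch of where perplexity enters an `Rtsne::Rtsne()` call; the toy matrix and both perplexity values are assumed for illustration:

```{r tsnePerplexitySketch,eval=FALSE}
library(Rtsne)
set.seed(42) # t-SNE is stochastic; fix the seed for reproducibility
toy <- matrix(rnorm(200 * 10), ncol = 10) # 200 points in 10 dimensions
tsne.lo <- Rtsne(toy, perplexity = 5)  # small perplexity: local structure
tsne.hi <- Rtsne(toy, perplexity = 50) # large perplexity: global structure
par(mfrow = c(1, 2))
plot(tsne.lo$Y, main = "perplexity = 5",  xlab = "t-SNE 1", ylab = "t-SNE 2")
plot(tsne.hi$Y, main = "perplexity = 50", xlab = "t-SNE 1", ylab = "t-SNE 2")
```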
