Intro to Vectorization in R
===========================
### A.K.A. Do Nots USES da LOOPS
#### Created for August 2012 DC R Users Group Meetup
For me, one of the largest stumbling blocks in R was the idea of vectorization. The for loop is one of the most intuitive ideas in programming. If you program in bash or are a heavy Linux user, it is second nature.
### In R, it is to be avoided. You do not want to explicitly call a for loop.
```{r echo=FALSE}
```
Here is a toy example:
```{r}
df <- data.frame(col1 = rnorm(1000),
                 col2 = rnorm(1000, 10),
                 col3 = rnbinom(n = 1000, size = 3, mu = 30))
# A simple data.frame
head(df)
```
```{r}
# Say we want the column-wise mean of the data frame
out <- c()
for (i in 1:dim(df)[2]) {
  out[i] <- mean(df[, i])
  names(out)[i] <- colnames(df)[i]
}
print(out)
```
In the words of the Bruno, nish-nish.
We are actually committing two R-sins here: non-vectorized code and growing objects. What we want is to vectorize using one of the apply-family functions:
```{r}
apply(df, MARGIN = 2, FUN = mean)
```
Or, more simply, with the user-friendly sapply variant:
```{r}
sapply(df, mean)
```
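Of the two sins named above, growing objects can be fixed even while keeping the loop. As a minimal sketch (the names `toy`, `out_prealloc` and `out_vapply` are made up for illustration and are not part of the original example), the chunk below first preallocates the result and fills it in, then hands the looping over to `vapply`, a stricter cousin of `sapply` that also checks the type of each result:

```{r}
# A small, fixed data frame so the results are easy to check by hand
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Still a loop, but the result is allocated once instead of grown
out_prealloc <- numeric(ncol(toy))
names(out_prealloc) <- colnames(toy)
for (i in seq_len(ncol(toy))) {
  out_prealloc[i] <- mean(toy[[i]])
}

# No explicit loop: vapply insists that every result is a single number
out_vapply <- vapply(toy, mean, FUN.VALUE = numeric(1))

print(out_prealloc)
print(out_vapply)
```

Both print the same named vector. On a data frame this small the difference is invisible; the preallocation habit is meant for loops that run many thousands of times.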
### Let's look at what that just did.
'apply' functions are used to apply functions over arrays, matrices or lists. In R you can pass functions as parameters. The fancy-dancy CS term is that in R functions are first-class citizens. Get comfortable with it because it is used all over the place in R. This is akin to a callback function in async JavaScript.
```
// example with jQuery
$.getJSON('http://url/', function (data) {
  // do fun stuff with data
});
```
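The same idea in R itself, as a small sketch: a function is a value that can be stored under a name and handed to another function. The helper `spread` and the data frame `toy` are made up for illustration.

```{r}
# A hypothetical helper: the spread (max minus min) of a numeric vector
spread <- function(x) max(x) - min(x)

toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))

# `mean` without parentheses is the function itself, not a number
is.function(mean)

# Pass a named function of our own ...
spread_named <- sapply(toy, spread)

# ... or an anonymous one written inline
spread_anon <- sapply(toy, function(x) max(x) - min(x))

print(spread_named)
print(spread_anon)
```

Either call applies the function once per column and collects the results into a named vector.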
If this is Greek to you, don't worry. Just be aware that when we call **sapply(df, mean)**, the mean that we are passing in is not a numeric object but a function. This is a flavor of functional languages that is mixed into the R-soup. When we say no loops, obviously somewhere a little computer gnome has to loop through the data (that's how computers work, right?). But this looping is done in the C/FORTRAN code that underlies R and is generally faster. Now back to the **apply** function.
The three parameters in the apply function are:
```
apply(
  X      = the object we are "looping over"
  MARGIN = the 'axis' we are interested in traversing (1 = row-wise, 2 = column-wise)
  FUN    = the function we want to apply
)
```
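To see what `MARGIN` does, here is a minimal sketch on a tiny matrix whose sums can be checked by hand. The names `m`, `row_sums` and `col_sums` are made up for illustration.

```{r}
# matrix() fills column by column: c1 = 1,2  c2 = 3,4  c3 = 5,6
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# MARGIN = 1 walks over rows: one result per row
row_sums <- apply(m, MARGIN = 1, FUN = sum)

# MARGIN = 2 walks over columns: one result per column
col_sums <- apply(m, MARGIN = 2, FUN = sum)

print(row_sums)  # r1 = 1 + 3 + 5, r2 = 2 + 4 + 6
print(col_sums)  # c1 = 1 + 2, c2 = 3 + 4, c3 = 5 + 6
```

The length of the result follows the margin: two rows give two values, three columns give three.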
Truly, if there is one point that I would like you to take home, it is this: regardless of the spurious lies your mother and Montessori teachers told you over the years, you are not that special or smart. Whatever you are doing, it has likely been done before. So if you find yourself reinventing the R-wheel, make sure you poke around first.
```{r}
# the sane way of getting column-wise means
colMeans(df)
```
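As a quick sanity check, the three routes shown so far agree with each other. This is a sketch on a made-up data frame `toy` (not the `df` above, so the numbers are fixed), and it also shows the row and sum siblings of `colMeans` that ship with base R:

```{r}
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Three routes to the same named vector of column means
via_sapply   <- sapply(toy, mean)
via_apply    <- apply(toy, MARGIN = 2, FUN = mean)
via_colMeans <- colMeans(toy)

all.equal(via_sapply, via_colMeans)
all.equal(via_apply, via_colMeans)

# Siblings of colMeans for rows and for sums
row_means <- rowMeans(toy)
col_sums  <- colSums(toy)
print(row_means)
print(col_sums)
```

When a dedicated function such as `colMeans` exists, it is usually the clearest choice, and it says what you mean in one word.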
### So if you are thinking of a loop, don't
If you are doing this kind of data manipulation regularly, it is well worth your time to investigate the plyr library. It is an excellent resource.
### More reading
* [R Inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) - Highly recommended even if you're not a Dante fan
    * Especially Ch. 3 and 4
* [Great intro to the apply() family](https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/)
* [plyr](http://www.cerebralmastication.com/2009/08/a-fast-intro-to-plyr-for-r/) - The R prophet Hadley Wickham's (peace be upon his name) excellent data manipulation package for data frames
    * Most data reshaping and back-flipping can be done with this great package.