This repository was archived by the owner on Mar 20, 2018. It is now read-only.
forked from swcarpentry/r-novice-gapminder
-
Notifications
You must be signed in to change notification settings - Fork 19
/
Copy path07-control-flow.Rmd
460 lines (414 loc) · 15 KB
/
07-control-flow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
---
title: Control Flow
teaching: 45
exercises: 20
questions:
- "How can I make data-dependent choices in R?"
- "How can I repeat operations in R?"
objectives:
- "Write conditional statements with `if()` and `else()`."
- "Write and understand `for()` loops."
keypoints:
- "Use `if` and `else` to make choices."
- "Use `for` to repeat operations."
source: Rmd
---
```{r, include=FALSE}
source("../bin/chunk-options.R")
knitr_fig_path("07-")
# Silently load in the data so the rest of the lesson works
gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE)
# Silently set seed for random number generation, so we don't have to explain it
set.seed(10)
```
Often when we're coding we want to control the flow of our actions. This can be done
by setting actions to occur only if a condition or a set of conditions are met.
Alternatively, we can also set an action to occur a particular number of times.
There are several ways you can control flow in R.
For conditional statements, the most commonly used approaches are the constructs:
```{r, eval=FALSE}
# if
if (condition is true) {
perform action
}
# if ... else
if (condition is true) {
perform action
} else { # that is, if the condition is false,
perform alternative action
}
```
Say, for example, that we want R to print a message if a variable `x` has a particular value:
```{r}
# sample a random number from a Poisson distribution
# with a mean (lambda) of 8
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
}
x
```
Note you may not get the same output as your neighbour because
you may be sampling different random numbers from the same distribution.
Let's set a seed so that we all generate the same 'pseudo-random'
number, and then print more information:
```{r}
set.seed(10)
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
} else if (x > 5) {
print("x is greater than 5")
} else {
print("x is less than 5")
}
```
> ## Tip: pseudo-random numbers
>
> In the above case, the function `rpois()` generates a random number following a
> Poisson distribution with a mean (i.e. lambda) of 8. The function `set.seed()`
> guarantees that all machines will generate the exact same 'pseudo-random'
> number ([more about pseudo-random numbers](http://en.wikibooks.org/wiki/R_Programming/Random_Number_Generation)).
> So if we `set.seed(10)`, we see that `x` takes the value 8. You should get the
> exact same number.
{: .callout}
**Important:** when R evaluates the condition inside `if()` statements, it is
looking for a logical element, i.e., `TRUE` or `FALSE`. This can cause some
headaches for beginners. For example:
```{r}
x <- 4 == 3
if (x) {
"4 equals 3"
}
```
As we can see, the message was not printed because the vector x is `FALSE`
```{r}
x <- 4 == 3
x
```
> ## Challenge 1
>
> Use an `if()` statement to print a suitable message
> reporting whether there are any records from 2002 in
> the `gapminder` dataset.
> Now do the same for 2012.
>
> > ## Solution to Challenge 1
> > We will first see a solution to Challenge 1 which does not use the `any()` function.
> > We first obtain a logical vector describing which element of `gapminder$year` is equal to `2002`:
> > ```{r ch10pt1-sol, eval=FALSE}
> > gapminder[(gapminder$year == 2002),]
> > ```
> > Then, we count the number of rows of the data.frame `gapminder` that correspond to the 2002:
> > ```{r ch10pt2-sol, eval=FALSE}
> > rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])
> > ```
> > The presence of any record for the year 2002 is equivalent to the request that `rows2002_number` is one or more:
> > ```{r ch10pt3-sol, eval=FALSE}
> > rows2002_number >= 1
> > ```
> > Putting all together, we obtain:
> > ```{r ch10pt4-sol, eval=FALSE}
> > if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
> > print("Record(s) for the year 2002 found.")
> > }
> > ```
> >
> > All this can be done more quickly with `any()`. The logical condition can be expressed as:
> > ```{r ch10pt5-sol, eval=FALSE}
> > if(any(gapminder$year == 2002)){
> > print("Record(s) for the year 2002 found.")
> > }
> > ```
> >
> {: .solution}
{: .challenge}
Did anyone get a warning message like this?
```{r, echo=FALSE}
if (gapminder$year == 2012) {}
```
If your condition evaluates to a vector with more than one logical element,
the function `if()` will still run, but will only evaluate the condition in the first
element. Here you need to make sure your condition is of length 1.
> ## Tip: `any()` and `all()`
>
> The `any()` function will return TRUE if at least one
> TRUE value is found within a vector, otherwise it will return `FALSE`.
> This can be used in a similar way to the `%in%` operator.
> The function `all()`, as the name suggests, will only return `TRUE` if all values in
> the vector are `TRUE`.
{: .callout}
## Repeating operations
If you want to iterate over
a set of values, when the order of iteration is important, and perform the
same operation on each, a `for()` loop will do the job.
We saw `for()` loops in the shell lessons earlier. This is the most
flexible of looping operations, but therefore also the hardest to use
correctly. Avoid using `for()` loops unless the order of iteration is important:
i.e. the calculation at each iteration depends on the results of previous iterations.
The basic structure of a `for()` loop is:
```{r, eval=FALSE}
for(iterator in set of values){
do a thing
}
```
For example:
```{r}
for(i in 1:10){
print(i)
}
```
The `1:10` bit creates a vector on the fly; you can iterate
over any other vector as well.
We can use a `for()` loop nested within another `for()` loop to iterate over two things at
once.
```{r}
for(i in 1:5){
for(j in c('a', 'b', 'c', 'd', 'e')){
print(paste(i,j))
}
}
```
Rather than printing the results, we could write the loop output to a new object.
```{r}
output_vector <- c()
for(i in 1:5){
for(j in c('a', 'b', 'c', 'd', 'e')){
temp_output <- paste(i, j)
output_vector <- c(output_vector, temp_output)
}
}
output_vector
```
This approach can be useful, but 'growing your results' (building
the result object incrementally) is computationally inefficient, so avoid
it when you are iterating through a lot of values.
> ## Tip: don't grow your results
>
> One of the biggest things that trips up novices and
> experienced R users alike, is building a results object
> (vector, list, matrix, data frame) as your for loop progresses.
> Computers are very bad at handling this, so your calculations
> can very quickly slow to a crawl. It's much better to define
> an empty results object before hand of the appropriate dimensions.
> So if you know the end result will be stored in a matrix like above,
> create an empty matrix with 5 row and 5 columns, then at each iteration
> store the results in the appropriate location.
{: .callout}
A better way is to define your (empty) output object before filling in the values.
For this example, it looks more involved, but is still more efficient.
```{r}
output_matrix <- matrix(nrow=5, ncol=5)
j_vector <- c('a', 'b', 'c', 'd', 'e')
for(i in 1:5){
for(j in 1:5){
temp_j_value <- j_vector[j]
temp_output <- paste(i, temp_j_value)
output_matrix[i, j] <- temp_output
}
}
output_vector2 <- as.vector(output_matrix)
output_vector2
```
> ## Tip: While loops
>
>
> Sometimes you will find yourself needing to repeat an operation until a certain
> condition is met. You can do this with a `while()` loop.
>
> ```{r, eval=FALSE}
> while(this condition is true){
> do a thing
> }
> ```
>
> As an example, here's a while loop
> that generates random numbers from a uniform distribution (the `runif()` function)
> between 0 and 1 until it gets one that's less than 0.1.
>
> ~~~
> z <- 1
> while(z > 0.1){
> z <- runif(1)
> print(z)
> }
> ~~~
> {: .r}
>
> `while()` loops will not always be appropriate. You have to be particularly careful
> that you don't end up in an infinite loop because your condition is never met.
{: .callout}
> ## Challenge 2
>
> Compare the objects output_vector and
> output_vector2. Are they the same? If not, why not?
> How would you change the last block of code to make output_vector2
> the same as output_vector?
>
> > ## Solution to Challenge 2
> > We can check whether the two vectors are identical using the `all()` function:
> > ```{r ch10pt6-sol, eval=FALSE}
> > all(output_vector == output_vector2)
> > ```
> > However, all the elements of `output_vector` can be found in `output_vector2`:
> > ```{r ch10pt7-sol, eval=FALSE}
> > all(output_vector %in% output_vector2)
> > ```
> > and vice versa:
> > ```{r ch10pt8-sol, eval=FALSE}
> > all(output_vector2 %in% output_vector)
> > ```
> > therefore, the element in `output_vector` and `output_vector2` are just sorted in a different order.
> > This is because `as.vector()` outputs the elements of an input matrix going over its column.
> > Taking a look at `output_matrix`, we can notice that we want its elements by rows.
> > The solution is to transpose the `output_matrix`. We can do it either by calling the transpose function
> > `t()` or by inputing the elements in the right order.
> > The first solution requires to change the original
> > ```{r ch10pt9-sol, eval=FALSE}
> > output_vector2 <- as.vector(output_matrix)
> > ```
> > into
> > ```{r ch10pt10-sol, eval=FALSE}
> > output_vector2 <- as.vector(t(output_matrix))
> > ```
> > The second solution requires to change
> > ```{r ch10pt11-sol, eval=FALSE}
> > output_matrix[i, j] <- temp_output
> > ```
> > into
> > ```{r ch10pt12-sol, eval=FALSE}
> > output_matrix[j, i] <- temp_output
> > ```
> {: .solution}
{: .challenge}
> ## Challenge 3
>
> Write a script that loops through the `gapminder` data by continent and prints out
> whether the mean life expectancy is smaller or larger than 50
> years.
>
> > ## Solution to Challenge 3
> >
> > **Step 1**: We want to make sure we can extract all the unique values of the continent vector
> > ```{r 07-chall-03-sol-a, eval=FALSE}
> > gapminder <- read.csv("data/gapminder-FiveYearData.csv")
> > unique(gapminder$continent)
> > ```
> >
> > **Step 2**: We also need to loop over each of these continents and calculate the average life expectancy for each `subset` of data.
> > We can do that as follows:
> >
> > 1. Loop over each of the unique values of 'continent'
> > 2. For each value of continent, create a temporary variable storing the life exepectancy for that subset,
> > 3. Return the calculated life expectancy to the user by printing the output:
> >
> > ```{r 07-chall-03-sol-b, eval=FALSE}
> > for( iContinent in unique(gapminder$continent) ){
> > tmp <- mean(subset(gapminder, continent==iContinent)$lifeExp)
> > cat("Average Life Expectancy in", iContinent, "is", tmp, "\n")
> > rm(tmp)
}
> > ```
> >
> > **Step 3**: The exercise only wants the output printed if the average life expectancy is less than 50 or greater than 50. So we need to add an `if` condition before printing.
> > So we need to add an `if` condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and print an output conditional on the result.
> > We need to amend (3) from above:
> >
> > 3a. If the calculated life expectancy is less than some threshold (50 years), return the continent and a statement that life expectancy is less than threshold, otherwise return the continent and a statement that life expectancy is greater than threshold,:
> >
> > ```{r 07-chall-03-sol-c, eval=FALSE}
> > thresholdValue <- 50
> >
> > for( iContinent in unique(gapminder$continent) ){
> > tmp <- mean(subset(gapminder, continent==iContinent)$lifeExp)
> >
> > if(tmp < thresholdValue){
> > cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
> > }
> > else{
> > cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
> > } # end if else condition
> > rm(tmp)
> > } # end for loop
> >
> > ```
> {: .solution}
{: .challenge}
> ## Challenge 4
>
> Modify the script from Challenge 4 to loop over each
> country. This time print out whether the life expectancy is
> smaller than 50, between 50 and 70, or greater than 70.
>
> > ## Solution to Challenge 4
> > We modify our solution to Challenge 3 by now adding two thresholds, `lowerThreshold` and `upperThreshold` and extending our if-else statements:
> >
> > ```{r 07-chall-04-sol, eval=FALSE}
> > lowerThreshold <- 50
> > upperThreshold <- 70
> >
> > for( iCountry in unique(gapminder$country) ){
> > tmp <- mean(subset(gapminder, country==iCountry)$lifeExp)
> >
> > if(tmp < lowerThreshold){
> > cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
> > }
> > else if(tmp > lowerThreshold && tmp < upperThreshold){
> > cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
> > }
> > else{
> > cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
> > }
> > rm(tmp)
> > }
> > ```
> {: .solution}
{: .challenge}
> ## Challenge 5 - Advanced
>
> Write a script that loops over each country in the `gapminder` dataset,
> tests whether the country starts with a 'B', and graphs life expectancy
> against time as a line graph if the mean life expectancy is under 50 years.
>
> > Solution for Challenge 5
> >
> > We will use the `grep` command that was introduced in the Unix Shell lesson to find countries that start with "B."
> > Lets understand how to do this first.
> > Following from the Unix shell section we may be tempted to try the following
> > ```{r 07-chall-05-sol-a, eval=FALSE}
> > grep("^B", unique(gapminder$country))
> > ```
> >
> > But when we evaluate this command it returns the indices of the factor variable `country` that start with "B."
> > To get the values, we must add the `value=TRUE` option to the `grep` command:
> >
> > ```{r 07-chall-05-sol-b, eval=FALSE}
> > grep("^B", unique(gapminder$country), value=TRUE)
> > ```
> >
> > We will now store these countries in a variable called candidateCountries, and then loop over each entry in the variable.
> > Inside the loop, we evaluate the average life expectancy for each country, and if the average life expectancy is less than 50 we use base-plot to plot the evolution of average life expectancy:
> >
> > ```{r 07-chall-05-sol-c, eval=FALSE}
> > thresholdValue <- 50
> > candidateCountries <- grep("^B", unique(gapminder$country), value=TRUE)
> >
> > for( iCountry in candidateCountries){
> > tmp <- mean(subset(gapminder, country==iCountry)$lifeExp)
> >
> > if(tmp < thresholdValue){
> > cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
> >
> > with(subset(gapminder, country==iCountry),
> > plot(year,lifeExp,
> > type="o",
> > main = paste("Life Expectancy in", iCountry, "over time"),
> > ylab = "Life Expectancy",
> > xlab = "Year"
> > ) # end plot
> > ) # end with
> > } # end for loop
> > rm(tmp)
> > }```
> {: .solution}
{: .challenge}