-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path02-advanced_data_handling.Rmd
531 lines (356 loc) · 31.3 KB
/
02-advanced_data_handling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
---
title: "01a-advanced data handling"
output:
html_document:
toc: yes
df_print: paged
html_notebook: default
pdf_document:
toc: yes
---
```{r, echo=FALSE}
library(knitr)
library(dplyr)
library(stringr)
options(scipen = F)
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=FALSE, rownames.print = FALSE, rows.print = 10, eval = TRUE, warning = FALSE, message = FALSE)
top10_track_streams <- c(163608, 126687, 120480, 110022, 108630, 95639, 94690, 89011, 87869, 85599)
top10_artist_names <- c("Axwell /\\ Ingrosso", "Imagine Dragons", "J. Balvin", "Robin Schulz", "Jonas Blue", "David Guetta", "French Montana", "Calvin Harris", "Liam Payne", "Lauv") # Characters have to be put in ""
top10_track_explicit <- c(0,0,0,0,0,0,1,1,0,0)
top10_track_explicit <- factor(top10_track_explicit,
levels = c(0:1),
labels = c("not explicit", "explicit"))
top10_artist_genre <- c("Dance","Alternative","Latino","Dance","Dance","Dance","Hip-Hop/Rap","Dance","Pop","Pop")
top10_artist_genre <- as.factor(top10_artist_genre)
top_10_track_release_date <- as.Date(c("2017-05-24", "2017-06-23", "2017-07-03", "2017-06-30", "2017-05-05", "2017-06-09", "2017-07-14", "2017-06-16", "2017-05-18", "2017-05-19"))
top10_track_explicit_1 <- c(FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE)
music_data <- data.frame(top10_track_streams,
top10_artist_names,
top10_track_explicit,
top10_artist_genre,
top_10_track_release_date,
top10_track_explicit_1,
stringsAsFactors = FALSE)
```
## Advanced data handling
This chapter covers more advanced techniques for handling data in R. It is primarily based on
* Wickham, H., & Grolemund, G. (2016). R for Data Science - Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media.
### The dplyr package
There are many ways to achieve the same thing in R. Data transformation and handling tasks can be solved with the functions provided by base R (i.e. the functions that come with every R installation), but the ```dplyr```package offers a comprehensive suite of functions that makes many tasks easier, while keeping code very readable. As such, we will be presenting it here, but keep in mind that all of these tasks could also be achieved without any packages.
Before we continue, ensure that the ```dplyr()``` package is installed and loaded.
The ```dplyr()``` package includes six core functions that make many data handling tasks a lot easier:
* filter(): filter rows
* select(): select columns
* arrange(): re-order or arrange rows
* mutate(): create new columns
* summarise(): summarise values
* group_by(): allows for group operations
Each of these functions will be discussed in the following in more detail.
#### Filter rows
One of the most basic tasks one could wish to perform on a data set is select certain observations based on various characteristics. ```dplyr``` uses the ```filter()``` function to this end. To select certain rows from a data set you simply supply the data frame as the first argument and then tell ```filter()``` the logical criteria it should use to select observations. While this may sound fairly abstract, it will become very clear after a few examples.
Recall the ```music_data``` data frame from the previous chapter. Suppose we want to select only observations where the lyrics are not explicit. The code to achieve this would looks as follows:
```{r message=FALSE, warning=FALSE}
filter(music_data, top10_track_explicit == "not explicit")
```
The first argument supplied to the ```filter()``` function is the data frame we want to subset. The second argument tells ```filter()``` that we only want observations where the column ```top10_track_explicit``` is equal to the value ```"not explicit"```. If you look into the output you will notice that only tracks with non explicit lyrics have been returned.
Another way to filter observations is to choose all observations where one column is within a certain range. This can be achieved with the logical operators introduced in the basic data handling chapter. In the following example we select all tracks with less than 100000 streams.
```{r message=FALSE, warning=FALSE}
filter(music_data, top10_track_streams < 100000)
```
You can enforce multiple conditions with ```&```. The following example selects all observations with less than 150000 but more than 100000 streams.
```{r message=FALSE, warning=FALSE}
filter(music_data, top10_track_streams > 100000 & top10_track_streams < 150000)
```
The ```|``` symbol is the way R expresses "or". This way you can select observations that fulfill either one or the other condition. Say we would like to select all observations with less than 100000 or more than 150000 streams. The following code would do exactly that
```{r message=FALSE, warning=FALSE}
filter(music_data, top10_track_streams < 100000 | top10_track_streams > 150000)
```
A very useful feature of the ```filter()``` function is its ability to accept multiple criteria at once. Say we want to select all tracks marked as ```"not explicit"``` with less than 100000 streams. This can be achieved by simply supplying the function with additional arguments, as in the example below.
```{r message=FALSE, warning=FALSE}
filter(music_data, top10_track_explicit == "not explicit", top10_track_streams < 100000)
```
#### Select columns
Another common task is to select or exclude certain columns of a data frame. The ```dplyr``` package contains the ```select()``` function for exactly this purpose. Similarly to ```filter()``` you first supply the function with the data frame you wish to apply the selection to, followed by the columns you wish to select or exclude.
The following code selects the two columns ```top10_track_explicit``` and ```top10_track_streams``` from the ```music_data``` data set.
```{r message=FALSE, warning=FALSE}
select(music_data, top10_track_explicit, top10_track_streams)
```
To remove columns from a data frame you simply put a ```-``` before the column name.
```{r message=FALSE, warning=FALSE}
select(music_data, -top10_track_explicit, -top10_track_streams)
```
You can also select or exclude a whole range of columns through numbers or names.
```{r message=FALSE, warning=FALSE}
# Selects all columns from top10_track_explicit to top_10_track_release_date
select(music_data, top10_track_explicit:top_10_track_release_date)
# This is equivalent to
select(music_data, 3:5)
```
#### Arrange rows
If you just want to change the order of a data frame without discarding any observations or columns, you can use the ```arrange()``` function. It takes a data frame and a set of column names to order by, always in ascending order. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
```{r message=FALSE, warning=FALSE}
arrange(music_data, top10_artist_genre, top10_track_streams)
```
If you wish to arrange them in descending order, you can wrap the column name in the ```desc()``` function.
```{r message=FALSE, warning=FALSE}
arrange(music_data, top10_artist_genre, desc(top10_track_streams))
```
#### Adding and changing variables
There are two functions to create new variables based on other variables in the dplyr package, ```mutate()``` and ```transmute()```. They both do the same thing, with one key difference: ```mutate()``` returns both the newly created variables _and_ the ones previously contained in the data frame, while ```transmute()``` only returns the newly created variables.
Both functions take variables already present in the data set and apply a function to them to generate new variables. This can be something as simple as adding 1 to every entry in a column or more complex procedures, like standardizing variables. The syntax is fairly straightforward: The first argument is the data frame we wish to apply the transformation to, and all the following arguments are the new variables we want to create. An example will make this clearer. Say we (for some reason) want to add 10 to every entry of the ```top10_track_streams``` column in our data set. The code to do this would look as follows:
```{r, message=FALSE, warning=FALSE}
mutate(music_data, streams_plus_10 = top10_track_streams + 10)
```
This code essentially defines a new column named ``` streams_plus_10```, which is the column ```top10_track_streams``` + 10.
As previously mentioned, we can also perform more complex operations, such as standardizing variables (i.e. subtracting the mean value and dividing by the standard deviation). Note that we are generating the mean and the standard deviation in our code by applying functions (```mean()``` and ```sd()```) to the entire column and then we use these values to perform the standardization on _each_ value of the column.
```{r, message=FALSE, warning=FALSE}
mutate(music_data, streams_standardised = (top10_track_streams - mean(top10_track_streams))/ sd(top10_track_streams))
```
Note that you could also use the ```scale()``` function to do the same:
```{r, message=FALSE, warning=FALSE}
mutate(music_data, streams_standardised = scale(top10_track_streams))
```
You can also add other vectors to an existing data frame with mutate, given that it is the same length as the data frame you want to add it to.
```{r, message=FALSE, warning=FALSE}
# create a vector of length 10
extra_column <- c(1,2,3,4,5,6,7,8,9,10)
mutate(music_data, new_data = extra_column)
```
If you don't want to add a new variable and only want to edit a variable already present in the data frame, ```dplyr``` has you covered with the ```mutate_at()``` function. The practical thing about this function is that it can also be applied to a whole range of variables if you want to perform the same operation on multiple columns. To do this, you give the function not just a single column name, but a whole vector of column names.
Say we first want to change the type of a single column. For example, maybe we want ```top10_track_explicit``` to be a character column and not a factor. We again tell ```mutate_at()``` which data frame we want to change, followed by the column(s) to change and finally the function we want to apply. You may have noticed that the column name is in quotation marks here, as opposed to the other functions we have met from the ```dplyr```package. This is a side effect of its ability to apply a function to a range of columns and not just a single one, as this can only be achieved via a character vector. So from a technical standpoint we are giving the ```mutate_at()``` function a character vector of length one, solely containing the element ```"top10_track_explicit"```.
```{r, message=FALSE, warning=FALSE}
mutate_at(music_data, "top10_track_explicit", as.character)
```
Now say we want to change multiple columns to be character vectors. To do this we will first create a vector of the names of the columns we want to apply the function to and then simply give ```mutate_at()```this vector as one of its arguments. If you do this, don't forget the ```c()``` function to create a vector.
```{r, warning=FALSE, message=FALSE}
columns <- c("top10_track_explicit", "top10_artist_genre", "top10_track_explicit_1")
mutate_at(music_data, columns, as.character)
```
Note that if you merely want to rename a variable without changing its content, you may use the ```rename()``` function to achieve this. The syntax may seem familiar at this point, with the first argument being the data frame to apply the function to and the following arguments being the transformations to apply. The example changes the names of the ```top10_track_explicit``` and ```top10_artist_names``` columns into ```explicit``` and ```names```, respectively.
```{r message=FALSE, warning=FALSE}
rename(music_data, explicit = top10_track_explicit, names = top10_artist_names)
```
#### Creating custom summaries
The ```summarise()``` function lets you build customized summaries of your data. This can range from creating means and standard deviations of certain variables to simply counting how many observations are in a data frame. Say we want to find out the mean and standard deviation of the number of streams and also count the number of observations. With ```summarise()``` that would look as follows:
```{r, message=FALSE, warning=FALSE}
summarise(music_data, n_observations = n(), mean_streams = mean(top10_track_streams), sd_streams = sd(top10_track_streams))
```
On its own, this function is not that impressive. After all, we could just apply the ```mean()```, ```sd()``` and ```nrow()``` functions individually and would have gotten the same result, albeit not in such a nice format. However, when combined with the ```group_by()``` function, ```summarise()``` becomes very useful as we will see next.
#### Group operations
The ```group_by()``` splits a data frame into groups, by the values of a column in the data frame. Say we wanted to calculate the mean and standard deviation of explicit and non-explicit songs separately.
```{r, message=FALSE, warning=FALSE}
music_data <- group_by(music_data, top10_track_explicit)
summarise(music_data, n_observations = n(), mean_streams = mean(top10_track_streams), sd_streams = sd(top10_track_streams))
```
#### Pipes
A very practical feature of the dplyr package are so called "pipes". Say you want to apply three of the previously mentioned functions to the same data frame. So far, the way we learned to do this would be as follows:
```{r, message=FALSE, warning=FALSE}
# First use select() to take only certain columns
music_data_new <- select(music_data, top10_track_explicit_1, top10_artist_names, top10_track_streams)
# Now use filter() to choose only rows that fulfill certain criteria
music_data_new <- filter(music_data_new, top10_track_streams < 100000)
# Then change order with arrange()
music_data_new <- arrange(music_data_new, top10_track_streams)
# Print to console
music_data_new
```
While this does achieve our objective, it is quite tedious. With the pipes offered by ```dplyr```, you can chain these commands together to streamline your code, while keeping it very readable. The symbol for a pipe is ```%>%```. From a technical perspective, this hands the preceding object to the next function as the first argument. This may sound complicated, but will become clear after an example. The code below will create exactly the same data frame as the example above, but in a much more compact form.
```{r, message=FALSE, warning=FALSE}
music_data_new <- music_data %>%
select(top10_track_explicit_1, top10_artist_names, top10_track_streams) %>%
filter(top10_track_streams < 100000) %>%
arrange(top10_track_streams)
# Print to console
music_data_new
```
Let's unpack what happened here. The first line "pipes" ```music_data``` into the first function, ```select()```, which is in the second line. Here we remove all columns except for ```top10_track_explicit_1```, ```top10_artist_names``` and ```top10_track_streams```. Then we take this data frame with the reduced columns and hand it to ```filter()```, which only selects observations with less than 100000 streams. Finally, we pass the filtered, column reduced data frame to arrange, which sorts the rows by the number of streams per track. The assignment operator (```<-```) at the top then saves this data frame in the environment as ```music_new_data```. Note that, in contrast to the previous examples, we no longer have to specify which data frame we want to apply the various functions to, as the pipes take care of this for us.
### Dealing with strings
Strings (which is short for "character strings"), can be tough to deal with. They are unstructured, messy and getting them into a format that one can perform analysis with is often a task that requires a lot of time. However, seeing as they appear fairly frequently in data sets and often contain valuable information, it is definitely worth the time to learn how to deal with them.
#### The ```stringr``` package
A very accessible package for manipulating strings is the ```stringr``` package. It is designed to be as uniform as possible, meaning that once you have understood the basic syntax of any one of its functions it is very easy to apply all of them. It sacrifices some flexibility for this simplicity, so if you ever encounter a task you can not easily solve with ```stringr``` it is worth checking out the package it is built on, ```stringi```. For now, however, ```stringr``` will be more than sufficient.
The majority of functions in ```stringr``` are built around two core arguments: a string to be worked on and a pattern. There are quite a few that aren't, such as ```str_length()```, but these are (for the most part) fairly self explanatory and will not be explained further here. A good overview of the included functions can be found [here](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf).
The string to be worked on can either be an individual string in quotation marks or an entire vector or column of strings that the same operation should be applied to. The pattern can _technically_ also be a vector of patterns to look for, however, in this tutorial we will only use single patterns.
Recall the list of artist names in the ```music_data``` data frame.
```{r, message=FALSE, warning=FALSE}
music_data$top10_artist_names
```
Say, for example, we want to see which names contain a "g". ```stringr``` contains the function ```str_detect```, which tells you exactly that. Note that these functions are all case sensitive, i.e. the "G" in "David Guetta" is not detected.
```{r, message=FALSE, warning=FALSE}
str_detect(string = music_data$top10_artist_names, pattern = "g")
```
Say we want to be a bit more specific and want to know _exactly_ where the g is located in each artist name. ```str_locate_all``` returns a list of start and end values for each entry in the character vector.
```{r, warning=FALSE, message=FALSE}
str_locate_all(string = music_data$top10_artist_names, pattern = "g")
```
This obviously also works for patterns longer than just one letter.
```{r}
str_locate_all(string = music_data$top10_artist_names, pattern = "vin")
```
Another common task is to replace all instances of one character with another character. Say, for example, we want to replace all occurrences of "a" with "b". The function ```str_replace_all()``` does exactly that. Note that there is also a function called ```str_replace()```, but this only replaces the *first* match that the pattern finds.
```{r, message=FALSE, warning=FALSE}
str_replace_all(music_data$top10_artist_names, pattern = "a", replacement = "b")
```
A common application for this function is to replace all "." with "," or vice versa in price data. R, for example, needs the decimal separator to be a ".", so if you want to perform numerical analyses on price data that isn't in a uniform format, this function can be very useful.
Say we want to replace all "." with "," in the artist names.
```{r, message=FALSE, warning=FALSE}
str_replace_all(music_data$top10_artist_names, pattern = ".", replacement = ",")
```
This result may be somewhat surprising. Why did the function replace _every_ character with "," even though we specified ```pattern = "."```? The explanation for this can be found in how the function matches patterns.
#### A crash course in regex
Behind the scenes ```stringr``` uses something called regex (short for "regular expressions") to match patterns. regex allows you to match not just individual letters, but more abstract patterns. It does this by using special characters that do not match literally. You have already met the first such special character ".", which is regex's version of a wildcard, meaning that it matches _any_ other character. That's why, when we told stringr to replace all dots with commas, it replaced every character with a comma.
This raises the question of what to do if we want to literally match a dot and not any character. There are two ways to go about this. First, we could *escape* the wildcard character. What this means is that we use another special character to tell regex that the next character is meant to be taken literally and not as a wildcard. The symbol for this is a double back slash "\\\\". So, if we wanted to change only the dot to a comma we would write
```{r, message=FALSE, warning=FALSE}
str_replace_all(music_data$top10_artist_names, pattern = "\\.", replacement = ",")
```
The second way to achieve this would be by passing the pattern string through the ```fixed()``` function, which tells ```stringr``` that it should take the entire string literally. This means that the following code achieves the same result as using "\\\\".
```{r, message=FALSE, warning=FALSE}
str_replace_all(music_data$top10_artist_names, pattern = fixed("."), replacement = ",")
```
##### Other special characters in regex
We will quickly go through the most important regex special characters. Be aware that this list is by no means exhaustive and is only meant to give you some basic tools that can help you with string manipulation. Keep this in mind if a regex is displaying unexpected behavior, as it could be due to some wildcard you are not aware of. If all else fails you can always used the ```fixed()``` function to just match literal strings.
###### Square brackets ```[]```
Square brackets can be used to match from a set of different letters. This means that ```[abc]``` will match a, b or c. The following code will replace a, b or c with a capital X.
```{r}
str_replace_all(music_data$top10_artist_names, pattern = "[abc]", replacement = "X")
```
Note again that this is case sensitive, meaning that A, B and C are not replaced. However, square brackets are a great way to replace both capitalized and non-capitalized occurrences at once.
```{r}
str_replace_all(music_data$top10_artist_names, pattern = "[ABCabc]", replacement = "X")
```
If you include a ```^``` in the beginning of a square bracket pattern, regex will interpret that to mean any character _except_ the ones in brackets. This means that if we take the same code as before and include a ```^```, all letters except for A, B and C (and their non-capitalized counterparts) will be replaced by a capital X.
```{r}
str_replace_all(music_data$top10_artist_names, pattern = "[^ABCabc]", replacement = "X")
```
###### Repetition operators: ```*```,```+``` and ```{}```
Repetition operators can be used to match the same character (or set of characters) multiple times. ```+``` matches a character one or more times, ```*``` matches a character zero or more times and with ```{}``` you can specify the range that matches can occur in.
```{r}
vector <- c("", "a", "aa", "aaa", "aaaa")
```
```{r}
# Replace one or more a with an X
str_replace(vector, pattern = "a+", replacement = "X")
```
```{r}
# replace zero or more a with an X
str_replace(vector, pattern = "a*", replacement = "X")
```
```{r}
# replace exactly two a with an X
str_replace(vector, pattern = "a{2}", replacement = "X")
```
```{r}
# replace two to three a with an X
str_replace(vector, pattern = "a{2,3}", replacement = "X")
```
Note that the ```+``` and ```*``` operators are "greedy", meaning that they try to match as much as possible, which can often lead to unintended consequences. It is often a good practice with regex to be as specific as possible while remaining as general as needed.
###### Parentheses: ```()```
Parentheses are used to create groups. Groups always match in their entirety and can be combined with other operators.
```{r}
vector <- c("abc", "abcabc", "123abc", "abcabcabc")
str_replace_all(vector, pattern = "(abc){2}", replacement = "X")
```
The pattern ```"(abc){2}"``` will match only ```"abcabc"```, seeing as it looks for matches that repeat the group ```"(abc)"``` twice.
###### Optional characters: ```?```
The question mark tells regex that the preceding character is optional for a match.
```{r}
vector <- c("abc", "ac")
str_replace_all(vector, pattern = "abc", replacement = "X")
```
As expected, this only replaces the first element of the vector, as the second (```"ac"```) is not an exact match.
```{r}
str_replace_all(vector, pattern = "ab?c", replacement = "X")
```
By including ```?``` after the b, we tell regex that it is optional, i.e. that both ```"abc"``` and ```"ac"``` are correct matches. This can also be applied to groups and sets.
###### Anchors: ```^``` and ```$```
Anchors can be used to specify that a match should only occur at the very beginning or end of a character string, with ```^``` and ```$``` standing for the beginning and the end, respectively. Note that the ```^``` operator has a different meaning inside square brackets (```[]```), as discussed above.
```{r}
vector <- c("abc123", "123abc")
str_replace_all(vector, pattern = "^abc", replacement = "X")
```
This code only replaces the ```"abc"``` in ```"abc123"``` because it appears at the beginning of the string.
```{r}
str_replace_all(vector, pattern = "abc$", replacement = "X")
```
```abc$```, on the other hand, only matches the ```"abc"``` in ```"123abc"``` because it appears at the end of the string.
[regex can do a lot more than shown here](https://www.regular-expressions.info/tutorial.html), but these basic tools already enable you to do a lot of things that would take much more time when done by hand.
### Case study
Let's take everything we have learned in this chapter and apply it to a practical example. We will be using survey data from Qualtrics, which was created by a group of students for this course in 2017. We will only be looking at a small subsection of the variables available, to keep things from becoming unwieldy.
As always, let's first load all the required libraries and the data set and take a look at it.
```{r}
library(dplyr)
library(stringr)
data <- read.csv("https://raw.githubusercontent.com/IMSMWU/MRDA2018/master/Survey_data.csv?token=AVa281hYEyEQbqaSBcQZcYU-da4rv9xkks5bXyUMwA%3D%3D", stringsAsFactors = FALSE)
data
```
This data frame consists of 11 variables and 305 observations.
* _Progress:_ How much of the survey (in percent) was completed.
* _ResponseId:_ A unique ID for each participant
* _Q39:_ A multiple choice question on supermarket recognition. Participants were presented with 8 supermarket brands and asked which of them they were familiar with. A value of ```1,2,3,4,5,6,7,8``` means that the participant knew all eight, while, e.g., ```3,5``` means that the person only knew supermarkets 3 and 5.
* _Q18_1_ to Q18_1_8:_ A series of questions on willingness to pay for various products.
The data has a couple problems we need to take care of before we can start analyzing it properly.
* _The first two rows:_ Qualtrics data comes with two rows that contain no useful information. Additionally, these force all columns to be of type "character", which we can't perform all types of analyses on.
* _Not all respondents finished the survey:_ As you can tell by the progress column, not all respondents finished the survey. To be able to perform proper analysis, we only want those that completed the survey.
* _Multiple choice question:_ The multiple choice question is currently in a format that is very hard to work with. It would be best to have eight individual columns that each correspond to an individual supermarket.
* _The price data is a mess:_ The price data is arguably the hardest challenge. The data is not uniform, with the decimal separator symbol varying and some rows containing additional text and symbols that we do not need. We need to filter out only the relevant parts and then transform it from a character to a numeric column to work on it.
As a first step, we will filter out all observations where the progress column is unequal 100. This has the added advantage of removing the first two rows as well.
```{r}
data <- data %>%
filter(Progress == 100)
```
Next we want to turn the multiple choice question into something a bit more useful. We will create eight new columns, called ```Q39_A1``` to ```Q39_A8```, which contain a 1 if the respective number appears in ```Q39``` and a 0 otherwise. So, for example, if ```Q39```is equal to ```1,2,5``` then ```Q39_A1```, ```Q39_A2``` and ```Q39_A5``` will be set to 1 and all others will be set to 0.
We will achieve this with mutate and str_detect.
```{r}
data <- data %>%
mutate(Q39_A1 = str_detect(Q39, pattern = "1"),
Q39_A2 = str_detect(Q39, pattern = "2"),
Q39_A3 = str_detect(Q39, pattern = "3"),
Q39_A4 = str_detect(Q39, pattern = "4"),
Q39_A5 = str_detect(Q39, pattern = "5"),
Q39_A6 = str_detect(Q39, pattern = "6"),
Q39_A7 = str_detect(Q39, pattern = "7"),
Q39_A8 = str_detect(Q39, pattern = "8"))
```
First, we always define the name of the new column we want to create. Then we tell mutate the function with which to create the new column. In this case we use ```str_detect()``` to check if the correct number appears in the column ```Q39```. You can read the first line in ```mutate()``` as: "If a 1 appears in the field ```Q39``` set the column ```Q39_A1``` to ```TRUE``` and otherwise set it to ```FALSE```".
Now is a good time to take a look at the columns we have created and see if it did what we wanted.
```{r}
data %>%
select(Q39, Q39_A1:Q39_A8)
```
Everything seems to have worked! In a later step we will change the values ```TRUE``` and ```FALSE``` to 1 and 0, respectively, simply because it is easier to work with.
The last big task is to clean up the price data. Let's take a look at it.
```{r}
data %>%
select(Q18_1_1:Q18_1_8)
```
The problems can be grouped roughly into two categories:
* _Additional symbols:_ Some respondents added additional symbols or text that we need to get rid of.
* _Wrong decimal separator:_ R needs all decimal separators to be ```.``` and not ```,```, so we need to make sure all of them are dots.
We will tackle both of these problems with ```mutate_at()``` and the the ```stringr``` package. First we will strip out any character that isn't a number, a comma or a dot. ```str_remove_all``` removes all instances of characters that are matched by the expression given in "pattern". Note that the function ```str_remove_all``` is called without parentheses. Its argument(s) are passed directly to ```mutate_at()```, separated by commas. The pattern we use here is ```[^0-9,\\.]```, which can be read as "match all characters that are not (remember that a ```^``` in square brackets means 'everything but') ```0-9``` a comma (```,```) or a dot (```\\.```)". Remember that the dot is a special character and therefore has to be escaped with the double backslash.
```{r}
data <- data %>%
mutate_at(vars(Q18_1_1:Q18_1_8), str_remove_all, pattern = "[^0-9,\\.]")
```
Let's take a look and see if the code did what we wanted it to.
```{r}
data %>%
select(Q18_1_1:Q18_1_8)
```
So far, so good. To complete the clean up of the price data, we simply want to replace any instances of a comma with a dot. The code for this is quite simple.
```{r}
data <- data %>%
mutate_at(vars(Q18_1_1:Q18_1_8), str_replace_all, pattern = ",", replacement = ".")
# Print variables we just mutated to see if everything worked
data %>%
select(Q18_1_1:Q18_1_8)
```
As a final step we want to convert all variables that we want to work with into numeric variables. Again, we use ```mutate_at()```, this time coupled with the ```as.numeric()``` function.
```{r}
data <- data %>%
mutate_at(vars(Q18_1_1:Q39_A8), as.numeric)
```
Finally, let's have a look at the finished data frame. All the relevant data is now in a format that we can perform further analysis with.
```{r}
# Print entire data frame
data
```