# Importing Data into R with readr
<iframe width="560" height="315" src="https://www.youtube.com/embed/Oruk2xEFqFk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
## Instructions
This tutorial will teach you the basics of importing data from your hard drive into R. We will cover how to import a Comma-Separated Values (csv) file into R using the `read_csv()` function in the readr package. We will also cover the different data types that R can recognize in a csv file, how to control how the data are displayed in R, and how to export the modified data back into a csv file.
Accompanying this tutorial is **a short [Google quiz](https://forms.gle/N9yDv5tkQMiGiUB29)** for your own self-assessment. The tutorial will clearly indicate when you should answer each question.
## Learning Objectives
* Know how to import a Comma-Separated Values (csv) file from a hard drive into R.
* Understand the basics of `read_csv()`, including how to use it to import data and how to control how the data are displayed in R.
* Be familiar with the different data types that R can recognize.
* Know how to import other data files such as txt, xlsx, xpt, and sas into R.
* Know how to export a data frame from R to a csv file on a hard drive.
## Set Up
In this tutorial, we will be using the readr package, so we will need to install it and attach it to our R session. The readr package is part of the **core tidyverse**, a collection of R packages whose functions mainly help us organize data. In this tutorial series, we will be covering three packages from the tidyverse: readr (tutorial 2), dplyr (tutorial 4), and ggplot2 (tutorial 5).
```{r}
#install.packages("readr")
library(readr)
```
For this tutorial, we will be using the `demo_csv.csv` file. This dataset is a subset of the National Health and Nutrition Examination Survey (NHANES) conducted by the National Center for Health Statistics (NCHS). Our `demo_csv.csv`, in particular, contains a portion of the demographic information about the survey's participants in the years 2013-2014. We will cover NHANES in more detail in tutorial 3. For now, you can explore NHANES in general by visiting this [website](https://wwwn.cdc.gov/nchs/nhanes/Default.aspx).
After attaching the readr package, the other thing that we need to complete this tutorial is a csv file. Note that the csv file should be in your working directory, since this makes our lives much easier when we want to import data from a hard drive into R. Recall that we can use the function `dir()` to check if all of the files we need are in our working directory.
```{r}
dir()
```
You may be a bit confused about the output of the previous code. This is because we are working on Kaggle and the csv file that we will be using is not located in our working directory. More specifically, the csv file is in the *input* folder, whereas our working directory is the *output* folder.
```{r}
getwd()
```
There are two ways that we can approach this problem. We will go over how to do both in the next section of this tutorial.
#### DO QUESTION 1 OF THE QUIZ NOW {-}
* **REVIEW:** Which of the following functions lets us set a new working directory?
## Basics Of Importing a CSV File Into R
### Method 1: Setting a Different Working Directory
As mentioned in tutorial 1, we can set our working directory to the location of the csv file in order to import it into R. After successfully importing the file, we can then set the working directory back to our original folder.
```{r}
setwd("./data")
```
Perfect! Now that we know the csv file is in our working directory, we can import it using `read_csv()`. This function is relatively easy to use. All we need to do is put the name of our csv file, including the `.csv` extension, in quotation marks (`""`) inside the parentheses, and we are good to go!
```{r}
read_csv("data/demo_csv.csv")
```
You should see a "Column specification" list and the `demo_csv.csv` file imported into a data frame in R after running the code above. We will go over what "Column specification" means later in this tutorial.
Now that we have successfully imported our csv file into R, it is time for us to set our working directory back to our original directory.
```{r}
setwd("..")
```
```{r}
getwd()
```
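As a minimal sketch of Method 1 in a single chunk, the code below relies on the fact that `setwd()` invisibly returns the previous working directory, so that directory can be stored and restored afterwards. The folder and the file `toy_demo.csv` are hypothetical stand-ins created in a temporary directory; they are not part of the tutorial data.

```{r}
library(readr)

# Hypothetical demo file written to a temporary folder (not the tutorial data).
demo_dir <- tempdir()
writeLines(c("id,gender,age", "1,Male,34", "2,Female,51"),
           file.path(demo_dir, "toy_demo.csv"))

# setwd() invisibly returns the previous working directory, so we keep it.
old_wd <- setwd(demo_dir)

# With the working directory changed, the file name alone is enough.
toy <- read_csv("toy_demo.csv")

# Restore the original working directory.
setwd(old_wd)
toy
```

Storing the old directory this way avoids having to remember how many `..` steps lead back to the original folder.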
### Method 2: Copying the Exact Pathway of the File
Another way for us to import the csv file into R is to copy and paste the exact path of the file into `read_csv()`, without changing the working directory. You should see the exact same "Column specification" and data frame as before!
```{r}
read_csv("data/demo_csv.csv")
```
Note that you can also store this imported data into an object using `<-`.
```{r}
DEMO <- read_csv("data/demo_csv.csv")
```
Now, we can just type `DEMO` to see the data frame.
```{r}
DEMO
```
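If typing a long path by hand feels error-prone, one option is to build it with base R's `file.path()`, which joins folder and file names with a separator. The sketch below assumes a hypothetical folder `toy_data` and file `toy_demo.csv` created inside a temporary directory; swap in your own folder and file names.

```{r}
library(readr)

# Hypothetical folder and file, created only for this illustration.
toy_dir <- file.path(tempdir(), "toy_data")
dir.create(toy_dir, showWarnings = FALSE)
toy_path <- file.path(toy_dir, "toy_demo.csv")
writeLines(c("id,gender,age", "1,Male,34", "2,Female,51"), toy_path)

# The full path is passed to read_csv(); the working directory is untouched.
TOY <- read_csv(toy_path)
TOY
```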
#### DO QUESTIONS 2-4 OF THE QUIZ NOW {-}
* **REVIEW:** Which of the following codes will print the entire DEMO data frame?
* `read_csv()` can also be used to import Excel and txt files. (True or False)
* Which R package does the function `read_csv()` belong to?
### [Try it yourself 3.1][3.1] {-}
Can you try importing the `bpx.csv` file into R using the function `read_csv()`?
### Key Notes About Importing Data into R
There are a few key things that we should note when using `read_csv()`:
1. The file name or pathway to the file needs to be in `""`,
2. The file extension, `.csv`, needs to be present, and
3. The name of the file needs to be **exact**.
The third point is related to one of the most common mistakes. When importing any data from your hard drive into R, you need to make sure that the file name that you write in R is **exactly** what is displayed on your hard drive. For instance, take note of spaces, capital letters, the spelling of words, and the correct extension. In other words, `demo_csv-1.csv` or `Demo_csv.csv` is not the same as `demo_csv.csv`.
Another point to note is that `read_csv()` automatically assumes that the first row of your csv file is the header. We will learn how to tell R this assumption is not correct in section 3 of this tutorial.
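One way to guard against a mistyped file name is to list the csv files in a folder with `dir()` and compare names exactly before importing. This is only a sketch: the folder is a temporary directory and `toy_demo.csv` is a hypothetical file name used for illustration.

```{r}
# Hypothetical file in a temporary folder, standing in for your own data.
folder <- tempdir()
writeLines(c("id,age", "1,34"), file.path(folder, "toy_demo.csv"))

# List only the files whose names end in .csv.
csv_files <- dir(folder, pattern = "\\.csv$")
csv_files

# %in% compares names character by character, so capital letters matter.
exact_match <- "toy_demo.csv" %in% csv_files
wrong_case <- "Toy_demo.csv" %in% csv_files
exact_match
wrong_case
```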
#### DO QUESTION 5 OF THE QUIZ NOW {-}
* **REVIEW:** R is case sensitive. (True or False)
### [Try it yourself 3.2][3.2] {-}
Can you identify the mistakes in the following code?
```{r}
# a.
# read_csv(../input/import/demo_csv.csv)
# b.
# read_csv("data/DEMO_csv.csv")
# c.
# Read_csv("data/demo_csv.csv")
# d.
# read_csv(data/"demo_csv.csv")
```
#### DO QUESTION 6 OF THE QUIZ NOW {-}
* Which of the following statements about `read_csv()` are correct? (select all that apply)
### Column Specification
You may notice that when you import data into R by running `read_csv()`, a "Column specification" list appears. This list tells us two things:
1. The **names** of our columns and
2. The **type of data** that each column contains.
```{r}
read_csv("data/demo_csv.csv")
```
As we can see after running the code above, there are five columns in our data frame: id (the participant's unique ID number), gender, age, race, and edu (highest level of education).
We can also see that there are two types of data in this data frame: `col_double()` and `col_character()`.
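The column specification can also be retrieved again after the import with readr's `spec()` function, which is handy when the message has scrolled out of view. The sketch below uses a small hypothetical file written to a temporary location rather than `demo_csv.csv`.

```{r}
library(readr)

# Hypothetical three-column file, created only for this illustration.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,gender,age", "1,Male,34", "2,Female,51"), toy_path)

toy <- read_csv(toy_path)

# Print the column specification that read_csv() guessed for this file.
spec(toy)
```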
#### DO QUESTION 7 OF THE QUIZ NOW {-}
* Which of the following is the best example of data that would be classified as `col_double()`?
### [Try it yourself 3.3][3.3] {-}
Just by looking at the actual data frame, can you guess what type of data `col_double()` and `col_character()` are?
(**HINT**: doubles? integers? logical? character?)
## More Arguments Of read_csv
### Skip
There is a range of other arguments that we can use with `read_csv()`. Firstly, we can add `skip` inside the `()` of `read_csv()` to tell R to skip (that is, not import) a certain number of rows when importing our data.
```{r}
Skip_2 <- read_csv("data/demo_csv.csv", skip = 2)
```
```{r}
Skip_2
```
```{r}
DEMO
```
When comparing the *Skip_2* table with our original *DEMO* table, we can see that *Skip_2* has two fewer rows. This is because the argument `skip = 2` has told R not to import the first two lines of our `demo_csv.csv`.
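The same comparison can be made on a file small enough to check by eye. This sketch assumes a hypothetical five-line csv file written to a temporary location; it is not the tutorial data.

```{r}
library(readr)

# Hypothetical file: one header line followed by four data lines.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,gender,age",
             "1,Male,34",
             "2,Female,51",
             "3,Male,27",
             "4,Female,45"), toy_path)

toy_full <- read_csv(toy_path)
toy_skip <- read_csv(toy_path, skip = 2)

# Compare the number of rows and the column names of the two imports.
nrow(toy_full)
nrow(toy_skip)
names(toy_full)
names(toy_skip)
```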
#### DO QUESTION 8 OF THE QUIZ NOW {-}
* Which of the following statements is true about the argument `skip`?
### [Try it yourself 3.4][3.4] {-}
You may also notice that the header of *Skip_2* is incorrect. This is because `skip` counts lines from the very top of `demo_csv.csv`, so the real header is one of the skipped lines and R treats the next line it reads as the header.
Let's say this is not what we really want. What we actually want to do is to remove the first two rows of actual data while keeping the header. What do you think we have to do to achieve this?
(**HINT**: Recall what we learned about extracting rows in tutorial 1)
### Remove Header & Header Names
Recall how `read_csv()` assumes that the first row of our data is the header. If this is not true, we can use `col_names = FALSE` to tell R that the first row of our data does not contain headers and that R should add headers for our data.
```{r}
No_header <- read_csv("data/demo_csv.csv",
col_names = FALSE)
```
```{r}
No_header
```
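To see what `col_names = FALSE` does when a file really has no header, the sketch below reads a hypothetical header-less file from a temporary location. readr fills in placeholder column names in this case.

```{r}
library(readr)

# Hypothetical file with no header line: every line is data.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("1,Male,34", "2,Female,51"), toy_path)

toy_no_header <- read_csv(toy_path, col_names = FALSE)

# Both lines are kept as data, and the columns are named X1, X2, X3.
toy_no_header
names(toy_no_header)
```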
We can also change the names of our headers by using `col_names =` followed by a vector of names. For example:
```{r}
Header_names <- read_csv("data/demo_csv.csv",
col_names = c("ID", "Gender", "Age", "Race", "Education"))
```
```{r}
Header_names
```
#### DO QUESTION 9 OF THE QUIZ NOW {-}
* In which of the following scenarios do you think we would **NEED** to use `col_names = FALSE`? (select all that apply)
With the added `col_names`, you may notice that the column specification for our data is now incorrect (everything is recognized as `col_character()`)!
This is because R now reads "id", "gender", "age", "race", and "edu" as a row of data, and since all of these are text, R recognizes every column as `col_character()`. This is worth noting when you are importing data into R.
We can solve this problem by also skipping the original header row with `skip = 1`:
```{r}
(Skip_and_Header_Names <- read_csv("data/demo_csv.csv",
skip = 1,
col_names = c("ID", "Gender", "Age", "Race", "Education")))
```
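The effect of adding `skip = 1` can be checked on a small example by comparing the class of each column with and without it. The file and the names in `new_names` below are hypothetical and exist only for this sketch.

```{r}
library(readr)

# Hypothetical file with a header line and two data lines.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,gender,age", "1,Male,34", "2,Female,51"), toy_path)

new_names <- c("ID", "Gender", "Age")

# Without skip, the old header line is read as the first row of data.
renamed_only <- read_csv(toy_path, col_names = new_names)

# With skip = 1, the old header line is dropped before the new names are applied.
renamed_and_skipped <- read_csv(toy_path, skip = 1, col_names = new_names)

sapply(renamed_only, class)
sapply(renamed_and_skipped, class)
```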
### Missing Values
We can also define missing values by using `na =`. For example, if we want to assign "Some college or AA degree" under the edu column as NA, we can use the following code:
```{r}
Missing_values <- read_csv("data/demo_csv.csv",
na = "Some college or AA degree")
```
```{r}
Missing_values
```
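One detail worth knowing is that the value given to `na =` replaces readr's default of `c("", "NA")` rather than adding to it. If blank cells and the text "NA" should still count as missing, they can be listed together with the custom value. The sketch below illustrates this on a hypothetical file written to a temporary location.

```{r}
library(readr)

# Hypothetical file in which one edu value is already the text "NA".
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,edu",
             "1,College graduate",
             "2,Some college or AA degree",
             "3,NA"), toy_path)

# Default behaviour: only "" and "NA" are treated as missing.
default_na <- read_csv(toy_path)

# Custom behaviour: keep the defaults and add one more value.
custom_na <- read_csv(toy_path,
                      na = c("", "NA", "Some college or AA degree"))

# Count the missing values in the edu column of each import.
sum(is.na(default_na$edu))
sum(is.na(custom_na$edu))
```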
#### DO QUESTION 10 OF THE QUIZ NOW {-}
* Only characters can be assigned a value of NA; there is a different missing-value designation for numeric values. (True or False)
## Importing Other File Types into R
While csv is the most common file type to import into R, we can also import other types of data files into R using different functions. In this section, you will be introduced to the very basics of how to import txt, xlsx, xpt, and sas files into R.
### Text file (txt)
The simplest function that we can use to import a txt file is `read.table()`. This function belongs to the default Base R package, so we do not need to install or attach any packages before using it!
The first argument of this function is the file path. What do you think `header = TRUE` means?
```{r}
read.table("data/demo_txt.txt", header = TRUE)
```
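A quick way to explore the `header` argument is to read the same small file twice and compare the results. The space-separated file below is a hypothetical example written to a temporary location, not `demo_txt.txt`.

```{r}
# Hypothetical space-separated text file with a header line.
toy_txt <- tempfile(fileext = ".txt")
writeLines(c("id gender age", "1 Male 34", "2 Female 51"), toy_txt)

with_header <- read.table(toy_txt, header = TRUE)
without_header <- read.table(toy_txt, header = FALSE)

# Compare the column names and the number of rows of the two imports.
names(with_header)
names(without_header)
nrow(with_header)
nrow(without_header)
```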
### Excel file (xlsx)
To import an xlsx file into R, we use `read_excel()`. But before we can use this function, we need to install and attach the readxl package. Like `read.table()`, this function requires a file path.
```{r}
# install.packages("readxl")
library(readxl)
read_excel("data/demo_xlsx.xlsx")
```
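An Excel workbook can hold several sheets, so it can help to list them before importing. The sketch below uses `datasets.xlsx`, an example workbook that ships with the readxl package, rather than the tutorial's own file; the choice of the "iris" sheet is just for illustration.

```{r}
library(readxl)

# Path to an example workbook bundled with readxl (not the tutorial data).
example_path <- readxl_example("datasets.xlsx")

# List the sheet names in the workbook.
sheets <- excel_sheets(example_path)
sheets

# Import one sheet by name.
iris_sheet <- read_excel(example_path, sheet = "iris")
head(iris_sheet)
```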
### [Try it yourself 3.5][3.5] {-}
Import the `bpx.xlsx` into R using the `read_excel()` function.
### XPT File Extension
Another file type that you may need to import into R is xpt. To do this, we need the function `read.xport()` that belongs to the SASxport package.
```{r}
# install.packages("SASxport")
library(SASxport)
read.xport("data/demo_xpt.xpt")
```
### Statistical Analysis Software (SAS)
We can use the function `read_sas()` to import SAS data files (these usually have the extension `.sas7bdat`) into R. But before we do this, we need to install and attach the haven package.
```{r}
# install.packages("haven")
library("haven")
read_sas("data/demo_sas.sas")
```
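To try `read_sas()` without a SAS file of your own, one option is the small example file that the haven package installs alongside its code. The sketch below assumes that `iris.sas7bdat` is present in haven's `examples` folder, as in the package's own documentation examples.

```{r}
library(haven)

# Locate an example SAS data file bundled with haven (assumed to be installed).
sas_path <- system.file("examples", "iris.sas7bdat", package = "haven")

# Import it in the same way as any other SAS data file.
iris_sas <- read_sas(sas_path)
head(iris_sas)
```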
## Exporting the Data Frame From R
After changing and manipulating our data in R, we can also export it back into a csv file to share it. To do this, we can use the function `write_csv()`. For example, let's say we want to export our "Missing_values" data frame.
```{r}
write_csv(Missing_values, "data/Missing Values.csv")
```
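We can also export it to a csv file that Excel reads correctly by using `write_excel_csv()`. Note that this function still writes a csv file, not an xlsx workbook; it only adds a byte order mark so that Excel recognizes the file as UTF-8 encoded.
```{r}
write_excel_csv(Missing_values, "data/Missing Values Excel.csv")
```
Now if we check the *data* folder, there should be 2 new files, "Missing Values.csv" and "Missing Values Excel.csv"!
```{r}
dir("data")
```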
Congratulations! We have now succeeded in exporting a dataset from R to an external file! This will make our work much easier to share and access!
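A simple way to confirm that an export worked is to read the file straight back in and look at it. This sketch uses a small hypothetical data frame and a temporary file named `toy_export.csv` so that it does not touch the tutorial's *data* folder.

```{r}
library(readr)

# Hypothetical data frame with one missing value.
toy <- data.frame(id = c(1, 2),
                  edu = c("College graduate", NA))

# Export to a temporary csv file.
out_path <- file.path(tempdir(), "toy_export.csv")
write_csv(toy, out_path)

# Look at the raw lines of the file, then import it again.
readLines(out_path)
toy_back <- read_csv(out_path)
toy_back
```

Missing values are written as the text `NA` by default, which `read_csv()` turns back into missing values on import.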
## Summary and Takeaways
In this tutorial, we have learned how to import csv files from our hard drive into R using `read_csv()`. This is an important first step in data analysis or manipulation since we need to be able to have the data in R in order to process it!