-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathR3_getting_and_cleaning_data.rmd
499 lines (321 loc) · 19.3 KB
/
R3_getting_and_cleaning_data.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
---
title: "R3 | Getting and Cleaning Data"
author: "Sean Mussenden"
date: "3/2/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## What We're Doing
In this hourlong session, we're going to show you:
1. How to load flat data (tables or spreadsheets) into R in several different ways.
* Read in data that comes preloaded as part of a package with the `babynames` package.
* Read in a CSV file on your local machine with the `readr` package.
* Download a CSV file located somewhere on the Internet with the `readr` package.
* Read in data from an Excel file stored locally with the `readxl` package.
* Read in data from a Google Sheet with the `googlesheets4` package.
* Read in data from an HTML table from the Internet and coerce it into an R dataframe, a basic form of web scraping with the `rvest` package.
2. Introduce you to, but not teach in detail, information about alternate data loading methods, including packages that will help you:
* Pull data from an SQL database with the `dbplyr` package.
* Pull data from an API endpoint on the web with .
* Easily pull data from specific APIs (like Twitter or the U.S. Census).
3. Do some basic cleaning tasks once you've loaded data to prepare it for analysis.
* Bulk rename dirty column names, using the Janitor package.
* Convert column types, including numbers and text.
* Manipulate columns containing strings to make them usable.
This class isn't designed to make you an expert in any of these concepts. We're introducing them to you, as a foundation for further learning and giving you a sense of where to find more information.
### How We're Doing It
In the exercises below, we'll work through some examples together. Follow along with me.
### Getting Started
First, you'll need to know the location of the GitHub repository for this course.
It's https://github.com/walinchus/BabyNames. If you aren't on a NICAR computer, clone the repo and open this file.
You should also create a new RProject for this course, if you haven't in early sessions.. Among other things, it will set your working directory so local files load correctly. Go to File > New Project > Existing Directory and select the desktop folder for this course.
### Loading Libraries
We'll need to load several packages for this class. They are:
* The [Tidvyerse](https://www.tidyverse.org/) collection of packages. We'll be making extensive use of the readr, dplyr and stringr packages,which all load as part of the tidyverse core.
* [readr](https://readr.tidyverse.org/)
* [dplyr](https://dplyr.tidyverse.org/)
* [stringr](https://stringr.tidyverse.org/)
* The [rvest](http://rvest.tidyverse.org/) package for web scraping.
* The [janitor](https://github.com/sfirke/janitor) package for data cleaning.
* The [readxl](https://readxl.tidyverse.org/) package for loading Excel files.
* The [googlesheets4](https://googlesheets4.tidyverse.org/) package for reading -- and writing -- data stored in a Google Sheet.
* The [babynames](https://cran.r-project.org/web/packages/babynames/index.html) package for loading social security babyname data.
```{r}
## Install packages if you need them by uncommenting these lines
#install.packages('babynames')
#install.packages('tidyverse')
#install.packages('rvest')
#install.packages('janitor')
#install.packages('readxl')
#install.packages('googlesheets4')
#install.packages('babynames')
# Turn off scientific notation
options(scipen=999)
# Load libraries
library(tidyverse)
library(rvest)
library(janitor)
library(readxl)
library(googlesheets4)
library(babynames)
library(lubridate)
```
### Loading preloaded data from a package
If you've taken earlier classes in this sequence, you'll have loaded Social Security Administration data on names of babies born across the last ~100 years. It comes built in with the 'babynames' package we loaded.
To refresh, let's load that now and store it as an object called `babynames_package_df`.
We're going to load the same data from different locations, using different methods, to show you how some of the many ways to get data into R.
```{r}
# Load babynames
babynames_package_df <- babynames
# Show first 10 records
babynames_package_df %>%
head(10)
```
#### Loading data from a local flat file
Loading data from a flat file -- like a csv -- stored on your local machine is a very common data loading task. For that, we'll use the `read_csv()` function that's part of the `readr` package that loads with the `tidyverse`.
We're going to read in a sample of the babynames data that contains all of the names that start with "A".
Store it as an object called "babynames_a_csv" (the name doesn't matter, can be anything). The information inside of `read_csv()` is the filepath to the data, which is stored in the data > R3.
```{r}
babynames_a <- read_csv("data/R3/babynames_a.csv")
```
#### Loading data from a flat file on the internet
We can load that same csv from the internet, instead of doing it locally. We just need to change the information inside of the `read_csv()` function to pass a URL instead of a location on our computer. The CSV is up on Github.com, here: https://raw.githubusercontent.com/walinchus/BabyNames/master/data/R3/babynames_a.csv
```{r}
babynames_a <- read_csv("https://raw.githubusercontent.com/walinchus/BabyNames/master/data/R3/babynames_a.csv")
```
#### Loading data from an Excel file
Another data format you'll likely see in the wild: Excel files. They're a little trickier than csvs, because they can contain multiple sheets.
We can use the `read_xlsx()` function from the `readxl` package to load data from Excel files. There's an Excel file in the data/R3 folder called babynames_excel. It has two sheets, one with all of the "A" babynames, followed by one with all of the "B" babynames.
Let's load the As as a dataframe and store it as an object called babynames_a_excel.
```{r}
babynames_a_excel <- read_xlsx("data/R3/babynames_excel.xlsx")
```
Because "A" is the first sheet in the Excel file, the function loads the "A" sheet. But if we need to get any other sheet, like "B" we'll have to get it by adding another argument to the read_xlsx function.
Let's load B now and store it as an object called babynames_b_excel. Notice we've included the name of the sheet.
```{r}
babynames_b_excel <- read_xlsx("data/R3/babynames_excel.xlsx", sheet="babynames_b")
```
#### Loading data from google sheets
We can also load data directly from Google Sheets, using the `read_sheet()` function from the `googlesheets4` package. Google Sheets, like Excel files, can also have multiple workbooks.
I've loaded up a Google Sheet workbook here with two sheets, one for "A" babynames and one for "B" babynames: https://docs.google.com/spreadsheets/d/1GG_RmYPGCNKbLb4x1dkQv1B9NjDBF0kQRtge9Aea4rs. These samples only have 1,000 records in them.
To read in the sheet, we need to paste in the sheet ID, which is the long sting of numbers and letters at the end of the URL: "1GG_RmYPGCNKbLb4x1dkQv1B9NjDBF0kQRtge9Aea4rs"
Because the "A" sheet comes first, it automatically loads that sheet.
```{r}
babynames_a_sheet <- read_sheet("1GG_RmYPGCNKbLb4x1dkQv1B9NjDBF0kQRtge9Aea4rs")
```
If we want to load any other sheet, we need to tell it which one to load.
```{r}
babynames_b_sheet <- read_sheet("1GG_RmYPGCNKbLb4x1dkQv1B9NjDBF0kQRtge9Aea4rs", sheet = "babynames_b")
```
A few notes on the `googlesheets4` package.
* If you want to access a sheet that isn't public on the web, you'll have to authenticate R studio with Google first. You can do that by running `gs4_auth()`.
* In addition to loading data from Google Sheets, you can also WRITE data from R Studio to Google sheets. This is such a great way to share data you've analyzed with other, less tech-savvy members of your team. More on that here: https://googlesheets4.tidyverse.org/index.html.
#### Loading data from an HTML table
On this page is an HTML table showing the top 2020 baby names. https://www.ssa.gov/oact/babynames/
We can use the inspector in our web browser to see this. It's composed of html tags. Using the `rvest` package we're going to first read in the entire html page using the `read_html()` function.
```{r}
top_2020_baby_names <- read_html("https://www.ssa.gov/oact/babynames/")
```
Open it up in the environment window. It's the full HTML of the page.
Next, we'll use `rvest` to extract the html table we want using `html_table()`
```{r}
top_2020_baby_names <- read_html("https://www.ssa.gov/oact/babynames/") %>%
html_table()
```
Open it up in the environment window. We've isolated that one html table and turned it into a dataframe, albeit one that is nested in a list. Let's extract it from the list.
```{r}
# Load html and extract table
top_2020_baby_names <- read_html("https://www.ssa.gov/oact/babynames/") %>%
html_table()
# Pluck out the table
top_2020_baby_names <- top_2020_baby_names[[1]]
# display it
top_2020_baby_names
```
Never web scraped before? Now you have! A warning that web scraping does get more challenging from here.
#### Other ways of getting data
There are lots of other ways to bring data into R. We don't have to delve into all of them today, but you should be aware of them for continued learning.
* Pull data from an SQL database with the `dbplyr` package. [dbplyr](https://dbplyr.tidyverse.org/)
* Get data from an API endpoint - You can access data available through pretty much any API, or application programming interface, by sending queries and getting back a response, using the excellent [HTTR package](https://httr.r-lib.org/). Some examples of APIs include:
* [Spotify data](https://developer.spotify.com/documentation/web-api/)
* [Twitter's API](https://developer.twitter.com/en/docs)
* Get data from packages the R community has built to load and work with specific kinds of data, including some nice helper functions. If the dataset you want to use is popular, there's a great change someone has built an R package for it! Some great examples:
*[Tidycensus](https://walkerke.github.io/tidycensus/index.html): for loading and working with all manner of U.S. Census Data.
*[RTweet](https://rtweet.info/): for working with Twitter data
#### Clear enivronment
Before moving on to the data cleaning section, let's clean up all those dataframes in our environment with this handy bit of code.
```{r}
rm(list=ls())
```
#### Data Cleaning
Data cleaning is a critical step in preparing data for analysis. Everyone wants to jump right to the fun stuff -- me too! -- but failing to account for flaws in the data can lead to errors, which can lead to corrections.
To start, we're going to load a dataframe with info on year-by-year totals for the most popular female baby names from 2008 to 2017. I've intentionlly made this very ugly.
```{r}
babynames_dirty <- read_rds("data/R3/babynames_dirty.rds")
```
#### Identify Issues
Examine the data in the environment window.
What problems do you see? Don't scroll down until you think you've found one or two of the problems.
Are you sure you can't find more?
Are you sure? Ok. Here's what I see.
* EMPTY ROWS - There's a totally empty row of NAs. Let's get rid of it. It could prevent proper summarizing.
* EMPTY COLUMN - There's a totally empty column called source. Let's get rid of it, because it's unneeded.
* INCONSISTENT VALUES - The "sex" column has inconsistent values for Female vs F. This could hurt our ability to group and summarize properly.
* STYLING INCONSISTENCIES - There are lots of issues in "name_at_birth" column. MADISON is upper. elizabeth is lower. Abigail has "First Name: Abigail" instead of just Abigail. Everything else is title case. This is just...ugly.
* DUPLICATE ROWS - There are two apparently identical rows for Emma and Amelia (or are they?). We should really only have one, and could lead to misleading summarization if we kept both.
* NUMBERS STORED AS CHARACTER STRINGS - The 2015 column has a random underscore appended to the end of the number, unlike every other year column. The fact that it's a mix of characters and numbers means the column is stored as a "character", while all of the other year columns are numbers (integers). If we want to calculate say, the percent change in a name between 2008 and 2015, we'll get an error.
* BAD COLUMN NAMES - Less obvious, but still annoying: the fact that the year columns start with a number means they'll be harder to include in our code.
Let's examine each of these, while fixing them.
Our goal will be one nice long chained function to fix all the problems, so we have a clear record of what we've done.
#### Remove empty rows and columns
First, let's get rid of that totally NA row and column. The `janitor` package has a nice function for getting rid of empty rows and columns
```{r}
babynames_clean <- babynames_dirty %>%
remove_empty()
# just empty rows? remove_empty(which="rows")
# just empty columns? remove_empty(which="columns")
# both rows and columns? remove_empty()
```
#### Remove duplicates
Right now, we have two identical rows with values for "Amelia" and "Emma". We can remove those by using the `distinct()` function, which looks for duplicate values and keeps only one.
```{r}
babynames_clean <- babynames_dirty %>%
remove_empty() %>%
distinct()
```
That appears to have removed the duplicate "Amelia" rows, but not the duplicate "Emma" rows.
Why? Because, if we examine the Emma row closely, we'll see that one has extra whitespace on both sides of the name, and the other doesn't. That means the rows aren't EXACTLY equal.
We can fix that with our first of many string functions, from the `stringr` package, which has lots of helpful functions for cleaning character strings. The function we'll use is called `str_trim()` and it's good at trimming extra whitespace.
```{r}
babynames_clean <- babynames_dirty %>%
remove_empty() %>%
mutate(name_at_birth = str_trim(name_at_birth,side="both")) %>%
distinct()
```
Now when we run it again, `distinct()` gets rid of the extra Emma row because the difference has been fixed.
#### Clean up names column
There are still lots of issues with the "name_at_birth" column.
MADISON is UPPERCASE; elizabeth is lowercase; everything else is Title Case.
Let's fix that first with a function called `str_to_title()`, which will convert everything to title case.
```{r}
babynames_clean <- babynames_dirty %>%
remove_empty() %>%
mutate(name_at_birth = str_trim(name_at_birth,side="both")) %>%
distinct() %>%
mutate(name_at_birth = str_to_title(name_at_birth))
```
The next issue will be a bit more challenging, fixing Abigail. Her name has extra stuff -- "First Name: Abigail" -- that the others don't have. We can remove that extra stuff with `str_remove()`.
```{r}
babynames_clean <- babynames_dirty %>%
remove_empty() %>%
mutate(name_at_birth = str_trim(name_at_birth,side="both")) %>%
distinct() %>%
mutate(name_at_birth = str_to_title(name_at_birth)) %>%
mutate(name_at_birth = str_remove(name_at_birth,"First Name: "))
```
#### Standardize "Sex" Column
We can use one another string function, `str_sub()` to standardize the "sex" column. This function in the block below says, "Starting at the left side of the sex column, keep everything from the first character on the left to...the first character on the left and Get rid of everything else."
```{r}
# The slightly more complex way, to introduce you to string functions
babynames_clean <- babynames_dirty %>%
remove_empty() %>%
mutate(name_at_birth = str_trim(name_at_birth,side="both")) %>%
distinct() %>%
mutate(name_at_birth = str_to_title(name_at_birth)) %>%
mutate(name_at_birth = str_remove(name_at_birth,"First Name: ")) %>%
mutate(sex = str_sub(sex, start=1L, end=1L))
```
#### Fix 2015 column
The 2015 column is different from the other columns, with an underscore at the end. If we want to do math to that column, we'll have to fix it. Here, again, we can use `str_remove()`.
```{r}
babynames_clean <- babynames_dirty %>%
remove_empty() %>%
mutate(name_at_birth = str_trim(name_at_birth,side="both")) %>%
distinct() %>%
mutate(name_at_birth = str_to_title(name_at_birth)) %>%
mutate(name_at_birth = str_remove(name_at_birth,"First Name: ")) %>%
mutate(sex = str_sub(sex, start=1L, end=1L)) %>%
mutate(2015 = str_remove(2015,"_"))
```
But we get an error! Why? Because R hates it when you start column names with numbers. You could wrap it in tick marks when you use it, like `2015`, but that's pretty annoying.
Instead, we can use the helpful `clean_names()` function from the `janitor` package to standardize it. If our column names had spaces, or capital letters, it would fix that too.
```{r}
babynames_clean <- babynames_dirty %>%
remove_empty() %>%
mutate(name_at_birth = str_trim(name_at_birth,side="both")) %>%
distinct() %>%
mutate(name_at_birth = str_to_title(name_at_birth)) %>%
mutate(name_at_birth = str_remove(name_at_birth,"First Name: ")) %>%
mutate(sex = str_sub(sex, start=1L, end=1L)) %>%
clean_names()
```
It puts an x at the start of the name, so now we can get to cleaning 2015.
```{r}
babynames_clean <- babynames_dirty %>%
remove_empty() %>%
mutate(name_at_birth = str_trim(name_at_birth,side="both")) %>%
distinct() %>%
mutate(name_at_birth = str_to_title(name_at_birth)) %>%
mutate(name_at_birth = str_remove(name_at_birth,"First Name: ")) %>%
mutate(sex = str_sub(sex, start=1L, end=1L)) %>%
clean_names() %>%
mutate(x2015 = str_remove(x2015,"_"))
```
Okay, it looks like a proper number now. But if we look under the hood, at the data type of each column, we'll see it hasn't been fixed. We can do that with `glimpse()`
```{r}
glimpse(babynames_clean)
```
x2015 is still a "character". If we try to do math to it, it will error. So let's change the datatype to a number using `as.integer()`.
```{r}
babynames_clean <- babynames_dirty %>%
remove_empty() %>%
mutate(name_at_birth = str_trim(name_at_birth,side="both")) %>%
distinct() %>%
mutate(name_at_birth = str_to_title(name_at_birth)) %>%
mutate(name_at_birth = str_remove(name_at_birth,"First Name: ")) %>%
mutate(sex = str_sub(sex, start=1L, end=1L)) %>%
clean_names() %>%
mutate(x2015 = str_remove(x2015,"_")) %>%
mutate(x2015 = as.integer(x2015))
```
When we glimpse the data, it's been converted to the correct datatype.
```{r}
glimpse(babynames_clean)
```
#### Wrapping up
Thanks for joining. If you get stuck in the future and need some help, I'm at [email protected].
#### Bonus: Get data from a JSON API and convert to a dataframe.
We'll use the httr packages for this to fetch data from a JSON api, and jsonlite to conver it.
```{r}
#install.packages('httr')
#install.packages('jsonlite')
####
## Load libraries
####
library(httr)
library(jsonlite)
####
## Get a dataframe of people currently in space
####
# List of people currently in space, from NASA, from this example json API. Not sure how accurate this is, so, you know, don't rely on this for a story.
# http://api.open-notify.org/astros.json
# Send get request to website
# view it
people_in_space <- GET("http://api.open-notify.org/astros.json")
# Extract just the content
# view it
people_in_space <- people_in_space$content
# Convert it to characters
# view it
people_in_space <- rawToChar(people_in_space)
# Convert the json to a list
# view it
people_in_space <- fromJSON(people_in_space)
# Extract the "people" element from the list and it's a nice tidy dataframe
# view it
people_in_space <- people_in_space[["people"]]
# display it below
people_in_space
```