2022_NICAR_R_Training.Rmd

---
title: "Learning R"
author: "Lucia Walinchus"
date: "6/23/2020"
output: html_document
---

#Welcome to learning R at NICAR 2022! 

Overview of the hands-on course: 

*Session 1: 
Jump into data analysis with R, the powerful open-source programming language. In this class we’ll cover R fundamentals and learn our way around the RStudio interface for using R.

This session is good for: People with a basic understanding of code who are ready to go beyond Excel.

*Session 2: 
We'll use the tidyverse packages dplyr and ggplot2, learning how to sort, filter, group, summarize, join, and visualize to identify trends in your data. If you want to combine SQL-like analysis and charting in a single pipeline, this session is for you.

This session is good for: People who have worked with data operations in SQL or Excel and would like to do the same in R.

*Session 3: 
Learn how to use R to scrape data from web pages, access APIs and transform the results into usable data. This session will also focus on how to clean and structure the data you've gathered in preparation for analysis using tidyverse packages.

This session is good for: People who have used R and have a basic understanding of how to retrieve data from APIs.

###First up: WTH am I here?

What R makes easier: 
*Collaboration: 
Eye on Ohio Police Stops: https://eyeonohio.com/investigation-blacks-black-neighborhoods-most-likely-to-be-traffic-stop-targets-in-ohios-3-biggest-cities/
ProPublica: https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-methodology

*You can save your script: 
https://www.legislature.ohio.gov/schedules/legislative-calendar

*Very cool packages: 
-Census Data
-Mapping Data
-Automatically updating data
-Modeling Data
-Dates Are Easier!
-And More

####Play around with Math
```{r}
 5 + 7
5 ^7
5/7
mean(5:7)
```


####Installing a package

```{r}
install.packages("lubridate")
library(lubridate)
```

Okay let's take this puppy for a spin! 
###Creating a Variable

Format: 

Yo.Variable <- equation or constant

```{r}
dob <- "1999-12-31"
today()
today()-as_date(dob)


#If you want to get fancy

interval_period = interval(dob, today())
full_year = interval_period %/% years(1)
#Note: %/% is integer division. 5 %/% 2 is 2.
remaining_weeks = interval_period %% years(1) %/% weeks(1)
remaining_days = interval_period %% years(1) %% weeks(1) %/% days(1)
sprintf('Your age is %d years, %d weeks and %d days', full_year, remaining_weeks, remaining_days)

```

Loading all the other packages we need:


```{r}
library(dplyr)
library(ggplot2)
library(rio)
library(zip)
library(purrr)
library(tidyverse)
library(fs)
library(DT)

```

###Setting the working directory
Can also do Session > Set Working Directory
```{r}
#setwd(Desktop/hands_on_classes/20220303_r_1_intro_to_r_and_rstudio_2073)
```


####Importing a File

Format: 

import(your file or the URL)

Can also specify rio::import

```{r}
Firefighters <- import("https://www.iaff.org/wp-content/uploads/Copy-of-FFFM20-21_470-names.xlsx")
```

###Looking through documentation
Check this out too 
https://cran.r-project.org/web/packages/rio/vignettes/rio.html

How would you get this to work? 
https://www.nrscotland.gov.uk/files/statistics/babies-names/20/babies-first-names-20-data.xlsx
```{r}
?read_excel
```


## Baby Names


Bringing in the government's list of baby names from the [Social Security Administration](https://www.ssa.gov/oact/babynames/limits.html)
NOTE: This is only the most popular 1,000 names per year. 
```{r}
setwd("~Desktop/hands_on_classes/20220303_r_1_intro_to_r_and_rstudio_2073")
#download.file("https://www.ssa.gov/oact/babynames/names.zip", dest="dataset.zip", mode="wb") 
#unzip ("dataset.zip", exdir = "./")
```
Putting all those files together and making it look nice 


```{r}

setwd("~Desktop/hands_on_classes/20220303_r_1_intro_to_r_and_rstudio_2073")

#list_of_files <- list.files(path = ".", recursive = TRUE,
                            #pattern = "\\.txt$", 
                           # full.names = TRUE)

#df <- list_of_files %>%
  #purrr::set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  #purrr::map_df(read_csv, 
                #col_names = FALSE,
                #.id = "FileName",
                #col_types = cols(X2 = col_character()))


#US_Names <- df %>%
  #mutate(year = str_sub(FileName,4,8)) %>% 
  #select(-FileName) %>% 
  #mutate(name = X1, gender = X2, number = X3) %>% 
  #select(-X1, -X2, -X3) %>% 
  #distinct()

#US_Names$year <- as.numeric(US_Names$year)

```


New! Babynames package: 

```{r}
install.packages("babynames")
library(babynames)
babynames <- babynames
```


Making a summary of your data: 


```{r}
summary(babynames)
```
Looking at the structure

```{r}
str(babynames)
```


What is every name total?

```{r}
unique(babynames$name)
```


How many different names are there total?
Note this counts names of different sexes as two, and it doesn't count names fewer than 5. 

```{r}
length(unique(babynames$name))
```


####Using DPLYR verbs 

```{r}
head(babynames)
```


```{r}
babynames %>% select(-prop) %>%  head()
```

```{r}
babynames %>% filter(name=="Mary") %>% head()
```

```{r}
datatable(babynames %>% filter(name=="Mary"))
```


```{r}
babynames %>%
 filter(name=="Mary") %>%
  summarize(sum(n))
```

```{r}
babynames <- babynames %>% 
  rename(number=n)

head(babynames)
```

```{r}
Babynames_By_Name <- babynames %>% 
  group_by(name) %>% 
  summarize(Total=sum(number))
  
```

```{r}
Babynames_By_Name_and_Year <- babynames %>% 
  group_by(year, name) %>% 
  summarize(Total=sum(number))
  
```

```{r}
Babynames_By_Name_and_Year <- babynames %>% 
  group_by(year, name) %>% 
  summarize(Total=sum(number)) %>% 
  arrange(desc(Total))
  
```

```{r}
Top_Names_Your_Year <- babynames %>% 
  filter(year==1985) %>% 
  group_by(name) %>% 
  summarize(Total=sum(number)) %>% 
  mutate(Top=data.table::frankv(Total, order=-1, ties.method ="first"))
```

frankv stands for "fast rank vector"


Bonus! Save your name


```{r}
Marys <- babynames %>% 
  filter(name=="Mary" & sex == "F")

ggplot(Marys, aes(year, number))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 45, vjust =1.2, hjust = 1.1))+
  labs(title = "Children Named 'Mary' in the United States per Year", subtitle = "Note: Data only includes top 1,000 names per year.")

ggsave("Mary.jpg", plot=last_plot())
```


Now for the state data set. 
```{r}
#download.file("https://www.ssa.gov/oact/babynames/state/namesbystate.zip", destfile = "state_names_dataset.zip", mode="wb") 
#unzip ("state_names_dataset.zip", exdir = "State_Names/")
```


```{r}
#setwd("~/Code/Red/BabyNames/State_Names")

#list_of_files <- list.files(path = ".", recursive = TRUE,
                            #pattern = "{6}",
                            #pattern = "\\.txt$",
                            #full.names = TRUE)

#list_of_files <-list_of_files[- 43]

#df <- list_of_files %>%
  #purrr::set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  #purrr::map_df(read_csv, 
                #col_names = FALSE,
               # .id = "FileName",
               # col_types = cols(
  #X1 = col_character(),
  #X2 = col_character(),
  #X3 = col_double(),
  #X4 = col_character(),
  #X5 = col_double()
#))
```


Looking just at *Ohio*
```{r}
OH_Names <- df %>%
  select(-FileName) %>% 
  mutate(state=X1, gender = X2, year = X3, name = X4, number = X5) %>% 
  select(-X1,-X2,-X3,-X4,-X5) %>% 
  filter(state=="OH") %>% 
  distinct()

```

Most popular in Ohio

```{r}
Most_Popular_OH <- OH_Names %>% 
  group_by(year, name, number) %>% 
  arrange(desc(number))


datatable(Most_Popular_OH)
```


Lots

```{r}
ManyNames <- US_Names %>% 
  filter(name=="Edith" |
           name=="Rose"|
           name=="Gina"|
           name=="Tara"|
           name=="Kaitlyn"|
           name=="Tanisha"|
           name=="Maria"|
           name=="Yvonne"|
           name=="Lois"& gender =="F")

ggplot(ManyNames, aes(year, number, fill=name))+
  geom_bar(stat = "identity")+
  facet_wrap(~name)+
  theme(axis.text.x = element_text(angle = 45, vjust =1.2, hjust = 1.1))+
  labs(title = "US Baby Names",subtitle = "Sample of 9 popular baby names in the US per year")

```


R Intro:
https://github.com/rtburg/NICAR2020-Intro-to-R/blob/master/Intro_to_R_and_RStudio.Rmd