---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(collapse=TRUE, comment="##", fig.retina=2, fig.path = "README_figs/README-")
```
## 'fedregs': Text Analysis of the US Code of Federal Regulations
[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/0.1.0/active.svg)](http://www.repostatus.org/#active)
[![codecov](https://codecov.io/gh/NOAA-EDAB/fedregs/branch/master/graph/badge.svg)](https://codecov.io/gh/NOAA-EDAB/fedregs)
[![Travis-CI Build Status](https://travis-ci.org/NOAA-EDAB/fedregs.svg?branch=master)](https://travis-ci.org/NOAA-EDAB/fedregs)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/fedregs)](https://cran.r-project.org/package=fedregs)
![downloads](http://cranlogs.r-pkg.org/badges/grand-total/fedregs)
The goal of `fedregs` is to allow for easy exploration and analysis of the [Code of Federal Regulations](https://www.gpo.gov/fdsys/browse/collectionCfr.action?selectedYearFrom=2017&go=Go) (CFR).
## Installation
You can install `fedregs` using:
```{r gh-installation, eval = FALSE}
install.packages("fedregs")
# Or: devtools::install_github("NOAA-EDAB/fedregs")
```
## Example
The [Code of Federal Regulations](https://www.gpo.gov/help/index.html#about_code_of_federal_regulations.htm) is organized according to a consistent hierarchy: title, chapter, part, subpart, section, and subsection. Each title within the CFR is (somewhat haphazardly) divided into volumes, and a given chapter does not always land in the same volume from year to year. The `cfr_text()` function is the main function in the package; it returns the text for a specified part, including the associated subparts and sections. Behind the scenes, `cfr_text()` and its helper functions gather the volumes for a given title/year combination and parse the XML to determine the chapters, parts, and subparts associated with each volume. Next, the text is extracted for each subpart. Setting `return_tidytext = TRUE` returns a tibble with the text in a [tidytext](https://www.tidytextmining.com/tidytext.html) format. If *ngrams* are your game, set `token = "ngrams"` and specify `n`.
```{r get_regs, echo = TRUE, message = FALSE, warning=FALSE}
library(fedregs)
library(dplyr)
library(tidyr)
library(ggplot2)
library(quanteda)
regs <- cfr_text(year = 2023,
                 title_number = 50,
                 chapter = 6,
                 part = 648,
                 #token = "ngrams", # uncomment for ngrams of length 2
                 #n = 2, # uncomment for ngrams of length 2
                 return_tidytext = TRUE,
                 verbose = FALSE)
head(regs)
```
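
To make the `token = "ngrams"` option more concrete, here is a minimal base R sketch of what word bigrams look like for a sample phrase. The `make_ngrams()` helper is hypothetical and written only for illustration. It is not part of `fedregs`, and the tokenizer used behind `cfr_text()` may treat case and punctuation differently.

```r
# Hypothetical helper (not part of fedregs): build word n-grams from a string.
make_ngrams <- function(text, n = 2) {
  # Lowercase and split on whitespace; real tokenizers also handle punctuation.
  words <- strsplit(tolower(text), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) {
    return(character(0))
  }
  # Slide a window of width n across the words and paste each window together.
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("Fisheries of the Northeastern United States", n = 2)
print(bigrams)
```

A six-word phrase yields five overlapping bigrams, so n-gram output is usually about as long as the word-level output but with more distinct values.
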
Now we can unnest the tibble and take a peek at the words we have to play with.
```{r peek_data, echo = TRUE}
regs %>%
  unnest(cols = c(data)) %>%
  head(20) %>%
  pull(word)
```
Not entirely unexpected, but there are quite a few common words that don't mean anything. These "stop words" typically don't carry much significance and are often filtered out of search queries.
```{r stop_words, echo = TRUE}
head(stopwords("english"))
```
There are some other messes like punctuation, numbers, *i*ths, Roman numerals, web sites, and random letters (probably from indexed lists) that can be removed with some simple regex-ing. We could also convert the raw words to word stems to further aggregate our data, although that step is not done here.
```{r cleaning_words, echo = TRUE, warning = FALSE, message=FALSE}
stop_words <- tibble(word = stopwords("english"))
clean_words <- regs %>%
  unnest(cols = c(data)) %>%
  mutate(word = gsub("[[:punct:]]", "", word), # remove any remaining punctuation
         word = gsub("^[[:digit:]]*", "", word)) %>% # strip leading digits (e.g., 1st, 1881a, 15th become st, a, th)
  anti_join(stop_words, by = "word") %>% # remove "stop words"
  filter(is.na(as.numeric(word)), # drop anything that still parses as a number
         !grepl("^m{0,4}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
                word), # adios Roman numerals (and empty strings)
         !grepl("\\b[a-z]{1}\\b", word), # get rid of one letter words
         !grepl("^www", word)) # get rid of web addresses (their dots were stripped above)
head(clean_words)
```
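
To see what each cleaning step does, close variants of these patterns can be run on a small hand-made vector using only base R. The tokens and the three-word stop list below are made up for illustration and do not come from the CFR or from `stopwords("english")`.

```r
# Toy tokens (made up for illustration, not taken from the CFR).
tokens <- c("vessel", "15th", "iv", "xii", "a", "(b)", "648.2",
            "permit", "www.example.gov", "the")

# Drop punctuation, then strip leading digits.
cleaned <- gsub("[[:punct:]]", "", tokens)
cleaned <- gsub("^[[:digit:]]*", "", cleaned)

roman <- "^m{0,4}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"

is_roman_or_empty <- grepl(roman, cleaned)               # the pattern also matches ""
is_single_letter  <- grepl("\\b[a-z]{1}\\b", cleaned)
is_web_address    <- grepl("^www", cleaned)
is_stop_word      <- cleaned %in% c("the", "of", "and")  # tiny stand-in stop list

kept <- cleaned[!(is_roman_or_empty | is_single_letter |
                    is_web_address | is_stop_word)]
print(kept)
```

Only `"vessel"`, `"th"`, and `"permit"` survive. Two side effects are worth knowing about: stripping leading digits leaves fragments such as `"th"` behind, and the Roman numeral pattern also removes ordinary words that happen to be valid numerals, such as "mix" (1009).
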
Now we can count how often each word appears and plot the most common ones.
```{r count_words}
count_words <- clean_words %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(-n) %>%
  top_n(n = 50, wt = n) %>%
  mutate(word = reorder(word, n))
```
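
The counting step is a group-and-tally. As a rough base R equivalent on a made-up vector (not CFR data), `table()` does the binning and `sort()` does the ordering:

```r
# Made-up words for illustration only.
words <- c("vessel", "permit", "vessel", "fishing", "permit", "vessel")

# table() bins identical words; sort() puts the most frequent first.
word_counts <- sort(table(words), decreasing = TRUE)

# Convert to a data frame with the same column names used in the pipeline.
count_df <- data.frame(word = names(word_counts),
                       n = as.integer(word_counts),
                       stringsAsFactors = FALSE)
print(count_df)
```

The result has one row per distinct word, with `vessel` (3) first, then `permit` (2) and `fishing` (1).
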
```{r plot_words, fig.width=10, fig.height=8}
ggplot(count_words, aes(word, n)) +
  geom_col() +
  labs(x = NULL, # labs() takes `x`, not `xlab`, to drop the axis title
       title = "Code of Federal Regulations",
       subtitle = "Title 50, Chapter VI, Part 648",
       caption = sprintf("Data accessed on %s from:\n https://www.gpo.gov/fdsys/browse/collectionCfr.action?collectionCode=CFR",
                         format(Sys.Date(), "%d %B %Y"))) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(size = 8),
        legend.direction = "horizontal",
        legend.position = "bottom") +
  coord_flip() +
  NULL
```
**This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.**