---
title: "Protocol Gift"
author: "Izzie Toren ([email protected])"
date: "2016-12-04"
output: html_document
---
```{r setup, include=FALSE, echo=FALSE, message=FALSE}
require(knitr)
opts_chunk$set(echo = FALSE)
display_n <- 5
display_col_max_width <- 200
## Convenience function to help display the very non-standard characters in the text
char_cleanup <- function(z, encoding_from = 'UTF-8', encoding_to = 'UTF-8') {
require(tidyverse)
z %>%
iconv(from = encoding_from, to = encoding_to) %>%
gsub(pattern = '[^[:graph:]]', replacement = ' ') %>%
gsub(pattern = '\\0', replacement = '') %>%
trimws() %>%
return()
}
## Convenience function to show tables properly with kable
kable2 <- function(z, n = display_n) {
require(tidyverse)
z %>%
apply(MARGIN = 2, FUN = char_cleanup) %>%
tbl_df() %>%
head(n) %>%
kable()
}
```
## Background: The Protocol Gift Unit
Quoting from the [U.S. Department of State website](http://www.state.gov/s/cpr/c29447.htm):
> The Protocol Gift Unit within the Office of the Chief of Protocol serves as the central processing point for all tangible gifts received from foreign sources by employees of the Executive Branch of the Federal government. The Unit is responsible for the creation and maintenance of the official record of all gifts presented by the Department of State to officials of foreign governments. Working closely with the Chief of Protocol and the staffs of the President, the Vice President, and the Secretary of State, the Gift Unit selects the gifts presented to foreign dignitaries. Gifts received by the President, Vice President, and the Secretary of State and their spouses from foreign governments are also handled by the Gift Unit in the Office of Protocol.
The objective of this paper is to mine the data provided by the PGU (Protocol Gift Unit) in PDF format to build a structured dataset that can be analysed systematically using "tidy" tools.
## Building the dataset
### Step 1: Extract the links to the PDF files (one per year)
After exploring some alternatives, I decided to use the [_rvest_](https://cran.r-project.org/web/packages/rvest/index.html) package to read and parse the simple HTML page (thanks, USDOS webmaster!) that contains the links to the PDF files on the PGU website. A quick analysis of the page shows that regular expressions are enough to extract the links into a table:
```{r extract_links, echo=TRUE, message=FALSE}
require(tidyverse)
require(rvest)
pdf_pattern <- 'http://www[.]state[.]gov/documents/organization/[0-9]{6}[.]pdf'
## Extract the first regex match from a string; return NA when there is no match
regex_extract <- function(y, pattern, ...) {
res <- regmatches(y, regexpr(pattern, y, ...))
if(length(res) == 0) {return(NA)} else {return(res)}
}
## Extract PDF links
read_html('http://www.state.gov/s/cpr/c29447.htm') %>%
html_node(xpath = "//table") %>%
html_nodes(xpath = '//a') %>%
## grab only links to PDF files for which the label is a year
grep(pattern = '[.]pdf\"><b>20[0-9]{2}</b>', value=TRUE) %>%
tbl_df() %>%
## strip down the surroundings
mutate(
pdf_link = map_chr(value, regex_extract, pdf_pattern),
pdf_year = gsub('</?b>', '', map_chr(value, regex_extract, '<b>[0-9]{4}</b>'))
) %>%
select(-value) -> link_table
link_table
```
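As a quick sanity check, `regex_extract` can be exercised on made-up inputs (the anchor tag below is fabricated for illustration; the pattern is the real one from above):
```{r regex_extract_demo, echo=TRUE, message=FALSE}
## A fabricated link in the shape the page uses
regex_extract('<a href="http://www.state.gov/documents/organization/123456.pdf"><b>2015</b></a>', pdf_pattern)
regex_extract('no link in this string', pdf_pattern) # no match, so NA
```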
### Step 2: Extract text from the PDFs
Each PDF contains a multi-page table, so a text-extraction effort was necessary. I tried several packages, and `pdftools` proved to be the most effective. Unfortunately, I was only able to extract raw unstructured text (other packages such as `tabulizer`, and an image-extraction & OCR approach, did not work well), so I ended up with one string per printed line. I then removed the various page headers/footers to keep only the table text. In addition, these PDFs contain tables from multiple agencies, so I selected only the pages that describe presidential gifts. The title of these tables changes over time, so I used a rather long regex to pick out the right ones (this can probably be done more elegantly).
```{r parse_pdf,echo=TRUE, message=FALSE}
agency_select_pattern <-
'(PRESIDENT OF THE U.S. AND THE NATIONAL SECURITY COUNCIL|WHITE HOUSE OFFICE AND THE NATIONAL SECURITY COUNCIL|THE WHITE HOUSE.{1}EXECUTIVE OFFICE OF THE PRESIDENT|EXECUTIVE OFFICE OF THE PRESIDENT)'
get_pdf_table_text = function(
pdf_link,
page_start_pattern = 'Federal Register / Vol',
page_end_pattern = '[ ]+VerDate',
table_pattern = '^[ ]+AGENCY:',
redundent_pattern1 = '(mstockstill|[A-Za-z]+ on [A-Z0-9]*PROD[A-Z0-9]* with (NOTICES|MISCELLANEOUS)|VerDate)',
agency_select_pattern = '.*'
) {
require(pdftools)
require(tidyverse)
x <- pdf_text(pdf = pdf_link) %>%
strsplit(split = '\r\n') %>%
unlist() %>%
setNames('value') %>%
tbl_df() %>%
## Convert to UTF8 for display compatibility
#mutate(value_ascii = iconv(value, from = 'UTF-8', to = 'latin1')) %>%
## Find the start of the page
mutate(
page = grepl(pattern = page_start_pattern, x = value),
page_end = grepl(pattern = page_end_pattern, x = value),
table_start = grepl(pattern = table_pattern , x = value)
) %>%
mutate(
page = cumsum(page),
table_start_sub_header = lag(table_start)) %>%
## Number the tables in each page
mutate(table_number = unlist(tapply(table_start/1000, page, cumsum))) %>%
## Remove text from cover page
filter(table_number > 0) %>%
mutate(table_number = page + table_number) %>%
## name the tables (for filtering)
group_by(table_number) %>%
mutate(table_name = head(value,1)) %>%
ungroup() %>%
## Cleanup 1: get rid of page headers/footers and other redundant lines
filter(
!(table_start | table_start_sub_header | page_end) &
!grepl(pattern = redundent_pattern1, x = value)
) %>%
## Cleanup 2: filter only the required types of tables
filter(grepl(pattern = agency_select_pattern, x = table_name)) %>%
## Cleanup 3: remove special characters
#mutate(table_name = trimws(gsub(pattern = '([ ]{2,}|^[::ascii::])', replacement = ' ', x = table_name))) %>%
## Cleanup 4: drop the helper columns used for parsing
select(-page_end, -table_start, -table_start_sub_header) %>%
return()
}
link_table$pdf_link[2] %>% get_pdf_table_text(agency_select_pattern = agency_select_pattern) -> x1
x1 %>% head(5)
```
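The heavy lifting in the first pipe step is done by `pdf_text()`, which returns one long string per page; these documents use CRLF line breaks, so splitting on `\r\n` yields one element per physical line. A minimal sketch with fake "pages" standing in for the `pdf_text()` output:
```{r pdf_line_split_demo, echo=TRUE, message=FALSE}
## Made-up page strings, only to illustrate the split step above
fake_pages <- c('   AGENCY: EXAMPLE\r\nrow 1 of the table', 'row 2 of the table\r\n   VerDate footer')
fake_pages %>% strsplit(split = '\r\n') %>% unlist()
```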
### Step 3: Transform the raw text into table columns
The lines are not uniform in length, probably because not all the table columns are filled and no trailing spaces are emitted after the last character of a line.
```{r test_value_length, echo=TRUE, message=FALSE}
x1 %>% ggplot(aes(x = nchar(value)-50)) + geom_histogram(fill = 'grey') + theme_classic() + xlab('Text length')
```
I thought of splitting on white space (catching every white-space cluster surrounded by other characters), but since there are many white-space clusters in the text, we first need to identify which ones are separators between columns and which are just spaces within the text (an artifact of the PDF text extraction). Below is a map of the white space within each line captured from the PDF:
```{r white_space_map, echo=TRUE, message=FALSE}
code_row_pattern <- function(z, window) {
## Capture the pattern and the "anti pattern"
y_yes <- gregexpr(z, pattern = '[\\s]{1,}', perl = TRUE)
y_no <- gregexpr(z, pattern = '[^\\s]{1,}', perl = TRUE)
## If there's a match, build the vector
if (unlist(y_yes)[1] != -1) {
lengths_yes <- attr(y_yes[[1]], 'match.length')
lengths_no <- attr(y_no[[1]], 'match.length')
n <- length(lengths_no) + length(lengths_yes)
length_vec <- rep(NA, n)
rep_vec <- rep(NA, n)
odds <- 1:n %% 2 == as.numeric(unlist(y_yes)[1] == 1)
length_vec[odds] <- lengths_yes
length_vec[!odds] <- lengths_no
rep_vec[odds] <- lengths_yes
rep_vec[!odds] <- 0
return(rep(rep_vec, length_vec)[1:window])
} else {
return(NULL)
}
}
n_chars <- 170
x1 %>%
#head(10) %>%
mutate(
ws_coded = map(substring(value, 50), code_row_pattern, window = n_chars),
rownum = 1:n()
) %>%
select(ws_coded, rownum) %>%
unnest(ws_coded) %>%
mutate(position = 0:(n()-1) %% n_chars) %>%
ggplot(aes(position, rownum)) +
geom_tile(aes(fill = ifelse(ws_coded > 0, 'white space', 'text'))) +
theme(legend.title=element_blank(), legend.position="bottom") +
labs(title = "White Space Map", x = "Position", y = "Row")
```
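As an aside, the `gregexpr` bookkeeping inside `code_row_pattern` is easiest to see on a toy string: the match object holds the start of each white-space run, and its `match.length` attribute holds the run lengths.
```{r gregexpr_demo, echo=TRUE, message=FALSE}
y <- gregexpr(pattern = '[\\s]{1,}', '  abc  de', perl = TRUE)
unlist(y)                      # start position of each white-space run: 1, 6
attr(y[[1]], 'match.length')   # length of each run: 2, 2
```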
The structure of the extracted text looks relatively stable, except for a single block of text (somewhere after row 700) that has "shifted" to the right while keeping a similar internal structure. A fixed-width approach is therefore impossible (we should expect the same to happen in other files), but relying on white space alone will be tricky too: on the one hand, there are too many white-space blocks within the text itself; on the other, the gaps between columns are not uniform in size (some columns sit closer together than some gaps inside the text). The upshot is that, without line-by-line analysis, it is very difficult to tell which white-space block is a column separator and which is just an artifact of the extraction.
My approach was therefore to use dynamic delimitation. The main assumption is that the column headers are good indicators of where each column begins (not directly, though: the header text is centered, so its position does not coincide exactly with the start of the column).
```{r column_start_position, echo=TRUE, message=FALSE}
## The text that marks the beginning of each column
column_patterns <- c('Name and title of person', 'Gift, date of acceptance', 'Identity of foreign', 'Circumstances justifying')
## Where do the rows start (the headers start on different rows)
#column_row_offset <- c(1, 0, 2, 2)
## A function to use the location of the table header to split a row in the same page
value_row_split <- function(x, y, n = 1, offset = c(0, -7, -5, -8, -6), split_chr =';') {
p <- as.numeric(strsplit(y, split = split_chr)[[1]][(n-1):n])
if(n == 1) {p <- c(1,p)}
p[is.na(p)] <- 1000000L
## If there's no match returns NA
if(p[2] == -1) {return(NA)}
o <- offset[n:(n+1)]
o[is.na(o)] <- 0
return(substring(x, first = p[1] + o[1] , last = p[2] + o[2] - 1))
}
## A function to process an entire DF
split_row_text4 <- function(y, column_patterns) {
y %>%
## extract a string of column start position based on the text
mutate(column_starts = map(value, function(z) {map(column_patterns, regexpr, text = z)})) %>%
## Convert list into a single char with ';' separators
mutate(column_starts = map_chr(column_starts, paste, collapse = ';')) %>%
## Separate into 4 columns
separate(col = column_starts, into = paste0('Start',1:4), sep = ';', remove = TRUE, convert = TRUE) %>%
## use a single locator for each page (there should be one positive number per page, the rest are -1's due to the regexp output)
group_by(page) %>%
mutate(Start1 = max(Start1), Start2 = max(Start2), Start3 = max(Start3), Start4 = max(Start4)) %>%
ungroup() %>%
## Paste together again for easier digestion of "map2", and remove columns
mutate(column_starts = paste(Start1, Start2, Start3, Start4, sep = ';')) %>%
select(-Start1, -Start2, -Start3, -Start4) %>%
## Use the new string to create 4 columns
mutate(
Receiver = map2_chr(value, column_starts, value_row_split, n=2),
Gift_details = map2_chr(value, column_starts, value_row_split, n=3),
From = map2_chr(value, column_starts, value_row_split, n=4),
Justification = map2_chr(value, column_starts, value_row_split, n=5)
) %>%
## Cleanup: drop the columns used for processing
select(-column_starts) %>%
return()
}
x1 %>% split_row_text4(column_patterns = column_patterns) -> x2
# write.table(x2, file = 'clipboard-4096', sep = '\t', row.names = FALSE)
x2 %>% select(Receiver, Gift_details, From, Justification) %>% head(10)
```
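To see how the header text anchors the columns, here is a toy check (the header line is fabricated, but mimics the layout) of `regexpr` locating the start positions that `value_row_split` later consumes:
```{r column_locate_demo, echo=TRUE, message=FALSE}
fake_header <- paste0('   Name and title of person      Gift, date of acceptance',
                      '    Identity of foreign     Circumstances justifying')
## One start position per column header; -1 would mean "not on this line"
sapply(column_patterns, regexpr, text = fake_header)
```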
Word wrapping is prevalent throughout the document, causing single gift entries to be split across multiple rows; luckily, the first line of each entry is offset by 2 characters to the left. I identify these rows, create a `row_id` column per page, and use it to concatenate the rows of each entry into one (not forgetting to trim white spaces first). I also removed the table header (the first 7 rows of each table) based on the beginning of the text.
```{r group_by_actual_row, echo=TRUE, message=FALSE}
## A function to group multiple rows into one by offset patterns
group_by_actual_row <- function(z, filter_pattern = '(Name and title of person accepting|the gift on behalf of the|^[ ]*$)') {
z %>%
## Identify left-indented rows in "Gift_details" and "Receiver"
mutate(
ident_Re = regexpr(pattern = '^[ ]+', text = Receiver) %>% attr('match.length'),
ident_GD = regexpr(pattern = '^[ ]+', text = Gift_details) %>% attr('match.length')
) %>%
## Create a lag / lead column for both
mutate(
ident_Re_lag = lag(x = ident_Re, n = 1, default = 0),
ident_Re_lead = lead(x = ident_Re, n = 1, default = 1e9),
ident_GD_lag = lag(x = ident_GD, n = 1, default = 0),
ident_GD_lead = lead(x = ident_GD, n = 1, default = 1e9)
) %>%
## An entry starts on a line that is less indented than both the previous and the next line
mutate(row_start = as.numeric(ifelse((ident_Re < ident_Re_lead & ident_Re < ident_Re_lag) | (ident_GD < ident_GD_lead & ident_GD < ident_GD_lag), 1, 0))) %>%
## Number each entry (per page) and collapse its lines into a single row
mutate(row_id = cumsum(row_start)/100 + page) %>%
group_by(row_id) %>%
# filter(!grepl(x = Receiver, pattern = filter_pattern)) %>%
summarise(
Receiver = paste(Receiver, collapse = ' '),
Gift_details = paste(Gift_details, collapse = ' '),
From = paste(From, collapse = ' '),
Justification = paste(Justification, collapse = ' '),
control_text = paste(value, collapse = ';')
# Year = as.numeric(link_table$pdf_year[1])
) %>%
## Remove the first line of each table - usually contains headers
filter(!grepl(pattern = filter_pattern, x = Receiver)) %>%
#select(-row_id) %>%
return()
}
x2 %>% group_by_actual_row() -> x4
x4 %>% head(5)
```
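The indentation trick deserves a standalone illustration. On a made-up wrapped column, the length of the leading white space, compared with its lag and lead, flags entry starts exactly as in the function above:
```{r row_start_demo, echo=TRUE, message=FALSE}
## Fabricated rows: entries start flush left, wrapped continuations are indented
tibble(Receiver = c('The Honorable A', '  (wrapped)', '  (wrapped)', 'The Honorable B')) %>%
  mutate(ind = attr(regexpr('^[ ]+', Receiver), 'match.length')) %>%
  mutate(row_start = as.numeric(ind < lead(ind, default = 1e9) & ind < lag(ind, default = 0))) %>%
  mutate(row_id = cumsum(row_start))
```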
For the final cleanup, we need to correct the text a bit: collapse multiple white spaces, strip the dot leaders, and rejoin words that were split across lines with hyphens:
```{r clean_spaces_commas, echo=TRUE, message=FALSE}
clean_col_text <- function (y) {
require(magrittr)
y %>%
gsub(pattern = '[.]{2,}', replacement = '') %>%
gsub(pattern = '([ ]+)', replacement = ' ') %>%
gsub(pattern = '- ', replacement = '') %>%
trimws() %>%
return()
}
x4 %>%
mutate(
row_id = as.numeric(row_id),
Receiver = clean_col_text(Receiver),
Gift_details = clean_col_text(Gift_details),
From = clean_col_text(From),
Justification = clean_col_text(Justification)
)-> x5
```
So we can compare the "before":
```{r clean_spaces_commas_before, echo=FALSE, message=FALSE}
print(x4$Gift_details[1])
```
and the "after":
```{r clean_spaces_commas_after, echo=FALSE, message=FALSE}
print(x5$Gift_details[1])
```
### Step 4: Extract more data from the gift description column
The `Gift_details` column seems to contain an internal structure of four parts (a made-up example follows the list):
1. A short description of the gift
2. Date of receipt
3. Value estimation (I assume it's always USD)
4. Disposition
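For illustration, here is a fabricated `Gift_details` value with this structure, probed with simplified stand-ins for the delimiters that `gd_patterns` below looks for (the real patterns handle many more spelling variants):
```{r gift_details_demo, echo=TRUE, message=FALSE}
## A made-up record; the em-dashes (\u2014) mirror the PDFs
gd <- paste0('Sterling silver tray. Rec\u2019d\u20141/20/2009. ',
             'Est. Value\u2014$350.00. Disposition\u2014National Archives.')
sapply(c('Rec\u2019d\u2014', 'Est[.] Value\u2014', 'Disposition\u2014'), regexpr, text = gd)
```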
```{r gift_detail_extract, echo=TRUE, message=FALSE}
gd_patterns <- c('([ ]|[.]|[.][ ]){0,1}(Re[\u2019]{0,1}c[\u2019]{0,1}d([\u2014]{0,}| )|Received[\u2014]|Rec[\u2014])', ' E(s|)(t|)[.]{0,1} Val(u|)(e|)([\u2014]{0,}| )', ' (Disposition|Location)([\u2014]{0,}| )')
## This function makes sure that the vector of match positions contains no -1's and is ordered sensibly: if there is no match, the distance between consecutive splits is 0.
monotone_gd <- function(y, len) {
for (i in length(y):1) {
if(y[i] == -1) {y[i] <- ifelse(i == length(y), len, y[i+1])}
}
return(y)
}
extract_gift_details <- function(z, patterns) {
z %>%
## Add a column containing the breakpoints.
mutate(gd_splits = Gift_details %>% map(function(z) {map(patterns, regexpr, text = z)})) %>%
## Make sure the gd_splits vector is monotone (capped at nchar(Gift_details))
mutate(gd_splits = map2(gd_splits, nchar(Gift_details), monotone_gd)) %>%
## concatenate as a string for the split function
mutate(gd_splits = gd_splits %>% map_chr(paste, collapse = ';')) %>%
## Use the text column "gd_splits" to break down "Gift_details"
mutate(
Gift = map2_chr(Gift_details, gd_splits, value_row_split, n=1, offset = NA),
Rec_Date = map2_chr(Gift_details, gd_splits, value_row_split, n=2, offset = NA),
Value = map2_chr(Gift_details, gd_splits, value_row_split, n=3, offset = NA),
Disposition = map2_chr(Gift_details, gd_splits, value_row_split, n=4, offset = NA)
) %>%
## Cleanup
#select(-Gift_details, -gd_splits) %>%
return()
}
x5 %>% extract_gift_details(patterns = gd_patterns) -> x6
x6 %>% head(5)
```
### Step 5: Final clean-up
1. Simple clean-up of `Gift` and `Disposition`:
```{r gift_detail_cleanup01, echo=TRUE, message=FALSE}
x6 %>%
mutate(
Gift = Gift %>% trimws(),
Disposition = Disposition %>% map(function(y) {substring(y, first = regexpr(pattern = '\u2014', text = y)[1]+1)}) %>% trimws()
) -> x6
```
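The em-dash split used above for `Disposition`, shown on a made-up value:
```{r disposition_demo, echo=TRUE, message=FALSE}
d <- 'Disposition\u2014National Archives.'
substring(d, first = regexpr(pattern = '\u2014', text = d)[1] + 1)
```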
2. Identify and extract multiple date patterns:
```{r gift_detail_cleanup02, echo=TRUE, message=FALSE}
short_month_patterns <- '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
long_month_patterns <- '(January|February|March|April|May|June|July|August|Sep[ ]{0,1}tember|October|November|December)'
us_yyyy_date_pattern <- '[0-9]{1,2}[ ]{0,1}/[ ]{0,1}[0-9]{1,2}[ ]{0,1}/[ ]{0,1}20[0-9]{2}'
us_yy_date_pattern <- '[0-9]{1,2}[ ]{0,1}/[ ]{0,1}[0-9]{1,2}[ ]{0,1}/[ ]{0,1}[0-9]{2}'
oracle_date_pattern <- paste0('[0-9]{1,2}[ ]{0,1}(-|\u2013)[ ]{0,1}', short_month_patterns, '[ ]{0,1}(-|\u2013)[ ]{0,1}[0-9]{2}')
long_date_pattern <- paste0(long_month_patterns,'[ ]{0,1}[0-9]{1,2}[,]{0,1}[ ]{0,1}20[0-9]{2}')
no_day_date_pattern <- paste0(long_month_patterns,' 20[0-9]{2}')
x6 %>%
mutate(
Rec_Date_US_yyyy = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, us_yyyy_date_pattern)), '%m/%d/%Y'),
Rec_Date_US_yy = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, us_yy_date_pattern)), '%m/%d/%y'),
Rec_Date_Oracle = as.Date(gsub('\u2013', '-', gsub(' ', '', map_chr(Rec_Date, regex_extract, oracle_date_pattern))), '%d-%b-%y'),
Rec_Date_Long = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, long_date_pattern)), '%B%d,%Y'),
Rec_Date_No_Day = as.Date(paste0('01 ', map_chr(Rec_Date, regex_extract, no_day_date_pattern)), '%d %B %Y')
) %>%
mutate(Gift_Rec_Date =
if_else(!is.na(Rec_Date_US_yyyy), Rec_Date_US_yyyy,
if_else(!is.na(Rec_Date_US_yy), Rec_Date_US_yy,
if_else(!is.na(Rec_Date_Oracle), Rec_Date_Oracle,
if_else(!is.na(Rec_Date_Long), Rec_Date_Long, Rec_Date_No_Day))))
) -> x6
```
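A quick sanity check of the format strings on made-up dates (note that parsing `%b`/`%B` month names assumes an English locale):
```{r date_format_demo, echo=TRUE, message=FALSE}
as.Date('1/20/2009', '%m/%d/%Y')       # US style, 4-digit year
as.Date('20-Jan-09', '%d-%b-%y')       # "Oracle" style
as.Date('January20,2009', '%B%d,%Y')   # long form, after spaces are stripped
as.Date('01 January 2009', '%d %B %Y') # no-day form, after prefixing "01 "
```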
3. Identify and extract USD values of gifts:
```{r gift_detail_cleanup03, echo=TRUE, message=FALSE}
x6 %>%
mutate(
Value_tmp1 = Value %>% map_chr(regex_extract, '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?'),
Value_tmp2 = Rec_Date %>% map_chr(regex_extract, '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?')
) %>%
mutate(Value_USD = gsub('[$]', '', if_else(!is.na(Value_tmp1), Value_tmp1, Value_tmp2))) -> x6
```
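The dollar-amount regex, applied via `regex_extract` from Step 1 to fabricated strings:
```{r value_regex_demo, echo=TRUE, message=FALSE}
vals <- c('Est. Value\u2014$ 1,234.56.', 'Value\u2014$350.', 'no value given')
sapply(vals, regex_extract, pattern = '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?')
```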
4. Filter out the remaining table-header rows and rows with missing details in the text (see the list of issues below for details), and remove the "technical" columns:
```{r gift_detail_cleanup04, echo=TRUE, message=FALSE}
x6 %>%
filter(!is.na(Gift_Rec_Date) & !is.na(Value_USD)) %>%
## Cleanup of non-informative columns
select(-Gift_details, -control_text, -gd_splits, -starts_with('Rec_Date'), -starts_with('Value_tmp'), -Value) -> x6
head(x6)
```
### Step 6: Rinse and repeat...
Running through the different PDF links, we can aggregate data from multiple years. The one issue I found was that the table names for presidential gifts have changed over the years, but the tidy approach keeps the pipeline wonderfully readable. You could actually pipe all the functions into one another in a single `mutate` statement, but I preferred to split the steps for easier debugging.
```{r full_gift_extract, echo=TRUE, message=FALSE}
link_table %>%
mutate(text_df = pdf_link %>% map(get_pdf_table_text, agency_select_pattern = agency_select_pattern)) %>%
## for testing only: readRDS('gifts01.RDS') %>%
mutate(text_df = text_df %>% map(split_row_text4, column_patterns = column_patterns)) %>%
## Filter some problems on 2003 pages 31,32
mutate(text_df = text_df %>% map(filter, !is.na(Receiver))) %>%
mutate(text_df = text_df %>% map(group_by_actual_row)) %>%
mutate(text_df = text_df %>% map(mutate,
row_id = as.numeric(row_id),
Receiver = clean_col_text(Receiver),
Gift_details = clean_col_text(Gift_details),
From = clean_col_text(From),
Justification = clean_col_text(Justification)
)) %>%
mutate(text_df = text_df %>% map(extract_gift_details, patterns = gd_patterns)) %>%
unnest(text_df) %>%
## simple clean-ups of Gift & Disposition
mutate(
Gift = Gift %>% trimws(),
Disposition = Disposition %>% map(function(y) {substring(y, first = regexpr(pattern = '\u2014', text = y)[1]+1)}) %>% trimws()
) %>%
## More complex clean-ups: extract dates from multiple formats
mutate(
Rec_Date_US_yyyy = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, us_yyyy_date_pattern)), '%m/%d/%Y'),
Rec_Date_US_yy = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, us_yy_date_pattern)), '%m/%d/%y'),
Rec_Date_Oracle = as.Date(gsub('\u2013', '-', gsub(' ', '', map_chr(Rec_Date, regex_extract, oracle_date_pattern))), '%d-%b-%y'),
Rec_Date_Long = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, long_date_pattern)), '%B%d,%Y'),
Rec_Date_No_Day = as.Date(paste0('01 ', map_chr(Rec_Date, regex_extract, no_day_date_pattern)), '%d %B %Y')
) %>%
mutate(Gift_Rec_Date =
if_else(!is.na(Rec_Date_US_yyyy), Rec_Date_US_yyyy,
if_else(!is.na(Rec_Date_US_yy), Rec_Date_US_yy,
if_else(!is.na(Rec_Date_Oracle), Rec_Date_Oracle,
if_else(!is.na(Rec_Date_Long), Rec_Date_Long, Rec_Date_No_Day))))
) %>%
## Extract USD values
mutate(
Value_tmp1 = Value %>% map_chr(regex_extract, '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?'),
Value_tmp2 = Rec_Date %>% map_chr(regex_extract, '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?')
) %>%
mutate(Value_USD = gsub('[$]', '', if_else(!is.na(Value_tmp1), Value_tmp1, Value_tmp2))) %>%
## Filter remaining headers + 3 gifts with some missing details in text (see list of issues for details)
filter(!is.na(Gift_Rec_Date) & !is.na(Value_USD)) %>%
## Cleanup of non-informative columns
select(-Gift_details, -control_text, -gd_splits, -starts_with('Rec_Date'), -starts_with('Value_tmp'), -Value) -> gifts
write_csv(gifts, paste0('protocol_gifts_',min(gifts$pdf_year), '_', max(gifts$pdf_year),'.csv'))
```
## Current Issues:
1. The `Justification` column does not split correctly for about 2.5% of rows (about 500 out of ~20k rows in the table prior to joining lines together). This affects the text of this column and of the `From` column, and means the text quality in these columns is not good enough for analysis.
2. In older files, the `Receiver` column is not filled in for every gift (the receiver is mentioned only once per page). This caused some gifts to be bundled together (I extracted details only for the first gift). You can explore these cases by looking for rows with a very long `Gift` or `Gift_details` value.
3. Exclusions:
    1. The last 2 pages of the 2003 PDF do not parse correctly. I excluded them with the `!is.na(Receiver)` filter.
    2. Problematic dates: 8 gifts were excluded from the analysis because their dates did not follow any recognizable pattern (e.g. 19/2-4/2005). You can remove/invert the `filter(!is.na(Gift_Rec_Date))` and look through the values to find them (the other filtered lines are table headers).
    3. No value: 1 gift ("round table made of hardened lava") from 17/01/2003 did not have a value estimate.