---
title: "Protocol Gift"
author: "Izzie Toren ([email protected])"
date: "2016-12-04"
output: html_document
---
```{r setup, include=FALSE, echo=FALSE, message=FALSE}
require(knitr)
opts_chunk$set(echo = FALSE)
display_n <- 5
display_col_max_width <- 200
## Convenience function to help display the very non-standard characters in the text
char_cleanup <- function(z, encoding_from = 'UTF-8', encoding_to = 'UTF-8') {
require(tidyverse)
z %>%
iconv(from = encoding_from, to = encoding_to) %>%
gsub(pattern = '[^[:graph:]]', replacement = ' ') %>%
gsub(pattern = '\\0', replacement = '') %>%
trimws() %>%
return()
}
## Convenience function to show tables properly with kable
kable2 <- function(z, n = display_n) {
require(tidyverse)
z %>%
apply(MARGIN = 2, FUN = char_cleanup) %>%
tbl_df() %>%
head(n) %>%
kable()
}
```
## Background: The Protocol Gift Unit
Quoting from the [U.S. Department of State website](http://www.state.gov/s/cpr/c29447.htm):
> The Protocol Gift Unit within the Office of the Chief of Protocol serves as the central processing point for all tangible gifts received from foreign sources by employees of the Executive Branch of the Federal government. The Unit is responsible for the creation and maintenance of the official record of all gifts presented by the Department of State to officials of foreign governments. Working closely with the Chief of Protocol and the staffs of the President, the Vice President, and the Secretary of State, the Gift Unit selects the gifts presented to foreign dignitaries. Gifts received by the President, Vice President, and the Secretary of State and their spouses from foreign governments are also handled by the Gift Unit in the Office of Protocol.
The objective of this paper is to mine the data provided by the PGU (Protocol Gift Unit) in PDF format to build a structured dataset that can be analysed systematically using "tidy" tools.
## Building the dataset
### Step 1: Extract the links to the PDF files (one per year)
After exploring some alternatives, I decided to use the [_rvest_](https://cran.r-project.org/web/packages/rvest/index.html) package to read and parse the simple HTML page (thanks, USDOS webmaster!) that contains the links to the PDF files on the PGU website. A quick analysis of the page shows that regular expressions are enough to extract the links into a table:
```{r extract_links, echo=TRUE, message=FALSE}
require(tidyverse)
require(rvest)
pdf_pattern <- 'http://www[.]state[.]gov/documents/organization/[0-9]{6}[.]pdf'
## Extract the first regex match from a string; return NA when there is no match
regex_extract <- function(y, pattern, ...) {
res <- regmatches(y, regexpr(pattern, y, ...))
if(length(res) == 0) {return(NA)} else {return(res)}
}
## Extract PDF links
read_html('http://www.state.gov/s/cpr/c29447.htm') %>%
html_node(xpath = "//table") %>%
html_nodes(xpath = '//a') %>%
## grab only links to PDF files for which the label is a year
grep(pattern = '[.]pdf\"><b>20[0-9]{2}</b>', value=TRUE) %>%
tbl_df() %>%
## strip down the surroundings
mutate(
pdf_link = map_chr(value, regex_extract, pdf_pattern),
pdf_year = gsub('</?b>', '', map_chr(value, regex_extract, '<b>[0-9]{4}</b>'))
) %>%
select(-value) -> link_table
link_table
```
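As a quick sanity check, `regex_extract` can be exercised on made-up inputs (the anchor tag below is fabricated for illustration; the pattern is the real one from above):
```{r regex_extract_demo, echo=TRUE, message=FALSE}
## A fabricated link in the shape the page uses
regex_extract('<a href="http://www.state.gov/documents/organization/123456.pdf"><b>2015</b></a>', pdf_pattern)
regex_extract('no link in this string', pdf_pattern) # no match, so NA
```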
### Step 2: Extract text from the PDFs
Each PDF contains a multi-page table, so a text-extraction effort was necessary. I tried several packages, and `pdftools` proved to be the most effective. Unfortunately, I was only able to extract raw unstructured text (other packages such as `tabulizer`, and an image-extraction & OCR approach, did not work well), so I ended up with one string per printed line. I then removed the various page headers/footers to keep only the table text. In addition, these PDFs contain tables from multiple agencies, so I selected only the pages that describe presidential gifts. The title of these tables changes over time, so I used a rather long regex to pick out the right ones (this can probably be done more elegantly).
```{r parse_pdf,echo=TRUE, message=FALSE}
agency_select_pattern <-
'(PRESIDENT OF THE U.S. AND THE NATIONAL SECURITY COUNCIL|WHITE HOUSE OFFICE AND THE NATIONAL SECURITY COUNCIL|THE WHITE HOUSE.{1}EXECUTIVE OFFICE OF THE PRESIDENT|EXECUTIVE OFFICE OF THE PRESIDENT)'
get_pdf_table_text = function(
pdf_link,
page_start_pattern = 'Federal Register / Vol',
page_end_pattern = '[ ]+VerDate',
table_pattern = '^[ ]+AGENCY:',
redundent_pattern1 = '(mstockstill|[A-Za-z]+ on [A-Z0-9]*PROD[A-Z0-9]* with (NOTICES|MISCELLANEOUS)|VerDate)',
agency_select_pattern = '.*'
) {
require(pdftools)
require(tidyverse)
x <- pdf_text(pdf = pdf_link) %>%
strsplit(split = '\r\n') %>%
unlist() %>%
setNames('value') %>%
tbl_df() %>%
## Convert to UTF8 for display compatibility
#mutate(value_ascii = iconv(value, from = 'UTF-8', to = 'latin1')) %>%
## Find the start of the page
mutate(
page = grepl(pattern = page_start_pattern, x = value),
page_end = grepl(pattern = page_end_pattern, x = value),
table_start = grepl(pattern = table_pattern , x = value)
) %>%
mutate(
page = cumsum(page),
table_start_sub_header = lag(table_start)) %>%
## Number the tables in each page
mutate(table_number = unlist(tapply(table_start/1000, page, cumsum))) %>%
## Remove text from cover page
filter(table_number > 0) %>%
mutate(table_number = page + table_number) %>%
## name the tables (for filtering)
group_by(table_number) %>%
mutate(table_name = head(value,1)) %>%
ungroup() %>%
## Cleanup 1: get rid of page headers/footers and other redundant lines
filter(
!(table_start | table_start_sub_header | page_end) &
!grepl(pattern = redundent_pattern1, x = value)
) %>%
## Cleanup 2: filter only the required types of tables
filter(grepl(pattern = agency_select_pattern, x = table_name)) %>%
## Cleanup 3: remove special characters
#mutate(table_name = trimws(gsub(pattern = '([ ]{2,}|^[::ascii::])', replacement = ' ', x = table_name))) %>%
## Cleanup 4: drop the helper columns used for parsing
select(-page_end, -table_start, -table_start_sub_header) %>%
return()
}
link_table$pdf_link[2] %>% get_pdf_table_text(agency_select_pattern = agency_select_pattern) -> x1
x1 %>% head(5)
```
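The heavy lifting in the first pipe step is done by `pdf_text()`, which returns one long string per page; these documents use CRLF line breaks, so splitting on `\r\n` yields one element per physical line. A minimal sketch with fake "pages" standing in for the `pdf_text()` output:
```{r pdf_line_split_demo, echo=TRUE, message=FALSE}
## Made-up page strings, only to illustrate the split step above
fake_pages <- c('   AGENCY: EXAMPLE\r\nrow 1 of the table', 'row 2 of the table\r\n   VerDate footer')
fake_pages %>% strsplit(split = '\r\n') %>% unlist()
```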
### Step 3: Transform the raw text into table columns
The lines are not uniform in length, probably because not all the table columns are filled and no trailing spaces are emitted after the last character of a line.
```{r test_value_length, echo=TRUE, message=FALSE}
x1 %>% ggplot(aes(x = nchar(value)-50)) + geom_histogram(fill = 'grey') + theme_classic() + xlab('Text length')
```
I thought of splitting on white space (catching every white-space cluster surrounded by other characters), but since there are many white-space clusters in the text, we first need to identify which ones are separators between columns and which are just spaces within the text (an artifact of the PDF text extraction). Below is a map of the white space within each line captured from the PDF:
```{r white_space_map, echo=TRUE, message=FALSE}
code_row_pattern <- function(z, window) {
## Capture the pattern and the "anti pattern"
y_yes <- gregexpr(z, pattern = '[\\s]{1,}', perl = TRUE)
y_no <- gregexpr(z, pattern = '[^\\s]{1,}', perl = TRUE)
## If there's a match, build the vector
if (unlist(y_yes)[1] != -1) {
lengths_yes <- attr(y_yes[[1]], 'match.length')
lengths_no <- attr(y_no[[1]], 'match.length')
n <- length(lengths_no) + length(lengths_yes)
length_vec <- rep(NA, n)
rep_vec <- rep(NA, n)
odds <- 1:n %% 2 == as.numeric(unlist(y_yes)[1] == 1)
length_vec[odds] <- lengths_yes
length_vec[!odds] <- lengths_no
rep_vec[odds] <- lengths_yes
rep_vec[!odds] <- 0
return(rep(rep_vec, length_vec)[1:window])
} else {
return(NULL)
}
}
n_chars <- 170
x1 %>%
#head(10) %>%
mutate(
ws_coded = map(substring(value, 50), code_row_pattern, window = n_chars),
rownum = 1:n()
) %>%
select(ws_coded, rownum) %>%
unnest(ws_coded) %>%
mutate(position = 0:(n()-1) %% n_chars) %>%
ggplot(aes(position, rownum)) +
geom_tile(aes(fill = ifelse(ws_coded > 0, 'white space', 'text'))) +
theme(legend.title=element_blank(), legend.position="bottom") +
labs(title = "White Space Map", x = "Position", y = "Row")
```
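As an aside, the `gregexpr` bookkeeping inside `code_row_pattern` is easiest to see on a toy string: the match object holds the start of each white-space run, and its `match.length` attribute holds the run lengths.
```{r gregexpr_demo, echo=TRUE, message=FALSE}
y <- gregexpr(pattern = '[\\s]{1,}', '  abc  de', perl = TRUE)
unlist(y)                      # start position of each white-space run: 1, 6
attr(y[[1]], 'match.length')   # length of each run: 2, 2
```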
The structure of the extracted text looks relatively stable, except for a single block of text (somewhere after row 700) that has "shifted" to the right while keeping a similar internal structure. A fixed-width approach is therefore impossible (we should expect the same to happen in other files), but relying on white space alone will be tricky too: on the one hand, there are too many white-space blocks within the text itself; on the other, the gaps between columns are not uniform in size (some columns sit closer together than some gaps inside the text). The upshot is that, without line-by-line analysis, it is very difficult to tell which white-space block is a column separator and which is just an artifact of the extraction.
My approach was therefore to use dynamic delimitation. The main assumption is that the column headers are good indicators of where each column begins (not directly, though: the header text is centered, so its position does not coincide exactly with the start of the column).
```{r column_start_position, echo=TRUE, message=FALSE}
## The text that marks the beginning of each column
column_patterns <- c('Name and title of person', 'Gift, date of acceptance', 'Identity of foreign', 'Circumstances justifying')
## Where do the rows start (the headers start on different rows)
#column_row_offset <- c(1, 0, 2, 2)
## A function to use the location of the table header to split a row in the same page
value_row_split <- function(x, y, n = 1, offset = c(0, -7, -5, -8, -6), split_chr =';') {
p <- as.numeric(strsplit(y, split = split_chr)[[1]][(n-1):n])
if(n == 1) {p <- c(1,p)}
p[is.na(p)] <- 1000000L
## If there's no match returns NA
if(p[2] == -1) {return(NA)}
o <- offset[n:(n+1)]
o[is.na(o)] <- 0
return(substring(x, first = p[1] + o[1] , last = p[2] + o[2] - 1))
}
## A function to process an entire DF
split_row_text4 <- function(y, column_patterns) {
y %>%
## extract a string of column start position based on the text
mutate(column_starts = map(value, function(z) {map(column_patterns, regexpr, text = z)})) %>%
## Convert list into a single char with ';' separators
mutate(column_starts = map_chr(column_starts, paste, collapse = ';')) %>%
## Separate into 4 columns
separate(col = column_starts, into = paste0('Start',1:4), sep = ';', remove = TRUE, convert = TRUE) %>%
## use a single locator for each page (there should be one positive number per page, the rest are -1's due to the regexp output)
group_by(page) %>%
mutate(Start1 = max(Start1), Start2 = max(Start2), Start3 = max(Start3), Start4 = max(Start4)) %>%
ungroup() %>%
## Paste together again for easier digestion of "map2", and remove columns
mutate(column_starts = paste(Start1, Start2, Start3, Start4, sep = ';')) %>%
select(-Start1, -Start2, -Start3, -Start4) %>%
## Use the new string to create 4 columns
mutate(
Receiver = map2_chr(value, column_starts, value_row_split, n=2),
Gift_details = map2_chr(value, column_starts, value_row_split, n=3),
From = map2_chr(value, column_starts, value_row_split, n=4),
Justification = map2_chr(value, column_starts, value_row_split, n=5)
) %>%
## Cleanup: drop the columns used for processing
select(-column_starts) %>%
return()
}
x1 %>% split_row_text4(column_patterns = column_patterns) -> x2
# write.table(x2, file = 'clipboard-4096', sep = '\t', row.names = FALSE)
x2 %>% select(Receiver, Gift_details, From, Justification) %>% head(10)
```
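To see how the header text anchors the columns, here is a toy check (the header line is fabricated, but mimics the layout) of `regexpr` locating the start positions that `value_row_split` later consumes:
```{r column_locate_demo, echo=TRUE, message=FALSE}
fake_header <- paste0('   Name and title of person      Gift, date of acceptance',
                      '    Identity of foreign     Circumstances justifying')
## One start position per column header; -1 would mean "not on this line"
sapply(column_patterns, regexpr, text = fake_header)
```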
Word wrapping is prevalent throughout the document, causing single gift entries to be split across multiple rows; luckily, the first line of each entry is offset by 2 characters to the left. I identify these rows, create a `row_id` column per page, and use it to concatenate the rows of each entry into one (not forgetting to trim white spaces first). I also removed the table header (the first 7 rows of each table) based on the beginning of the text.
```{r group_by_actual_row, echo=TRUE, message=FALSE}
## A function to group multiple rows into one by offset patterns
group_by_actual_row <- function(z, filter_pattern = '(Name and title of person accepting|the gift on behalf of the|^[ ]*$)') {
z %>%
## Identify left-indented rows in "Gift_details" and "Receiver"
mutate(
ident_Re = regexpr(pattern = '^[ ]+', text = Receiver) %>% attr('match.length'),
ident_GD = regexpr(pattern = '^[ ]+', text = Gift_details) %>% attr('match.length')
) %>%
## Create a lag / lead column for both
mutate(
ident_Re_lag = lag(x = ident_Re, n = 1, default = 0),
ident_Re_lead = lead(x = ident_Re, n = 1, default = 1e9),
ident_GD_lag = lag(x = ident_GD, n = 1, default = 0),
ident_GD_lead = lead(x = ident_GD, n = 1, default = 1e9)
) %>%
## An entry starts on a line that is less indented than both the previous and the next line
mutate(row_start = as.numeric(ifelse((ident_Re < ident_Re_lead & ident_Re < ident_Re_lag) | (ident_GD < ident_GD_lead & ident_GD < ident_GD_lag), 1, 0))) %>%
## Number each entry (per page) and collapse its lines into a single row
mutate(row_id = cumsum(row_start)/100 + page) %>%
group_by(row_id) %>%
# filter(!grepl(x = Receiver, pattern = filter_pattern)) %>%
summarise(
Receiver = paste(Receiver, collapse = ' '),
Gift_details = paste(Gift_details, collapse = ' '),
From = paste(From, collapse = ' '),
Justification = paste(Justification, collapse = ' '),
control_text = paste(value, collapse = ';')
# Year = as.numeric(link_table$pdf_year[1])
) %>%
## Remove the first line of each table - usually contains headers
filter(!grepl(pattern = filter_pattern, x = Receiver)) %>%
#select(-row_id) %>%
return()
}
x2 %>% group_by_actual_row() -> x4
x4 %>% head(5)
```
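The indentation trick deserves a standalone illustration. On a made-up wrapped column, the length of the leading white space, compared with its lag and lead, flags entry starts exactly as in the function above:
```{r row_start_demo, echo=TRUE, message=FALSE}
## Fabricated rows: entries start flush left, wrapped continuations are indented
tibble(Receiver = c('The Honorable A', '  (wrapped)', '  (wrapped)', 'The Honorable B')) %>%
  mutate(ind = attr(regexpr('^[ ]+', Receiver), 'match.length')) %>%
  mutate(row_start = as.numeric(ind < lead(ind, default = 1e9) & ind < lag(ind, default = 0))) %>%
  mutate(row_id = cumsum(row_start))
```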
For the final cleanup, we need to correct the text a bit: collapse multiple white spaces, strip the dot leaders, and rejoin words that were split across lines with hyphens:
```{r clean_spaces_commas, echo=TRUE, message=FALSE}
clean_col_text <- function (y) {
require(magrittr)
y %>%
gsub(pattern = '[.]{2,}', replacement = '') %>%
gsub(pattern = '([ ]+)', replacement = ' ') %>%
gsub(pattern = '- ', replacement = '') %>%
trimws() %>%
return()
}
x4 %>%
mutate(
row_id = as.numeric(row_id),
Receiver = clean_col_text(Receiver),
Gift_details = clean_col_text(Gift_details),
From = clean_col_text(From),
Justification = clean_col_text(Justification)
)-> x5
```
So we can compare the "before":
```{r clean_spaces_commas_before, echo=FALSE, message=FALSE}
print(x4$Gift_details[1])
```
and the "after":
```{r clean_spaces_commas_after, echo=FALSE, message=FALSE}
print(x5$Gift_details[1])
```
### Step 4: Extract more data from the gift description column
The `Gift_details` column seems to contain an internal structure of four parts (a made-up example follows the list):
1. A short description of the gift
2. Date of receipt
3. Value estimation (I assume it's always USD)
4. Disposition
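For illustration, here is a fabricated `Gift_details` value with this structure, probed with simplified stand-ins for the delimiters that `gd_patterns` below looks for (the real patterns handle many more spelling variants):
```{r gift_details_demo, echo=TRUE, message=FALSE}
## A made-up record; the em-dashes (\u2014) mirror the PDFs
gd <- paste0('Sterling silver tray. Rec\u2019d\u20141/20/2009. ',
             'Est. Value\u2014$350.00. Disposition\u2014National Archives.')
sapply(c('Rec\u2019d\u2014', 'Est[.] Value\u2014', 'Disposition\u2014'), regexpr, text = gd)
```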
```{r gift_detail_extract, echo=TRUE, message=FALSE}
gd_patterns <- c('([ ]|[.]|[.][ ]){0,1}(Re[\u2019]{0,1}c[\u2019]{0,1}d([\u2014]{0,}| )|Received[\u2014]|Rec[\u2014])', ' E(s|)(t|)[.]{0,1} Val(u|)(e|)([\u2014]{0,}| )', ' (Disposition|Location)([\u2014]{0,}| )')
## This function makes sure that the vector of match positions contains no -1's and is ordered sensibly: if there is no match, the distance between consecutive splits is 0.
monotone_gd <- function(y, len) {
for (i in length(y):1) {
if(y[i] == -1) {y[i] <- ifelse(i == length(y), len, y[i+1])}
}
return(y)
}
extract_gift_details <- function(z, patterns) {
z %>%
## Add a column containing the breakpoints.
mutate(gd_splits = Gift_details %>% map(function(z) {map(patterns, regexpr, text = z)})) %>%
## Make sure the gd_splits vector is monotone (capped at nchar(Gift_details))
mutate(gd_splits = map2(gd_splits, nchar(Gift_details), monotone_gd)) %>%
## concatenate as a string for the split function
mutate(gd_splits = gd_splits %>% map_chr(paste, collapse = ';')) %>%
## Use the text column "gd_splits" to break down "Gift_details"
mutate(
Gift = map2_chr(Gift_details, gd_splits, value_row_split, n=1, offset = NA),
Rec_Date = map2_chr(Gift_details, gd_splits, value_row_split, n=2, offset = NA),
Value = map2_chr(Gift_details, gd_splits, value_row_split, n=3, offset = NA),
Disposition = map2_chr(Gift_details, gd_splits, value_row_split, n=4, offset = NA)
) %>%
## Cleanup
#select(-Gift_details, -gd_splits) %>%
return()
}
x5 %>% extract_gift_details(patterns = gd_patterns) -> x6
x6 %>% head(5)
```
### Step 5: Final clean-up
1. Simple clean-up of `Gift` and `Disposition`:
```{r gift_detail_cleanup01, echo=TRUE, message=FALSE}
x6 %>%
mutate(
Gift = Gift %>% trimws(),
Disposition = Disposition %>% map(function(y) {substring(y, first = regexpr(pattern = '\u2014', text = y)[1]+1)}) %>% trimws()
) -> x6
```
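The em-dash split used above for `Disposition`, shown on a made-up value:
```{r disposition_demo, echo=TRUE, message=FALSE}
d <- 'Disposition\u2014National Archives.'
substring(d, first = regexpr(pattern = '\u2014', text = d)[1] + 1)
```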
2. Identify and extract multiple date patterns:
```{r gift_detail_cleanup02, echo=TRUE, message=FALSE}
short_month_patterns <- '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
long_month_patterns <- '(January|February|March|April|May|June|July|August|Sep[ ]{0,1}tember|October|November|December)'
us_yyyy_date_pattern <- '[0-9]{1,2}[ ]{0,1}/[ ]{0,1}[0-9]{1,2}[ ]{0,1}/[ ]{0,1}20[0-9]{2}'
us_yy_date_pattern <- '[0-9]{1,2}[ ]{0,1}/[ ]{0,1}[0-9]{1,2}[ ]{0,1}/[ ]{0,1}[0-9]{2}'
oracle_date_pattern <- paste0('[0-9]{1,2}[ ]{0,1}(-|\u2013)[ ]{0,1}', short_month_patterns, '[ ]{0,1}(-|\u2013)[ ]{0,1}[0-9]{2}')
long_date_pattern <- paste0(long_month_patterns,'[ ]{0,1}[0-9]{1,2}[,]{0,1}[ ]{0,1}20[0-9]{2}')
no_day_date_pattern <- paste0(long_month_patterns,' 20[0-9]{2}')
x6 %>%
mutate(
Rec_Date_US_yyyy = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, us_yyyy_date_pattern)), '%m/%d/%Y'),
Rec_Date_US_yy = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, us_yy_date_pattern)), '%m/%d/%y'),
Rec_Date_Oracle = as.Date(gsub('\u2013', '-', gsub(' ', '', map_chr(Rec_Date, regex_extract, oracle_date_pattern))), '%d-%b-%y'),
Rec_Date_Long = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, long_date_pattern)), '%B%d,%Y'),
Rec_Date_No_Day = as.Date(paste0('01 ', map_chr(Rec_Date, regex_extract, no_day_date_pattern)), '%d %B %Y')
) %>%
mutate(Gift_Rec_Date =
if_else(!is.na(Rec_Date_US_yyyy), Rec_Date_US_yyyy,
if_else(!is.na(Rec_Date_US_yy), Rec_Date_US_yy,
if_else(!is.na(Rec_Date_Oracle), Rec_Date_Oracle,
if_else(!is.na(Rec_Date_Long), Rec_Date_Long, Rec_Date_No_Day))))
) -> x6
```
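A quick sanity check of the format strings on made-up dates (note that parsing `%b`/`%B` month names assumes an English locale):
```{r date_format_demo, echo=TRUE, message=FALSE}
as.Date('1/20/2009', '%m/%d/%Y')       # US style, 4-digit year
as.Date('20-Jan-09', '%d-%b-%y')       # "Oracle" style
as.Date('January20,2009', '%B%d,%Y')   # long form, after spaces are stripped
as.Date('01 January 2009', '%d %B %Y') # no-day form, after prefixing "01 "
```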
3. Identify and extract USD values of gifts:
```{r gift_detail_cleanup03, echo=TRUE, message=FALSE}
x6 %>%
mutate(
Value_tmp1 = Value %>% map_chr(regex_extract, '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?'),
Value_tmp2 = Rec_Date %>% map_chr(regex_extract, '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?')
) %>%
mutate(Value_USD = gsub('[$]', '', if_else(!is.na(Value_tmp1), Value_tmp1, Value_tmp2))) -> x6
```
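The dollar-amount regex, applied via `regex_extract` from Step 1 to fabricated strings:
```{r value_regex_demo, echo=TRUE, message=FALSE}
vals <- c('Est. Value\u2014$ 1,234.56.', 'Value\u2014$350.', 'no value given')
sapply(vals, regex_extract, pattern = '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?')
```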
4. Filter out the remaining table-header rows and rows with missing details in the text (see the list of issues below for details), and remove the "technical" columns:
```{r gift_detail_cleanup04, echo=TRUE, message=FALSE}
x6 %>%
filter(!is.na(Gift_Rec_Date) & !is.na(Value_USD)) %>%
## Cleanup of non-informative columns
select(-Gift_details, -control_text, -gd_splits, -starts_with('Rec_Date'), -starts_with('Value_tmp'), -Value) -> x6
head(x6)
```
### Step 6: Rinse and repeat...
Running through the different PDF links, we can aggregate data from multiple years. The one issue I found was that the table names for presidential gifts have changed over the years, but the tidy approach keeps the pipeline wonderfully readable. You could actually pipe all the functions into one another in a single `mutate` statement, but I preferred to split the steps for easier debugging.
```{r full_gift_extract, echo=TRUE, message=FALSE}
link_table %>%
mutate(text_df = pdf_link %>% map(get_pdf_table_text, agency_select_pattern = agency_select_pattern)) %>%
## for testing only: readRDS('gifts01.RDS') %>%
mutate(text_df = text_df %>% map(split_row_text4, column_patterns = column_patterns)) %>%
## Filter some problems on 2003 pages 31,32
mutate(text_df = text_df %>% map(filter, !is.na(Receiver))) %>%
mutate(text_df = text_df %>% map(group_by_actual_row)) %>%
mutate(text_df = text_df %>% map(mutate,
row_id = as.numeric(row_id),
Receiver = clean_col_text(Receiver),
Gift_details = clean_col_text(Gift_details),
From = clean_col_text(From),
Justification = clean_col_text(Justification)
)) %>%
mutate(text_df = text_df %>% map(extract_gift_details, patterns = gd_patterns)) %>%
unnest(text_df) %>%
## simple clean-ups of Gift & Disposition
mutate(
Gift = Gift %>% trimws(),
Disposition = Disposition %>% map(function(y) {substring(y, first = regexpr(pattern = '\u2014', text = y)[1]+1)}) %>% trimws()
) %>%
## More complex clean-ups: extract dates from multiple formats
mutate(
Rec_Date_US_yyyy = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, us_yyyy_date_pattern)), '%m/%d/%Y'),
Rec_Date_US_yy = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, us_yy_date_pattern)), '%m/%d/%y'),
Rec_Date_Oracle = as.Date(gsub('\u2013', '-', gsub(' ', '', map_chr(Rec_Date, regex_extract, oracle_date_pattern))), '%d-%b-%y'),
Rec_Date_Long = as.Date(gsub(' ', '', map_chr(Rec_Date, regex_extract, long_date_pattern)), '%B%d,%Y'),
Rec_Date_No_Day = as.Date(paste0('01 ', map_chr(Rec_Date, regex_extract, no_day_date_pattern)), '%d %B %Y')
) %>%
mutate(Gift_Rec_Date =
if_else(!is.na(Rec_Date_US_yyyy), Rec_Date_US_yyyy,
if_else(!is.na(Rec_Date_US_yy), Rec_Date_US_yy,
if_else(!is.na(Rec_Date_Oracle), Rec_Date_Oracle,
if_else(!is.na(Rec_Date_Long), Rec_Date_Long, Rec_Date_No_Day))))
) %>%
## Extract USD values
mutate(
Value_tmp1 = Value %>% map_chr(regex_extract, '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?'),
Value_tmp2 = Rec_Date %>% map_chr(regex_extract, '[$][ ]{0,1}\\d{1,3}(,\\d{3})*(\\.\\d+)?')
) %>%
mutate(Value_USD = gsub('[$]', '', if_else(!is.na(Value_tmp1), Value_tmp1, Value_tmp2))) %>%
## Filter remaining headers + 3 gifts with some missing details in text (see list of issues for details)
filter(!is.na(Gift_Rec_Date) & !is.na(Value_USD)) %>%
## Cleanup of non-informative columns
select(-Gift_details, -control_text, -gd_splits, -starts_with('Rec_Date'), -starts_with('Value_tmp'), -Value) -> gifts
write_csv(gifts, paste0('protocol_gifts_',min(gifts$pdf_year), '_', max(gifts$pdf_year),'.csv'))
```
## Current Issues:
1. The `Justification` column does not split correctly for about 2.5% of rows (about 500 out of ~20k rows in the table prior to joining lines together). This affects the text of this column and of the `From` column, and means the text quality in these columns is not good enough for analysis.
2. In older files, the `Receiver` column is not filled in for every gift (the receiver is mentioned only once per page). This caused some gifts to be bundled together (I extracted details only for the first gift). You can explore these cases by looking for rows with a very long `Gift` or `Gift_details` value.
3. Exclusions:
    1. The last 2 pages of the 2003 PDF do not parse correctly. I excluded them with the `!is.na(Receiver)` filter.
    2. Problematic dates: 8 gifts were excluded from the analysis because their dates did not follow any recognizable pattern (e.g. 19/2-4/2005). You can remove/invert the `filter(!is.na(Gift_Rec_Date))` and look through the values to find them (the other filtered lines are table headers).
    3. No value: 1 gift ("round table made of hardened lava") from 17/01/2003 did not have a value estimate.