How to run LASD historical scrape #324

erika-tyagi commented Sep 27, 2021

Opening this as a draft PR because it should NOT be merged into the main branch. I just needed a place to add comments and instructions. The usual command-line syntax for running each scrape is below.

To run redo-scrape:

Rscript production/redo_scrape/main_redo.R --scraper lasd --start 2020-11-01 --end 2020-11-04

To run the Wayback Machine (WBM) scrape:

./production/wayback_scrape/main_wayback.R -sc lasd -st 2020-11-01 -en 2021-11-02

Steps to take after scraping but before handing off to volunteers to minimize manual cleaning/entry:

  • Combine extracted files (something like this should work):
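library(data.table)  # fread(), data.table(), rbindlist()
library(magrittr)    # %>%

# Read every extracted file and stack the results into one table
df <- list.files("results/extracted_data/", full.names = TRUE) %>%
    lapply(function(x){
        df_ <- fread(x)
        # Treat empty files as empty tables so they bind cleanly
        if(nrow(df_) == 0){
            df_ <- data.table()
        }
        # Coerce Date to a common type across files
        if("Date" %in% names(df_)){
            df_[,Date := lubridate::as_date(Date)]
        }
        df_
    }) %>%
    rbindlist(fill=TRUE, use.names = TRUE)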
  • If the lag and lead of an NA value are the same, replace the NA with that value
  • Plot each metric over time to visually inspect anomalies
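
The last two steps above could be sketched roughly as follows. This is only an illustration, not code from the repo: `fill_flanked_na()`, `plot_metric()`, the example table, and the `Residents.Confirmed` column name are all hypothetical, and the fill only handles a single NA sitting between two equal values (runs of consecutive NAs are left alone).

```r
library(data.table)

# Fill an NA only when the observation before it and the one after it agree.
# Assumes x is already sorted by date.
fill_flanked_na <- function(x) {
    prev <- shift(x, n = 1, type = "lag")
    nxt <- shift(x, n = 1, type = "lead")
    fillable <- is.na(x) & !is.na(prev) & !is.na(nxt) & prev == nxt
    x[fillable] <- prev[fillable]
    x
}

# Quick base-graphics line plot of one metric over time for eyeballing anomalies.
plot_metric <- function(df, metric) {
    plot(df$Date, df[[metric]], type = "b",
         xlab = "Date", ylab = metric, main = metric)
}

# Hypothetical example data: one row per scrape date
example <- data.table(
    Date = as.Date("2020-11-01") + 0:5,
    Residents.Confirmed = c(10, NA, 10, NA, 12, NA)
)
setorder(example, Date)
example[, Residents.Confirmed := fill_flanked_na(Residents.Confirmed)]

# Only draw when run interactively
if (interactive()) {
    plot_metric(example, "Residents.Confirmed")
}
```

Here only the second row is filled (its neighbors are both 10). The fourth row stays NA because its neighbors differ, and the last row stays NA because it has no following value.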

scraper$validate_extract()
# scraper$validate_extract()

Commenting out the validate_extract() call here to avoid dropping columns we want to keep!

if(lubridate::wday(current_date) %in% c(1, 2, 4, 6)){
cat("On date", as.character(current_date), "\n")
# if(lubridate::wday(current_date) %in% c(1, 2, 4, 6)){

Removing this to pull ALL days (regardless of day-of-the-week).
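
For context, a small sketch of what the removed condition did. With lubridate's default numbering, `wday()` returns 1 for Sunday through 7 for Saturday, so `c(1, 2, 4, 6)` keeps Sunday, Monday, Wednesday, and Friday. The snippet uses base R's `as.POSIXlt()$wday + 1` as a stand-in for that default so it runs without extra packages; the date range and variable names are just examples.

```r
# One week of dates starting on Sunday 2020-11-01
all_days <- seq(as.Date("2020-11-01"), as.Date("2020-11-07"), by = "day")

# as.POSIXlt()$wday runs 0 (Sunday) to 6 (Saturday); adding 1 gives
# 1 (Sunday) to 7 (Saturday), matching lubridate::wday()'s default
wday_num <- as.POSIXlt(all_days)$wday + 1

# Dates the old condition would have scraped
filtered <- all_days[wday_num %in% c(1, 2, 4, 6)]

# Dates scraped once the condition is commented out: every day in the range
print(filtered)
print(all_days)
```

With the filter in place, three of the seven days in this range would be skipped.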

scraper$save_raw()
scraper$restruct_raw()
scraper$extract_from_raw()
# scraper$validate_extract()

Same as above: skipping validation here to avoid dropping columns.

# 3. EXTRACT TABLES
# --------------------------------------------------------------------------

ex_ <- ExtractTable(x)

Note that we use ExtractTable here, which the main scraper does not. Unfortunately, I couldn't find another way to pull the data from the tables reliably. ExtractTable is cheap but not free, and reimbursement is still unclear.
