sa-express-news/hhsc-scraper

A node-based web scraper for pulling non-compliance cases from the TX HHSC website

What is this?

Texas Health and Human Services is responsible for compiling data on Texas DFPS "non-compliance" cases at child placing agencies and residential treatment centers for foster children. An operation can be considered out of compliance for a wide range of violations - from expired fire extinguishers to sexual abuse of foster children.

HHSC provides this data publicly on the TX open data portal. Unfortunately, the provided dataset is incomplete: it doesn't record data from child placing agency branch facilities (only their main facilities). This scraper exists to generate a more thorough dataset of TX non-compliance cases.

How does it work?

The operations responsible for caring for and managing foster children in Texas – child placing agencies and the like – are all assigned an operation ID by DFPS. IDs appear to be assigned sequentially; however, there is no logical pattern that allows us to infer which numbers correspond to existing IDs. For example, operations may be assigned IDs 50003, 50004 and 50005, followed by a seemingly arbitrary gap in assignment that resumes 2,000 numbers later at 52003. What we do know is that, as of this writing, all IDs lie between 50,000 and 1,400,000.

This scraper works by attempting to visit the "Operation Details" page for each possible ID you ask it to investigate. When a facility is found and has been cited for non-compliance, its details are scraped from this page. Afterwards, the facility's compliance history page is opened and each non-compliance case is scraped from the table.
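In rough outline, each possible ID is handled like the sketch below. This is only an illustration of the flow just described; the helper functions and types are stand-ins, not the project's actual modules.

```typescript
// Simplified control flow for a range of possible IDs. The two helpers are
// placeholders standing in for the real scraping modules under src/.
interface FacilityDetails { operation_id: number; num_deficiencies_cited: number; }
interface DeficiencyCase { activity_date: string; narrative: string; }

const fetchOperationDetails = async (id: number): Promise<FacilityDetails | null> => null; // placeholder
const scrapeComplianceHistory = async (id: number): Promise<DeficiencyCase[]> => [];        // placeholder

const scrapeRange = async (start: number, finish: number) => {
    const rows: Array<FacilityDetails & DeficiencyCase> = [];
    for (let id = start; id <= finish; id++) {
        const facility = await fetchOperationDetails(id); // null when no facility exists under this ID
        if (facility && facility.num_deficiencies_cited > 0) {
            const cases = await scrapeComplianceHistory(id); // one entry per non-compliance case
            cases.forEach(c => rows.push({ ...facility, ...c }));
        }
    }
    return rows; // one row per case, with the facility columns repeated on each row
};
```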

All non-compliance cases are aggregated into a .csv file named "hhsc-deficency-data". One row represents one case and includes case-specific and facility-specific columns. This file is saved to disk and pushed to the Express-News account on data.world. Before the scrape begins, if a "hhsc-deficency-data" file already exists on data.world, it is pulled from the server and the newly scraped cases are added to the existing spreadsheet before it is pushed back to the server.

Another file, titled "hhsc-deficency-data-backup", which is simply a copy of "hhsc-deficency-data" in its state prior to the current scrape, is also pushed to the server and saved to disk to protect existing data in case something goes wrong. If the shell script run.sh is used to make multiple scrapes at a time, each version will also be saved to disk in the temp directory.
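As a rough illustration of that merge-and-backup step (the file names follow the README; the .csv extension, the string-level merge and the helper name are assumptions):

```typescript
// Sketch only: merge newly scraped rows into the existing CSV pulled from
// data.world, after first saving the pre-scrape state as a backup.
import * as fs from 'fs';

const mergeAndBackup = (existingCSV: string, newCSV: string): string => {
    // Preserve the existing data exactly as it was before this scrape.
    fs.writeFileSync('hhsc-deficency-data-backup.csv', existingCSV);

    // Append the new cases, dropping the header row of the new file.
    const newRows = newCSV.split('\n').slice(1).join('\n');
    const merged = existingCSV.trimEnd() + '\n' + newRows;
    fs.writeFileSync('hhsc-deficency-data.csv', merged);

    return merged; // this merged file is what gets pushed back to data.world
};
```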

In what format is the data saved?

Here is a data dictionary describing the format of each row in the "hhsc-deficency-data" CSV file (a sketch of the row shape as a TypeScript interface follows the two column lists):

Facility specific columns

  • uniq_id <number>: A unique row ID generated by the scraper
  • operation_id <number>: The facility's unique ID
  • operation_number <string>: Each facility also has a unique number, which often includes dashes
  • operation_type <string>: Either Child Placing Agency or General Residential Operation
  • operation_name <string>: The facility's name
  • programs_provided <string>: Child Placement Agency, Child Care Services Only, Residential Treatment Center or Multiple Services
  • location_address <string>: The facility's street address
  • phone <string>: The facility's phone number
  • county <string>: The facility's county
  • website <string>: The facility's website
  • email <string>: The facility's email address
  • type_of_issuance <string>: The facility's license
  • issuance_date <date>: When the facility was assigned its license
  • open_foster_homes <number>: Will be populated if this is a CPA
  • open_branch_offices <number>: Will be populated if this is a CPA
  • num_admin_penalties <number>: Will be populated if this is a CPA
  • is_main_branch <boolean>: True if the facility is a GRO or the main branch of a CPA; otherwise false
  • corrective_action <boolean>: Has corrective action been taken against the facility?
  • adverse_action <boolean>: Has adverse action been taken against the facility?
  • temporarily_closed <boolean>: Has the state temporarily closed the facility?
  • num_deficiencies_cited <number>: Total number of non-compliance cases

Deficiency (non-compliance) specific columns

  • activity_date <date>: The date the deficiency was cited
  • standard_number_description <string>: The code used to classify the deficiency
  • activity_type <string>: How did DFPS learn about this incident?
  • standard_risk_level <string>: How bad was this offense?
  • corrected_at_inspection <boolean>: Was the deficiency corrected at inspection?
  • corrected_date <date>: When was the problem corrected?
  • date_correction_verified <date>: When did the state verify the correction?
  • technical_assistance_given <boolean>: Did the state give assistance with correction?
  • narrative <string>: A description of the non-compliance case
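
Put together, one row of the CSV corresponds to the shape below. This is simply the data dictionary restated as a TypeScript interface, not a type copied from the source code.

```typescript
// One row of "hhsc-deficency-data", restated from the data dictionary above.
// (<date> columns are typed as Date here; in the CSV itself they are plain text.)
interface DeficiencyRow {
    // Facility-specific columns
    uniq_id: number;
    operation_id: number;
    operation_number: string;
    operation_type: string;
    operation_name: string;
    programs_provided: string;
    location_address: string;
    phone: string;
    county: string;
    website: string;
    email: string;
    type_of_issuance: string;
    issuance_date: Date;
    open_foster_homes: number;
    open_branch_offices: number;
    num_admin_penalties: number;
    is_main_branch: boolean;
    corrective_action: boolean;
    adverse_action: boolean;
    temporarily_closed: boolean;
    num_deficiencies_cited: number;
    // Deficiency-specific columns
    activity_date: Date;
    standard_number_description: string;
    activity_type: string;
    standard_risk_level: string;
    corrected_at_inspection: boolean;
    corrected_date: Date;
    date_correction_verified: Date;
    technical_assistance_given: boolean;
    narrative: string;
}
```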

How do I run the scraper?

Getting set up

  1. Clone the repo: git clone https://github.com/sa-express-news/hhsc-scraper.git
  2. Install the dependencies: npm install
  3. Configure the data.world API connection (members of the Express-News data.world team only need to complete sub-step 3, configuring .env):
    1. Create a dataset in data.world
    2. Open src/pushToServer and src/pullFromServer and configure the API URLs to match your paths
    3. Grab your API key from data.world and save it to a .env file in the repo's root (see the sketch after this list)
  4. Run npm run test to check that everything is working. Note that a couple of end-to-end tests may fail if a facility has accrued more deficiencies since the test was written; you might need to update or ignore those tests accordingly. All tests are saved under spec.ts in their respective directories.
  5. Run tsc to compile the TypeScript.
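
The .env file is just a key=value line holding your API token. The variable name and the use of the dotenv package below are assumptions for illustration only; check src/pushToServer and src/pullFromServer to see exactly what the code reads.

```typescript
// Illustration only: DATA_WORLD_API_TOKEN is a placeholder name, not
// necessarily the variable the repo's code looks for.
import * as dotenv from 'dotenv';

dotenv.config(); // loads key=value pairs from the .env file in the repo root

const apiToken = process.env.DATA_WORLD_API_TOKEN;
if (!apiToken) {
    throw new Error('data.world API token missing from .env');
}
```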

Running a single scrape

The scraper can be run with the command npm run scraper. If no options are passed, this command will grab the attempted-ids.json file (details on this below) from the server, read the last attempted ID from that file and then begin scraping sequentially at the next possible ID above it. It will attempt to scrape 10,000 sequential possible IDs before shutting down.

Scraper options

All of these are optional; an illustrative way of passing them is sketched after the list.

  • throttle: This is the number of IDs to attempt simultaneously. The default is 10. Keep in mind that the HHSC website is extremely buggy and attempting too many IDs at once can lead to errors, which will be logged in logs/errors.log.
  • scope: This is the number of IDs to attempt in this scrape from the starting ID. For example, if start is set to 10 and scope is set to 1000, the attempted IDs will be 10-1009. This option is ignored if finish is passed. Default is 10,000.
  • start: The ID number to start the scrape at.
  • finish: The ID number to stop the scrape at.
  • specific: This is a list of comma-separated IDs that, if specified, will be the only IDs the scraper attempts to scrape.
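
The README doesn't pin down the exact command-line syntax, but assuming the options are passed as key=value arguments (for example, npm run scraper -- start=50000 scope=1000), they could be parsed along the lines of the sketch below. This is an illustration, not the project's actual option handling.

```typescript
// Sketch of reading the documented options from process.argv, assuming a
// key=value style such as `npm run scraper -- start=50000 scope=1000`.
interface ScraperOptions {
    throttle: number;     // IDs to attempt simultaneously (default 10)
    scope: number;        // IDs to attempt from the starting ID (default 10,000)
    start?: number;       // ID to start the scrape at
    finish?: number;      // ID to stop the scrape at
    specific?: number[];  // comma-separated IDs to attempt exclusively
}

const parseOptions = (argv: string[]): ScraperOptions => {
    const options: ScraperOptions = { throttle: 10, scope: 10000 };
    argv.slice(2).forEach(arg => {
        const [key, value] = arg.split('=');
        switch (key) {
            case 'throttle': options.throttle = Number(value); break;
            case 'scope':    options.scope = Number(value); break;
            case 'start':    options.start = Number(value); break;
            case 'finish':   options.finish = Number(value); break;
            case 'specific': options.specific = value.split(',').map(Number); break;
        }
    });
    return options;
};

const options = parseOptions(process.argv);
```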

Running run.sh

run.sh allows you to run the scraper many times in a row, saving the results to the temp directory under sequential file names to back them up. If you open the file, the batch variable can be modified to specify the number of times you want to run the scraper.

To make a full scrape, make sure attempted-ids.json is not reporting anything above 50,000 under last_attempted, set the batch variable to 135 and then let 'er rip (135 batches of 10,000 IDs each covers the full 50,000-1,400,000 range). This should take between 24 and 48 hours to complete.

Understanding attempted-ids.json

attempted-ids.json is a log file that keeps track of what you've been doing with the scraper over time. If the scraper is run without a specific start ID, it is also used to determine where to pick up from (if this is the first time you have run the scraper, it will start at 50,000). The file is saved to the data.world dataset and locally to the logs directory. It has the following properties (sketched as a TypeScript interface after the list):

  • last_successful <number>: The last ID under which a facility was found and successfully scraped
  • last_attempted <number>: The last attempted ID. Will be used by the scraper on restart to pick up where it left off
  • total_from_last_scrape <number>: How many non-compliance cases were found in the last scrape
  • total_in_database <number>: How many rows are in the current iteration of "hhsc-deficency-data"
  • facility_scraped_deficencies_rejected <Array<number>>: If an ID was found at a facility page but something went wrong trying to scrape the corresponding deficiency page, the facility ID is saved to this array and reattempted the next time the scraper starts up
  • facility_timeout_or_alert_page <Array<number>>: If we found what looked like a facility page at this possible ID but couldn't scrape the facility details, the ID is saved to this array and attempted again at the next scrape
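
Restated as a TypeScript interface (inferred from the property list above, not copied from the source):

```typescript
// Shape of attempted-ids.json, as described by the property list above.
interface AttemptedIDs {
    last_successful: number;                          // last ID with a successfully scraped facility
    last_attempted: number;                           // where the next run will pick up from
    total_from_last_scrape: number;                   // cases found in the last scrape
    total_in_database: number;                        // rows currently in "hhsc-deficency-data"
    facility_scraped_deficencies_rejected: number[];  // facility scraped, deficiency page failed
    facility_timeout_or_alert_page: number[];         // facility page found, details not scraped
}
```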

Technical details

The scraper is written in TypeScript. All source code is in the src directory with modules separated in sub-directories therein. Each module has a spec.ts file which includes the unit tests, written with Tape.

The facility pages are requested using Request Promise and scraped using Cheerio. The deficiency pages are complex to navigate and filled with JavaScript, so they are scraped with Google's headless Chromium browser coupled with the Puppeteer library. Deficiency page scrapes are only attempted when a facility is found and successfully scraped, minimizing usage of the less efficient headless browser approach.
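
The combination looks roughly like the sketch below. The URLs and selectors are placeholders, not the real HHSC page structure, and the actual modules are organized differently; this only illustrates the "static request first, headless browser only when needed" split.

```typescript
// Illustrative two-stage scrape: a cheap static request via request-promise +
// Cheerio first, and Puppeteer only once a facility has actually been found.
import * as rp from 'request-promise';
import * as cheerio from 'cheerio';
import * as puppeteer from 'puppeteer';

const scrapeOperation = async (operationID: number) => {
    // Stage 1: fetch the Operation Details page as plain HTML.
    const detailsHTML: string = await rp(`https://example.org/operation-details/${operationID}`); // placeholder URL
    const $ = cheerio.load(detailsHTML);
    const operationName = $('#operationName').text().trim(); // placeholder selector

    if (!operationName) return null; // no facility under this ID

    // Stage 2: the compliance history page is JavaScript-heavy, so drive it
    // with headless Chromium via Puppeteer.
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(`https://example.org/compliance-history/${operationID}`); // placeholder URL
    const deficiencies = await page.evaluate(() =>
        Array.from(document.querySelectorAll('table tr')).map(row => (row.textContent || '').trim())
    );
    await browser.close();

    return { operationID, operationName, deficiencies };
};
```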

Winston is used for error handling and logging.

data.world, via their API, is used as a database, allowing us to push updates in JSON format but query them in SQL via the SQL tool on their platform.

If you'd like to use this scraper without data.world, you could do so by rewriting src/pullFromServer and src/pushToServer to suit your needs. Both modules accept arrays of objects as their input arguments, which you could then handle as needed.
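
For example, both modules could be swapped for something like the sketch below. The function names here simply mirror the module names; the exact signatures of the real modules may differ, so check them before relying on this.

```typescript
// Sketch of replacing the data.world layer with plain local files.
import * as fs from 'fs';

// Stand-in for src/pushToServer: persist the scraped rows locally.
const pushToServer = async (rows: object[]): Promise<void> => {
    fs.writeFileSync('local-deficiency-data.json', JSON.stringify(rows, null, 2));
};

// Stand-in for src/pullFromServer: read back whatever was saved previously.
const pullFromServer = async (): Promise<object[]> => {
    if (!fs.existsSync('local-deficiency-data.json')) return [];
    return JSON.parse(fs.readFileSync('local-deficiency-data.json', 'utf8'));
};
```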

Contact lwhyte AT express-news.net for more details.
