
Runner pipeline stages


Every site being processed (e.g., ca/sf_gov's data feed) goes through a three-stage pipeline called a runner. After the runner completes, the resulting data is loaded into our production database.

The stages are intentionally isolated - you can work on as many or as few stages as you want (for example: it is okay to contribute a fetch stage without a parse stage).

Fetch

| purpose | inputs | outputs |
| --- | --- | --- |
| retrieve raw data from an external source and store it unmodified | path to output directory | arbitrary file types (.json, .zip, .html, etc.), written to output directory |

We store the unmodified data to allow offline work on the parse and normalize stages, as well as to enable re-processing of the pipeline in the event that bugs are discovered and fixed.

As a sample, here is the fetch stage for ca/sf_gov:

#!/usr/bin/env bash

set -Eeuo pipefail

output_dir=""
if [ -n "${1:-}" ]; then
    output_dir="${1}"
else
    echo "Must pass an output_dir as first argument"
    exit 1
fi

(cd "$output_dir" && curl --silent "https://vaccination-site-microservice.vercel.app/api/v1/appointments" -o 'sf_gov.json')

See our documentation on ArcGIS runners for how to find the right ArcGIS feeds and write a fetch.py for them.
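As a very rough sketch (and not the project's actual ArcGIS runner), a fetch.py for an ArcGIS feed often boils down to a single REST query; the service URL below is a placeholder you would replace with the real FeatureServer layer:

#!/usr/bin/env python3
"""Hypothetical ArcGIS fetch.py sketch; the service URL is a placeholder."""

import pathlib
import sys
import urllib.request

output_dir = pathlib.Path(sys.argv[1])

# Standard ArcGIS REST query: every row, every field, returned as json.
url = (
    "https://services.arcgis.com/EXAMPLE_ORG/arcgis/rest/services/"
    "EXAMPLE_LAYER/FeatureServer/0/query?where=1%3D1&outFields=*&f=json"
)
urllib.request.urlretrieve(url, output_dir / "arcgis.json")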

Parse

| purpose | inputs | outputs |
| --- | --- | --- |
| convert raw data to json records, stored as ndjson | path to input directory, path to output directory | .ndjson files, written to output directory |

Many sites offer a json feed during the fetch stage, which makes the parse stage simple (the only step is conversion to ndjson).

For pure json->ndjson conversion, we offer a shared parser. Details on its usage are coming soon, but for now check out the md/arcgis parse.yml which triggers shared parsing logic.
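As a rough illustration of a hand-written parser (not the shared parser itself), a parse stage for a feed like ca/sf_gov could look something like the sketch below; the argument order mirrors the fetch example above, and the sf_gov.json file name and "sites" key are assumptions about the feed's structure:

#!/usr/bin/env python3
"""Hypothetical parse stage sketch for a feed like ca/sf_gov."""

import json
import pathlib
import sys

output_dir = pathlib.Path(sys.argv[1])
input_dir = pathlib.Path(sys.argv[2])

# Load the raw json written by the fetch stage.
raw = json.loads((input_dir / "sf_gov.json").read_text())

# Write one json record per line (ndjson).
with (output_dir / "sf_gov.parsed.ndjson").open("w") as fh:
    for site in raw.get("sites", []):
        fh.write(json.dumps(site) + "\n")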

Normalize

| purpose | inputs | outputs |
| --- | --- | --- |
| transform parsed .ndjson into the Vaccinate the States schema | path to input directory, path to output directory | normalized .ndjson files, written to output directory |

Most fields of the schema are optional - fill out as much as you can!

Check out ak/arcgis normalize.py for a sample normalizer.
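As a loose sketch only, a normalize stage follows the same shape as the parse stage; the field names on both sides below are illustrative stand-ins, not the real Vaccinate the States schema:

#!/usr/bin/env python3
"""Hypothetical normalize stage sketch; field names are illustrative only."""

import json
import pathlib
import sys

output_dir = pathlib.Path(sys.argv[1])
input_dir = pathlib.Path(sys.argv[2])

with (output_dir / "sf_gov.normalized.ndjson").open("w") as out:
    for line in (input_dir / "sf_gov.parsed.ndjson").open():
        site = json.loads(line)
        normalized = {
            # Prefix the source slug so ids stay unique across sites.
            "id": f"sf_gov:{site.get('id')}",
            "name": site.get("name"),
            "address": site.get("address"),
            # Keep the original record alongside the normalized fields.
            "source": {"source": "sf_gov", "data": site},
        }
        out.write(json.dumps(normalized) + "\n")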

Match & Load

| purpose | inputs | outputs |
| --- | --- | --- |
| match normalized .ndjson rows with known vaccination locations in our database, or create new locations as needed | n/a | n/a |

This stage interacts with VIAL - our production Vaccine Information Archive and Library. For each row in the normalized .ndjson, match & load attempts to identify whether the location is already known. If it is, the new information is linked to the known location so VIAL can serve up-to-date data; if it is not, a new location is created.
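Conceptually, the loop looks something like the sketch below; vial_client and its methods are hypothetical stand-ins for the real VIAL API, not code from this repository:

import json

def match_and_load(normalized_path, vial_client):
    """Illustrative match-or-create loop against a hypothetical VIAL client."""
    with open(normalized_path) as fh:
        for line in fh:
            row = json.loads(line)
            # Hypothetical lookup for an existing location matching this row.
            existing = vial_client.find_matching_location(row)
            if existing is not None:
                # Known location: attach the new data so VIAL stays up to date.
                vial_client.link_source_row(existing["id"], row)
            else:
                # Unknown location: create it from the normalized row.
                vial_client.create_location(row)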

⚠️ Development on the match & load stage requires access to VIAL. New contributors should focus on one of the prior stages! ⚠️

If you are a team member looking for more information about developing on this stage, see the readme.