# Runner pipeline stages
Every site being processed (e.g., `ca/sf_gov`'s data feed) goes through a three-stage pipeline called a *runner*. After the runner is complete, its output is loaded into our production database.
This pipeline is intended to collect stable information about locations that provide vaccinations. This system does not process real-time information that changes more frequently than daily (like available appointment time slots).
The stages are intentionally isolated - you can work on as many stages as you want (for example, it is okay to contribute a `fetch` stage without a `parse` stage).
## Fetch

| purpose | inputs | outputs |
|---|---|---|
| retrieve raw data from an external source and store it unmodified | path to output directory | arbitrary file types (`.json`, `.zip`, `.html`, etc.), written to output directory |
We store the unmodified data to allow offline work on the parse and normalize stages, as well as to enable re-processing of the pipeline in the event that bugs are discovered and fixed.
As a sample, here is the fetch stage for `ca/sf_gov`:
```bash
#!/usr/bin/env bash

set -Eeuo pipefail

output_dir=""
if [ -n "${1:-}" ]; then
    output_dir="${1}"
else
    echo "Must pass an output_dir as first argument"
    exit 1
fi

# Store the raw feed unmodified in the output directory.
(cd "$output_dir" && curl --silent "https://vaccination-site-microservice.vercel.app/api/v1/appointments" -o 'sf_gov.json')
```
If you are fetching from an ArcGIS FeatureServer, you can find information on how to find the right feeds and write a fetcher in our documentation on ArcGIS runners.
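To give a sense of the shape of such a fetcher, here is a hedged sketch in Python; the FeatureServer URL and query parameters below are illustrative assumptions, not a real site's feed:

```python
#!/usr/bin/env python3
# Illustrative sketch only - the FeatureServer URL below is hypothetical;
# see the ArcGIS runner docs for finding the real feed for your site.
import pathlib
import sys
import urllib.parse
import urllib.request

# Hypothetical layer URL; replace with the FeatureServer layer for your site.
LAYER_URL = "https://services.arcgis.com/EXAMPLE/arcgis/rest/services/vaccine_sites/FeatureServer/0"

output_dir = pathlib.Path(sys.argv[1])

params = urllib.parse.urlencode({"where": "1=1", "outFields": "*", "f": "json"})
with urllib.request.urlopen(f"{LAYER_URL}/query?{params}") as response:
    # Store the raw response unmodified, as the fetch stage requires.
    (output_dir / "arcgis.json").write_bytes(response.read())
```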
## Parse

| purpose | inputs | outputs |
|---|---|---|
| convert raw data to JSON records, stored as `ndjson` | path to output directory, path to input directory | `.ndjson` files, written to output directory |
Many sites offer a JSON feed during the fetch stage, which makes the parse stage simple (the only step is conversion to `ndjson`).
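As a rough illustration (not taken from a real runner), a parse stage for such a feed might look like the sketch below; the input filename and the assumption that the feed is a top-level list of records are both hypothetical:

```python
#!/usr/bin/env python3
# Minimal parse-stage sketch, assuming the fetch stage wrote a single JSON
# file whose top level is a list of records - real feed shapes and
# filenames vary, so treat this as illustrative only.
import json
import pathlib
import sys

output_dir = pathlib.Path(sys.argv[1])
input_dir = pathlib.Path(sys.argv[2])

records = json.loads((input_dir / "sf_gov.json").read_text())

# ndjson = newline-delimited JSON: one record per line.
with (output_dir / "sf_gov.parsed.ndjson").open("w") as out:
    for record in records:
        out.write(json.dumps(record) + "\n")
```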
If you are parsing from an ArcGIS FeatureServer, you can use the shared parser - find details in our documentation on ArcGIS runners.
## Normalize

| purpose | inputs | outputs |
|---|---|---|
| transform parsed `.ndjson` into our normalized location schema | path to output directory, path to input directory | normalized `.ndjson` files, written to output directory |
Most fields of the schema are optional - fill out as much as you can! You can read more about the normalized schema here.
Check out `ak/arcgis`'s `normalize.py` for a sample normalizer.
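For orientation, a skeletal normalize stage might be shaped like this sketch; the source and normalized field names are placeholders, not the real schema, so consult the schema documentation linked above:

```python
#!/usr/bin/env python3
# Sketch of a normalize stage. The normalized field names below ("name",
# "address") are illustrative assumptions - consult the linked schema
# documentation for the real field set.
import json
import pathlib
import sys

output_dir = pathlib.Path(sys.argv[1])
input_dir = pathlib.Path(sys.argv[2])

with (output_dir / "sf_gov.normalized.ndjson").open("w") as out:
    for line in (input_dir / "sf_gov.parsed.ndjson").read_text().splitlines():
        raw = json.loads(line)
        # Fill out as many optional schema fields as the source provides.
        normalized = {
            "name": raw.get("site_name"),          # hypothetical source field
            "address": raw.get("street_address"),  # hypothetical source field
        }
        out.write(json.dumps(normalized) + "\n")
```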
If your source data includes vaccine types (see `schema.Vaccine`), please normalize them to `pfizer_biontech`, `moderna`, `johnson_johnson_janssen`, or `oxford_astrazeneca`. It is okay to leave this field blank if your source data does not have vaccine type information or if it has an ambiguous value like `multiple` or `fpp` ("federal pharmacy partnership").
## Match & load

| purpose | inputs | outputs |
|---|---|---|
| match normalized `.ndjson` rows with known vaccination locations in our database, or create new locations as needed | n/a | n/a |
This stage interacts with VIAL - our production Vaccine Information Archive and Library. For each row in the normalized `.ndjson`, match & load attempts to determine whether the location is already known. If it is, the new information is linked to the known location so VIAL can serve up-to-date data. If it is not, a new location is created.
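Conceptually, the loop looks something like the following sketch (this is not VIAL's actual API; the in-memory store and match key are hypothetical illustrations of the match-or-create decision):

```python
# Conceptual sketch of the match & load decision - not VIAL's actual API.
known_locations = {}  # hypothetical store of already-known locations

def match_and_load(normalized_rows):
    for row in normalized_rows:
        key = (row.get("name"), row.get("address"))  # hypothetical match key
        if key in known_locations:
            # Known location: link the new information to it so VIAL can
            # serve up-to-date data.
            known_locations[key].append(row)
        else:
            # Unknown location: create a new record.
            known_locations[key] = [row]
```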
If you are a team member looking for more information about developing on this stage, see the readme.
- Find an issue you'd like to help out with
- Set up a development environment
- Read up on how our pipeline works
- Run the pipeline locally
- Check out some pre-built tools to help fetch or parse common data types