oeps-backend

This project builds from the Opioid Environment Policy Scan (OEPS) data warehouse stored in github.com/GeoDaCenter/opioid-policy-scan, and published on Zenodo at doi.org/10.5281/zenodo.5842465. This repo allows us to pull the final data from the latest OEPS release and push it into Google BigQuery, which will enable new ways of accessing and analyzing that data.

The backend includes commands for managing data transformation to and from different destinations. These commands are organized thematically into the groups documented below.

Getting Started

Install the Python Package

  1. Create and activate a Python virtual environment with venv, Conda, or your other tool of choice.

  2. Clone this repo

     git clone https://github.com/healthyregions/oeps
     cd oeps/backend
    
  3. Install this package and its dependencies

     pip install -e .
    
  4. You can now run any scripts with

     python ./scripts/script_name.py
    

JCOIN

section in progress

Google BigQuery

Setup BigQuery Credentials

To use the BigQuery script (scripts/bq.py) you will need to add access credentials as environment variables using the .env file. This file must always stay out of version control.

  • Make a copy of .env.example and name it .env. Any variables defined in this file will now be available via os.getenv('VARIABLE_NAME')

  • Obtain a set of JSON credentials for the project, and store the file anywhere on your computer (but outside of this repository).

  • In your .env file, update the BQ_CREDENTIALS_FILE_PATH variable with the full path to this file. For example:

    BQ_CREDENTIALS_FILE_PATH="/home/my_username/bq_credentials/oeps-391119-5783b2b59b83.json"
    
  • It is also possible to set BigQuery credentials without storing a local JSON file. More info on this is in the .env.example file.

Once you have set up these credentials, you can test them with

python ./scripts/bq.py check-credentials

If all is well, the command will print ok.
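Behind the scenes, a check like this only needs to build a BigQuery client from the credentials and make a single authenticated request. The following is a minimal sketch of that idea, assuming the google-cloud-bigquery and python-dotenv packages; the actual logic in scripts/bq.py may differ.

import os
from dotenv import load_dotenv
from google.cloud import bigquery
from google.oauth2 import service_account

# Read BQ_CREDENTIALS_FILE_PATH from the .env file described above
load_dotenv()
credentials = service_account.Credentials.from_service_account_file(
    os.getenv("BQ_CREDENTIALS_FILE_PATH")
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# Listing datasets forces an authenticated round trip, so bad credentials fail here
list(client.list_datasets())
print("ok")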

BigQuery Import/Export

The bq.py script can perform import and export operations on our BigQuery database.

Importing Data

Use the following command to load a new table into BigQuery:

python ./scripts/bq.py load --source ./resources

Where --source is the path to a Data Resource schema (stored as a JSON file), or a directory containing multiple Data Resource schemas. Optional flags on this command are:

  • --table-only will create the BigQuery dataset and table based on the schema, but will not attempt to load data into it.
  • --dry-run will validate the input dataset against the schema, but not attempt to load it.
  • --overwrite will drop and recreate the BigQuery table if it already exists in the dataset.
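For example, a hypothetical invocation that validates a single schema (the file name here is only illustrative) without loading any data might look like:

python ./scripts/bq.py load --source ./resources/S_2010.json --dry-run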

Exporting Data

Use the following command to query the OEPS BigQuery tables:

python scripts/bq.py export --sql sql/states.sql --output states.shp

Here, sql/states.sql is an example of a file that holds the SQL query to run against one or more tables. In the SQL, PROJECT_ID is a placeholder (it will be replaced with the actual project identifier before the query is performed), so that table references look like PROJECT_ID.dataset_name.table_name, for example PROJECT_ID.spatial.states2018 for the table that holds state boundaries.

  • --sql-file is the path to a file whose contents are a complete SQL query.
  • --output is the name of a file to which the query results will be written. Either .csv or .shp files can be specified, and if a spatial result is written to CSV the geometries will be in WKT format. If this argument is omitted, the query results will be printed to the console (helpful for testing queries).

You can write your own SQL into a file and use the same command to perform your query and export the results.
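For example, a hypothetical query file (reusing the table and column names from the table definition example further down) could contain:

SELECT HEROP_ID, TotPop
FROM `PROJECT_ID.tabular.S_2010`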

Use the BQ-Reference page for quick access to all table and column names.

Table Definitions

A table definition is a JSON file that specifies

  1. The location of a source dataset to load
  2. The Google BigQuery project and table name to load into
  3. A thorough schema defining all fields from the source dataset and how they will be stored in BigQuery

This information is used in various contexts to

  1. Create (or re-create) table schemas in BigQuery
  2. Load data into these tables
  3. Export data from BigQuery into various formats and file types

The structure of a table definition is inspired by the Table Schema specification published by Frictionless Standards, with a few additions for our own use case.

Structure

The top-level properties of a table definition are:

| Property | Format | Description |
|----------|--------|-------------|
| path | String | Path or URL for CSV or SHP dataset to load |
| bq_table_name | String | Target table in BigQuery |
| bq_dataset_name | String | Target dataset in BigQuery |
| fields | List | List of definitions for all table fields |

Note that in BigQuery, a dataset is akin to a database in other RDBMS implementations, such that a dataset holds one or more tables. Often, tables are identified by their fully-qualified identifier: project_id.dataset_name.table_name.

The fields property is a list of one or more field objects, as described below. The only requirement of the fields list is that it must contain an entry for a HEROP_ID field. This is our unique GIS join field.

Field Descriptors

Fields are defined by a JSON object that adheres to the field descriptors portion of the Table Schema standard, though not all possible attributes are required or implemented.

| Property | Format | Description | OEPS Use |
|----------|--------|-------------|----------|
| name | String | Canonical name for this column (used in BigQuery) | Required |
| title | String | A nicer human readable label or title for the field | Required |
| type | String | A string specifying the type. See types. | Required |
| format | String | A string specifying a format | Not Implemented |
| example | String | An example value for the field | Optional |
| description | String | A description for the field | Optional |
| constraints | JSON | A constraints descriptor | Not Implemented |

The following additional attributes are also supported and in some cases required:

| Property | Format | Description | OEPS Use |
|----------|--------|-------------|----------|
| src_name | String | Name of column in source dataset | Required |
| bq_data_type | String | Field type for BigQuery schema | Required |
| theme | String | One of Social, Environment, Economic, Policy, Outcome, or Geography. See OEPS docs. | Optional |
| comment | String | Additional information about the data in this field | Optional |
| source | String | Source of the data in this field | Optional |
| max_length | Integer | Max length of field (used in BigQuery schema) | Optional |

HEROP_ID Field

For now, please see this comment for a detailed description of how HEROP_ID values are constructed.

A field descriptor for this field will look something like this:

{
    "name": "HEROP_ID",
    "src_name": "HEROP_ID",
    "type": "string",
    "example": "040US01-2018",
    "description": "A derived unique id corresponding to the relevant geographic unit.",
    "theme": "Geography",
    "bq_data_type": "STRING"
}

Geometry Fields

To load a shapefile you must include the following field descriptor in your fields list:

{
    "name": "geom",
    "title": "Geom",
    "type": "string",
    "src_name": "geometry",
    "bq_data_type": "GEOGRAPHY"
}

Note that for spatial data Table Schema only allows geojson or geopoint as valid geographic types, while Google BigQuery uses GEOGRAPHY. For now, we'll just use the above configuration, and more nuances can be pursued down the road.
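As a rough illustration of how shapefile geometries can end up in a GEOGRAPHY column, the sketch below serializes them to WKT strings before loading. It assumes geopandas and is not necessarily how scripts/bq.py handles it.

import geopandas as gpd

# Read the shapefile and reproject to WGS84, which BigQuery GEOGRAPHY expects
gdf = gpd.read_file("states.shp").to_crs(epsg=4326)

# Serialize each geometry to WKT; these strings can be loaded into the geom column
gdf["geom"] = gdf.geometry.to_wkt()
rows = gdf.drop(columns="geometry").to_dict("records")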

Example

The following is a truncated version of a table definition for the 2010 State-level data published in OEPS v2.0. This defines a table project_id.tabular.S_2010 with two fields, HEROP_ID and TotPop. Note also the direct URL to the raw path on GitHub.

{
    "bq_dataset_name": "tabular",
    "bq_table_name": "S_2010",
    "path": "https://raw.githubusercontent.com/GeoDaCenter/opioid-policy-scan/main/data_final/full_tables/S_2010.csv",
    "fields": [
        {
            "name": "HEROP_ID",
            "src_name": "HEROP_ID",
            "type": "string",
            "description": "A derived unique id corresponding to the relevant geographic unit.",
            "constraints": null,
            "theme": "Geography",
            "bq_data_type": "STRING"
        },
        {
            "name": "TotPop",
            "src_name": "TotPop",
            "type": "integer",
            "example": "7294336",
            "description": "Estimated total population",
            "constraints": null,
            "theme": "Social",
            "source": "American Community Survey 2014-2018 5 Year Estimates; 2010 Decennial Census; Integrated Public Use Microdata Service National Historic Geographic Information Systems",
            "comments": "1980, 1990, and 2000 data from respective decennial censuses downloaded from IPUMS NHGIS and aggregated upwards.",
            "bq_data_type": "INTEGER"
        }
    ]
}
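To make the connection to BigQuery concrete, here is a minimal sketch of how a definition like this could be turned into a table schema, assuming the google-cloud-bigquery package and a hypothetical local copy of the JSON file; the real loading logic lives in scripts/bq.py and may differ.

import json
from google.cloud import bigquery

# Hypothetical local path to the table definition shown above
with open("resources/S_2010.json") as f:
    definition = json.load(f)

# Map each field descriptor to a BigQuery schema field via its bq_data_type
schema = [
    bigquery.SchemaField(field["name"], field["bq_data_type"])
    for field in definition["fields"]
]

client = bigquery.Client()  # uses the credentials configured earlier
table_id = f"{client.project}.{definition['bq_dataset_name']}.{definition['bq_table_name']}"
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)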

US Census Data

A processing pipeline pulls cartographic boundary files from the Census FTP site and does the following:

  • Merges all state subsets into a single nation-wide dataset
  • Adds a couple of useful fields
  • Exports to GeoJSON, Shapefile, and PMTiles formats
  • Uploads to our S3 bucket for remote access

Only the 500k (1:500,000) resolution cartographic boundary files are used.

A comprehensive example of the command is:

flask census get-geodata shp pmtiles geojson -g tract -y 2010 --upload --no-cache --verbose --tippecanoe-path /opt/tippecanoe/tippecanoe

  • census is the command group
  • get-geodata is this particular operation
  • shp indicates that shapefiles will be generated. Other valid formats are geojson and pmtiles
  • -g tract (multiple allowed) specifies that tract geographies should be processed. Other geographies are:
    • state
    • county
    • zcta (zip code tabulation area)
    • bg (block group)
    • place (place geographies: cities, towns, villages)
  • -y 2010 indicates that 2010 files should be used. 2018 is also supported.
  • --upload (optional) will upload to S3 (credentials and bucket name are set elsewhere)
  • --no-cache (optional) will force re-download of the source files from the FTP
  • --verbose (optional) extra print statements during the process
  • --tippecanoe-path (required for pmtiles output) provide a full path to a local tippecanoe binary, used to generate PMTiles
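At the other extreme, a minimal (hypothetical) invocation that only generates a nationwide state-level GeoJSON for 2018, without uploading or tile generation, would be:

flask census get-geodata geojson -g state -y 2018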

Overture POIs

section in progress