Start Airtable automation #366
Conversation
… need to adjust the offshore wind extraction code to work with json
- Still need to create logic for updating the settings file
- Need to figure out how to better handle the Airtable dtype and missing-column issue: the Airtable API does not return a column in records if there aren't any values, which is problematic if an entire column is missing values.
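One way to work around the missing-column behavior described in the TODO above: reindex every record against the full expected column list so omitted fields come back as `None`. This is a sketch, not the PR's actual approach, and the column names are hypothetical.

```python
# Columns we expect every Airtable record to have (hypothetical schema).
EXPECTED_COLUMNS = ["name", "capacity_mw", "county"]

def normalize_record(fields: dict) -> dict:
    """Return a record with every expected column present, None if absent."""
    return {col: fields.get(col) for col in EXPECTED_COLUMNS}

# Airtable omits fields with no value, so records can have different keys.
records = [
    {"name": "Vineyard Wind", "capacity_mw": 800},  # "county" omitted by the API
    {"name": "Empire Wind"},                        # two fields omitted
]
rows = [normalize_record(r) for r in records]
```

After normalization every row has the same keys, so downstream dtype handling sees a consistent set of columns even when an entire column is empty.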
…mate-partners/deployment-gap-model into init-airtable-automation
There are some small-to-medium changes that you can take or leave - nothing so 😱 that I need to request changes! But if you do make some of those changes, I don't think they necessarily need a re-review.
env:
  API_KEY_GOOGLE_MAPS: ${{ secrets.API_KEY_GOOGLE_MAPS }}
  GITHUB_REF: ${{ github.ref_name }} # This is changed to dev if running on a schedule
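The "changed to dev if running on a schedule" behavior could be expressed directly in the workflow expression, roughly like this (a sketch of one way to do it; the PR may implement the override elsewhere):

```yaml
env:
  # On scheduled runs there is no meaningful branch ref, so fall back to dev.
  GITHUB_REF: ${{ github.event_name == 'schedule' && 'dev' || github.ref_name }}
```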
We'll get to remove this logic if we end up removing dev - but for now this seems OK.
- name: Display env variables
  run: |
    echo "Workspace directory: $GITHUB_WORKSPACE"
    echo "Google credentials path: $GOOGLE_GHA_CREDS_PATH"
non-blocking: Haven't set up WIF yet, yeah? We're just using an access key?
Not yet :/ I'll add it to the list of infra improvements.
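For the infra list: switching from an access key to WIF typically means adding a `google-github-actions/auth` step to the workflow, along the lines of the sketch below (the pool, provider, and service-account values are placeholders, not this project's actual infrastructure):

```yaml
- uses: google-github-actions/auth@v2
  with:
    # Placeholder resource names; fill in with the project's WIF pool/provider.
    workload_identity_provider: projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/providers/PROVIDER_ID
    service_account: deploy-sa@PROJECT_ID.iam.gserviceaccount.com
```

This replaces the long-lived key with short-lived credentials issued per workflow run.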
    "gs://dgm-archive/synapse/offshore_wind/offshore_wind_locations_2024-09-10.csv"
)
# get the latest version of the offshore wind data from the candidate yaml file
projects_uri = "airtable/Offshore Wind Locations Synapse Version/Projects.json"
nit: if these get immediately fed into es and overwritten, should we just inline these values on lines 127 and 128?
dataset_ref = client.dataset(dataset_id)
# get all parquet files in the bucket/{version} directory
blobs = output_bucket.list_blobs(prefix=f"{version}/{destination_blob_prefix}")
non-blocking: You could avoid the if statement on line 87 if you add a match_glob="*.parquet" here, I think.
for blob in blobs:
    if blob.name.endswith(".parquet"):
        # get the blob filename without the extension
        table_name = blob.name.split("/")[-1].split(".")[0]
non-blocking: could make it a `Path` then use `Path.stem` here.
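The `Path.stem` suggestion, sketched on a blob name like the ones in the loop above (the blob name is made up for illustration):

```python
from pathlib import Path

# Path.stem returns the final path component without its extension,
# replacing the manual split("/")[-1].split(".")[0] chain.
blob_name = "v2024.03.05/data_mart/counties.parquet"  # hypothetical blob name
table_name = Path(blob_name).stem
```

One behavioral note: `Path.stem` strips only the final suffix, so a filename containing extra dots keeps them, whereas `split(".")[0]` would truncate at the first dot.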
This PR implements Phase I of the Airtable automation project. Specifically, this PR:

- Adds `ExtractionSettings`, which acts as the interface between the archives and the ETL (kind of like the `Datastore` in PUDL). It loads archive version numbers from a settings file and can grab the latest version numbers from GCS.
- Previously, the `run-full-build.yml` action ran the ETL when new commits were pushed to a branch and would update GCS and BQ if it was started on `main` or `dev`. I decided to make this action just be CI and create a new action responsible for CD.
- Previously, the `run-full-build.yml` action ran on `dev` or when a tag was pushed. Each BQ table has a `version` label which points to the name of the directory in GCS (`dev`, `v2024.03.05`...). Now that we'll have new outputs every day, I decided to name each new output with a UUID and use that for the BQ table version number. The `OutputMetadata` class collects metadata about the outputs, like the git sha of the code, date created, etc., and saves it as a yaml file in the output directory on GCS.
- `dbcp.cli.py` script
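The UUID-plus-metadata versioning described above can be sketched roughly as follows. The field names are assumptions, not the actual `OutputMetadata` schema, and the sketch serializes to JSON for self-containment where the PR writes YAML:

```python
import datetime
import json
import uuid

# Generate a unique version id for this ETL run; per the PR description it
# doubles as the GCS output directory name and the BQ "version" label.
version = str(uuid.uuid4())

# Collect run metadata (hypothetical fields) to save alongside the outputs.
metadata = {
    "version": version,
    "git_sha": "abc1234",  # placeholder; a real run would read this from git
    "date_created": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}

# In the PR this is written as a yaml file in the versioned GCS directory;
# here we just serialize it to a string.
metadata_blob = json.dumps(metadata, indent=2)
```

Using an opaque UUID instead of a date-based tag avoids collisions when multiple outputs are produced on the same day.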