
Start Airtable automation #366

Merged
merged 42 commits into dev from init-airtable-automation
Nov 26, 2024

Conversation

@bendnorman (Contributor) commented Sep 14, 2024

This PR implements Phase I of the Airtable automation project.

Specifically, this PR:

  • Creates archiver classes and a CLI command for archiving Airtable data, with the potential to handle all of our archiving. Right now archiving is a mix of storing files in GitHub and updating files in GCS, both manually and via GitHub Actions.
  • Creates a class called ExtractionSettings which acts as the interface between the archives and the ETL (kind of like the Datastore in PUDL). It loads archive version numbers from a settings file and can grab the latest version numbers from GCS (see the first sketch after this list).
  • Creates a GitHub Action that archives Airtable data, runs the ETL and tests with the new data, saves and commits the new archive version numbers to the branch if the ETL succeeds, then publishes the outputs to GCS and BigQuery.
  • Previously the run-full-build.yml action ran the ETL whenever new commits were pushed to a branch, and updated GCS and BQ if it was started on main or dev. I decided to make this action purely CI and create a new action responsible for CD.
  • Previously, we stored a directory of parquet files in GCS every time the run-full-build.yml action ran on dev or a tag was pushed. Each BQ table has a version label which points to the name of the directory in GCS (dev, v2024.03.05, ...). Now that we'll have new outputs every day, I decided to name each new output with a UUID and use that as the BQ table version label. The OutputMetadata class collects metadata about the outputs, like the git SHA of the code and the date created, and saves it as a yaml file in the output directory on GCS (see the second sketch after this list).
  • I consolidated some of our scripts by adding subcommands to our dbcp.cli.py script.
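
A minimal sketch of what the ExtractionSettings interface could look like, assuming archive versions are pinned to GCS object generation numbers; the class, method, and field names here are illustrative guesses, not the PR's actual code:

```python
# Hypothetical sketch only -- names and structure are assumptions, not the
# code in this PR. Assumes archive versions are GCS object generation numbers.
from pathlib import Path

import yaml
from google.cloud import storage


class ExtractionSettings:
    """Interface between the archives and the ETL (like PUDL's Datastore)."""

    def __init__(self, archive_generation_numbers: dict[str, int]):
        # maps an archive path like "airtable/.../Projects.json" to a version
        self.archive_generation_numbers = archive_generation_numbers

    @classmethod
    def from_yaml(cls, path: Path) -> "ExtractionSettings":
        """Load pinned archive version numbers from a settings file."""
        return cls(yaml.safe_load(path.read_text()))

    def update_archive_generation_numbers(self, bucket_name: str) -> None:
        """Grab the latest version number of each archive from GCS."""
        bucket = storage.Client().bucket(bucket_name)
        for uri in self.archive_generation_numbers:
            blob = bucket.get_blob(uri)  # fetches current-generation metadata
            self.archive_generation_numbers[uri] = blob.generation
```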

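And a sketch of the OutputMetadata idea, assuming a dataclass serialized to yaml; again, the field and method names are illustrative, not the PR's actual implementation:

```python
# Hypothetical sketch only -- the real OutputMetadata class may differ.
import subprocess
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

import yaml
from google.cloud import storage


@dataclass
class OutputMetadata:
    version: str       # UUID naming the GCS output directory and BQ version label
    git_sha: str       # commit the outputs were built from
    date_created: str  # ISO 8601 timestamp

    @classmethod
    def collect(cls) -> "OutputMetadata":
        """Gather metadata about the current run."""
        sha = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        ).stdout.strip()
        return cls(
            version=str(uuid.uuid4()),
            git_sha=sha,
            date_created=datetime.now(timezone.utc).isoformat(),
        )

    def write_to_gcs(self, bucket_name: str) -> None:
        """Save the metadata as a yaml file in the output directory on GCS."""
        bucket = storage.Client().bucket(bucket_name)
        blob = bucket.blob(f"{self.version}/metadata.yaml")
        blob.upload_from_string(yaml.safe_dump(asdict(self)))
```
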
@bendnorman bendnorman changed the base branch from main to dev September 14, 2024 01:39
@bendnorman bendnorman marked this pull request as ready for review September 25, 2024 22:17
@bendnorman bendnorman linked an issue (15 tasks) Sep 27, 2024 that may be closed by this pull request
@bendnorman bendnorman added the github_actions, automation, and airtable labels Sep 27, 2024
@bendnorman bendnorman self-assigned this Sep 27, 2024
@jdangerx (Collaborator) left a comment


There are some small-to-medium changes that you can take or leave - nothing so 😱 that I need to request changes! But if you make some of those changes, I don't think they necessarily need a re-review.

.github/workflows/update-data.yml (outdated comment, resolved)

env:
  API_KEY_GOOGLE_MAPS: ${{ secrets.API_KEY_GOOGLE_MAPS }}
  GITHUB_REF: ${{ github.ref_name }} # This is changed to dev if running on a schedule
Collaborator (@jdangerx):

We'll get to remove this logic if we end up removing dev - but for now this seems OK.

- name: Display env variables
  run: |
    echo "Workspace directory: $GITHUB_WORKSPACE"
    echo "Google credentials path: $GOOGLE_GHA_CREDS_PATH"
Collaborator (@jdangerx):

non-blocking: Haven't set up WIF (Workload Identity Federation) yet, yeah? We're just using an access key?

Author (@bendnorman):

Not yet :/ I'll add it to the list of infra improvements.

.github/workflows/update-data.yml (comment resolved)
src/dbcp/archivers/airtable.py (two comments resolved)
"gs://dgm-archive/synapse/offshore_wind/offshore_wind_locations_2024-09-10.csv"
)
# get the latest version of the offshore wind data from the candidate yaml file
projects_uri = "airtable/Offshore Wind Locations Synapse Version/Projects.json"
Collaborator (@jdangerx):

nit: if these get immediately fed into es and overwritten, should we just inline these values on lines 127 and 128?

src/dbcp/extract/fips_tables.py (comment resolved)
dataset_ref = client.dataset(dataset_id)

# get all parquet files in the bucket/{version} directory
blobs = output_bucket.list_blobs(prefix=f"{version}/{destination_blob_prefix}")
Collaborator (@jdangerx):

non-blocking: You could avoid the if statement on line 87 if you add a match_glob="*.parquet" here, I think.

for blob in blobs:
    if blob.name.endswith(".parquet"):
        # get the blob filename without the extension
        table_name = blob.name.split("/")[-1].split(".")[0]
Collaborator (@jdangerx):

non-blocking: could make it a Path then use Path.stem here.
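
For illustration, here is how the two suggestions above could land together; this is a sketch, not the PR's final code, and assumes google-cloud-storage >= 2.5 for the match_glob parameter. output_bucket, version, and destination_blob_prefix are the variables from the surrounding snippet:

```python
from pathlib import Path

# match_glob filters server-side against the full object name, so the
# endswith(".parquet") if statement is no longer needed.
blobs = output_bucket.list_blobs(
    prefix=f"{version}/{destination_blob_prefix}",
    match_glob="**/*.parquet",
)
for blob in blobs:
    # Path.stem drops the directory and the .parquet extension in one step
    table_name = Path(blob.name).stem
```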

@bendnorman bendnorman merged commit db1d144 into dev Nov 26, 2024
1 check passed
@bendnorman bendnorman deleted the init-airtable-automation branch November 26, 2024 22:23
Successfully merging this pull request may close this issue:

Create automation to update airtable data in c3 repo