
Start Airtable automation #366

Merged
merged 42 commits into dev from init-airtable-automation
Nov 26, 2024

Conversation

@bendnorman (Contributor) commented Sep 14, 2024

This PR implements Phase I of the Airtable automation project.

Specifically, this PR:

  • Creates archiver classes and a CLI command for archiving Airtable data, with the potential to handle all of our archiving. Right now archiving is a mix of storing files in GitHub and updating files in GCS, both manually and via GitHub Actions.
  • Creates a class called ExtractionSettings which acts as the interface between the archives and the ETL (kind of like the Datastore in PUDL). It loads archive version numbers from a settings file and can grab the latest version numbers from GCS (see the first sketch after this list).
  • Creates a GitHub Action that archives Airtable data, runs the ETL and tests with the new data, saves and commits the new archive version numbers to the branch if the ETL succeeds, then publishes the outputs to GCS and BigQuery.
  • Previously the run-full-build.yml action ran the ETL whenever new commits were pushed to a branch, and updated GCS and BQ if it was started on main or dev. I decided to make this action purely CI and create a new action responsible for CD.
  • Previously, we stored a directory of parquet files in GCS every time the run-full-build.yml action ran on dev or a tag was pushed. Each BQ table has a version label which points to the name of the directory in GCS (dev, v2024.03.05, ...). Now that we'll have new outputs every day, I decided to name each new output with a UUID and use that as the BQ table version label. The OutputMetadata class collects metadata about the outputs, like the git SHA of the code and the date created, and saves it as a yaml file in the output directory on GCS (see the second sketch after this list).
  • I consolidated some of our scripts by adding subcommands to our dbcp.cli.py script.
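
A minimal sketch of what the ExtractionSettings interface could look like, assuming archive versions are pinned to GCS object generation numbers; the class, method, and field names here are illustrative guesses, not the PR's actual code:

```python
# Hypothetical sketch only -- names and structure are assumptions, not the
# code in this PR. Assumes archive versions are GCS object generation numbers.
from pathlib import Path

import yaml
from google.cloud import storage


class ExtractionSettings:
    """Interface between the archives and the ETL (like PUDL's Datastore)."""

    def __init__(self, archive_generation_numbers: dict[str, int]):
        # maps an archive path like "airtable/.../Projects.json" to a version
        self.archive_generation_numbers = archive_generation_numbers

    @classmethod
    def from_yaml(cls, path: Path) -> "ExtractionSettings":
        """Load pinned archive version numbers from a settings file."""
        return cls(yaml.safe_load(path.read_text()))

    def update_archive_generation_numbers(self, bucket_name: str) -> None:
        """Grab the latest version number of each archive from GCS."""
        bucket = storage.Client().bucket(bucket_name)
        for uri in self.archive_generation_numbers:
            blob = bucket.get_blob(uri)  # fetches current-generation metadata
            self.archive_generation_numbers[uri] = blob.generation
```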

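And a sketch of the OutputMetadata idea, assuming a dataclass serialized to yaml; again, the field and method names are illustrative, not the PR's actual implementation:

```python
# Hypothetical sketch only -- the real OutputMetadata class may differ.
import subprocess
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

import yaml
from google.cloud import storage


@dataclass
class OutputMetadata:
    version: str       # UUID naming the GCS output directory and BQ version label
    git_sha: str       # commit the outputs were built from
    date_created: str  # ISO 8601 timestamp

    @classmethod
    def collect(cls) -> "OutputMetadata":
        """Gather metadata about the current run."""
        sha = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        ).stdout.strip()
        return cls(
            version=str(uuid.uuid4()),
            git_sha=sha,
            date_created=datetime.now(timezone.utc).isoformat(),
        )

    def write_to_gcs(self, bucket_name: str) -> None:
        """Save the metadata as a yaml file in the output directory on GCS."""
        bucket = storage.Client().bucket(bucket_name)
        blob = bucket.blob(f"{self.version}/metadata.yaml")
        blob.upload_from_string(yaml.safe_dump(asdict(self)))
```
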
@bendnorman bendnorman changed the base branch from main to dev September 14, 2024 01:39
@bendnorman bendnorman marked this pull request as ready for review September 25, 2024 22:17
@bendnorman bendnorman linked an issue (15 tasks) Sep 27, 2024 that may be closed by this pull request
@bendnorman bendnorman added the github_actions, automation, and airtable labels Sep 27, 2024
@bendnorman bendnorman self-assigned this Sep 27, 2024
@jdangerx (Collaborator) left a comment


There are some small-to-medium changes that you can take or leave - nothing so 😱 that I need to request changes! But if you make some of those changes, I don't think they necessarily need a re-review.

.github/workflows/update-data.yml (outdated comment, resolved)

env:
  API_KEY_GOOGLE_MAPS: ${{ secrets.API_KEY_GOOGLE_MAPS }}
  GITHUB_REF: ${{ github.ref_name }} # This is changed to dev if running on a schedule
Collaborator (@jdangerx):

We'll get to remove this logic if we end up removing dev - but for now this seems OK.

- name: Display env variables
  run: |
    echo "Workspace directory: $GITHUB_WORKSPACE"
    echo "Google credentials path: $GOOGLE_GHA_CREDS_PATH"
Collaborator (@jdangerx):

non-blocking: Haven't set up WIF (Workload Identity Federation) yet, yeah? We're just using an access key?

Author (@bendnorman):

Not yet :/ I'll add it to the list of infra improvements.

.github/workflows/update-data.yml (comment resolved)
src/dbcp/archivers/airtable.py (two comments resolved)
"gs://dgm-archive/synapse/offshore_wind/offshore_wind_locations_2024-09-10.csv"
)
# get the latest version of the offshore wind data from the candidate yaml file
projects_uri = "airtable/Offshore Wind Locations Synapse Version/Projects.json"
Collaborator (@jdangerx):

nit: if these get immediately fed into es and overwritten, should we just inline these values on lines 127 and 128?

src/dbcp/extract/fips_tables.py (comment resolved)
dataset_ref = client.dataset(dataset_id)

# get all parquet files in the bucket/{version} directory
blobs = output_bucket.list_blobs(prefix=f"{version}/{destination_blob_prefix}")
Collaborator (@jdangerx):

non-blocking: You could avoid the if statement on line 87 if you add a match_glob="*.parquet" here, I think.

for blob in blobs:
    if blob.name.endswith(".parquet"):
        # get the blob filename without the extension
        table_name = blob.name.split("/")[-1].split(".")[0]
Collaborator (@jdangerx):

non-blocking: could make it a Path then use Path.stem here.
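
For illustration, here is how the two suggestions above could land together; this is a sketch, not the PR's final code, and assumes google-cloud-storage >= 2.5 for the match_glob parameter. output_bucket, version, and destination_blob_prefix are the variables from the surrounding snippet:

```python
from pathlib import Path

# match_glob filters server-side against the full object name, so the
# endswith(".parquet") if statement is no longer needed.
blobs = output_bucket.list_blobs(
    prefix=f"{version}/{destination_blob_prefix}",
    match_glob="**/*.parquet",
)
for blob in blobs:
    # Path.stem drops the directory and the .parquet extension in one step
    table_name = Path(blob.name).stem
```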

@bendnorman bendnorman merged commit db1d144 into dev Nov 26, 2024
1 check passed
@bendnorman bendnorman deleted the init-airtable-automation branch November 26, 2024 22:23
Successfully merging this pull request may close this issue:

Create automation to update airtable data in c3 repo