dev -> main 12-16-2024 #391

Merged
merged 11 commits into main from dev on Dec 16, 2024
116 changes: 96 additions & 20 deletions .github/workflows/update-data.yml
@@ -3,33 +3,107 @@ name: update-data

on:
push:
tags:
- "v20*"
branches:
- "dev"
- "main"
workflow_dispatch:
# Temporarily disable scheduled runs because geocoding without a cache every time is expensive
# schedule:
# - cron: 5 7 * * 1-5

jobs:
build:
archive:
if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
# This will check out the main branch on "schedule" and the specified branch on "workflow_dispatch"
# The archiver should probably be pulled out into its own repo so archive code and data don't diverge

- name: Who owns the workspace?
run: ls -ld $GITHUB_WORKSPACE

- uses: "google-github-actions/auth@v2"
with:
credentials_json: "${{ secrets.DGM_GITHUB_ACTION_CREDENTIALS }}"

- name: Display env variables
run: |
echo "Workspace directory: $GITHUB_WORKSPACE" \
echo "Google credentials path: $GOOGLE_GHA_CREDS_PATH" \

# Give the dbcp user ownership of the workspace
# So it can read and write files to the workspace
- name: Give the dbcp user ownership of the workspace
run: sudo chown -R 1000:1000 $GITHUB_WORKSPACE

- name: Set up Docker Compose
run: |
sudo apt-get update
sudo apt-get install -y docker-compose

- name: Build and run Docker Compose services
run: |
docker-compose up -d

- name: Run the archive
env:
AIRTABLE_API_KEY: ${{ secrets.AIRTABLE_API_KEY }}
run: |
make archive_all

# The google-github-actions/auth step is run as runner:docker,
# so we need to give the workspace back to runner:docker
- name: Give ownership of the workspace back to runner:docker
if: always()
run: sudo chown -R runner:docker $GITHUB_WORKSPACE

- name: Who owns the workspace?
if: always()
run: ls -ld $GITHUB_WORKSPACE

matrix_prep:
needs: archive # Ensure archive job finishes first
# Only run if the archive job is successful or is skipped
# I had to add always() because the matrix_prep job wouldn't run if the archive job was skipped
# I think this happens because archive is skipped on push, but matrix_prep is not
if: ${{ always() && (needs.archive.result == 'success' || needs.archive.result == 'skipped') }}
runs-on: ubuntu-latest
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
steps:
- name: Set branch dynamically
id: set-matrix
run: |
if [ "${{ github.event_name }}" == "push" ]; then
echo "matrix={\"include\":[{\"branch\":\"${{ github.ref_name }}\"}]}" >> $GITHUB_OUTPUT
else
echo "matrix={\"include\":[{\"branch\":\"main\"},{\"branch\":\"dev\"}]}" >> $GITHUB_OUTPUT
fi

- name: echo matrix
run: echo ${{ steps.set-matrix.outputs.matrix }}

etl:
needs: matrix_prep # Ensure matrix_prep job finishes first
runs-on: ubuntu-latest
if: ${{ always() && needs.matrix_prep.result == 'success' }}
strategy:
matrix: ${{ fromJSON(needs.matrix_prep.outputs.matrix) }}
fail-fast: false
env:
API_KEY_GOOGLE_MAPS: ${{ secrets.API_KEY_GOOGLE_MAPS }}
GITHUB_REF: ${{ github.ref_name }} # This is changed to dev if running on a schedule

steps:
- name: Use dev branch if running on a schedule
if: ${{ (github.event_name == 'schedule') }}
run: |
echo "This action was triggered by a schedule." && echo "GITHUB_REF=dev" >> $GITHUB_ENV
- name: print matrix
run: echo ${{ matrix.branch }}

- name: Checkout Repository
id: checkout
uses: actions/checkout@v4
with:
ref: ${{ env.GITHUB_REF }}
ref: ${{ matrix.branch }}

- name: Who owns the workspace?
run: ls -ld $GITHUB_WORKSPACE
@@ -57,18 +131,13 @@ jobs:
run: |
docker-compose up -d

- name: Run the archive
env:
AIRTABLE_API_KEY: ${{ secrets.AIRTABLE_API_KEY }}
run: |
make archive_all

- name: Run full ETL
if: ${{ success() }}
run: |
make all

- name: Run all tests
if: ${{ success() }}
run: |
make test

@@ -100,14 +169,21 @@

# publish the outputs, grab the git sha of the commit step
- name: Publish outputs
if: (github.event_name == 'push' && startsWith(github.ref, 'refs/tags/')) || (github.ref_name == 'dev')
run: |
# Use the commit_settings_file hash if settings.yaml was updated
# If it wasn't updated that means there were no changes so use the
# commit hash from the checkout step
SETTINGS_FILE_SHA="${{ steps.commit_settings_file.outputs.commit_long_sha }}"
if [ -z "$SETTINGS_FILE_SHA" ]; then
SETTINGS_FILE_SHA="${{ steps.checkout.outputs.commit }}"
fi

docker compose run --rm app python dbcp/cli.py publish-outputs \
-bq \
--build-ref ${{ github.ref_name }} \
--code-git-sha ${{ github.sha }} \
--settings-file-git-sha ${{ steps.commit_settings_file.outputs.commit_long_sha }} \
--github-action-run-id ${{ github.run_id}}
--build-ref ${{ matrix.branch }} \
--code-git-sha ${{ steps.checkout.outputs.commit }} \
--settings-file-git-sha $SETTINGS_FILE_SHA \
--github-action-run-id ${{ github.run_id }}

- name: Stop Docker Compose services
if: always()
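
The set-matrix step in the workflow diff above emits the branch matrix as a single JSON string on $GITHUB_OUTPUT, which the etl job then expands with fromJSON. A minimal Python sketch of the JSON shape, assuming the same two cases the shell logic handles (the build_matrix helper is illustrative only, not part of the repo):

import json

def build_matrix(event_name: str, ref_name: str) -> str:
    """Mirror the set-matrix shell logic: one branch on push, both branches otherwise."""
    if event_name == "push":
        include = [{"branch": ref_name}]
    else:
        include = [{"branch": "main"}, {"branch": "dev"}]
    return json.dumps({"include": include})

# On a push to dev, only dev is built; on workflow_dispatch (or schedule) both branches are.
print(build_matrix("push", "dev"))               # {"include": [{"branch": "dev"}]}
print(build_matrix("workflow_dispatch", "n/a"))  # {"include": [{"branch": "main"}, {"branch": "dev"}]}
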
8 changes: 4 additions & 4 deletions src/dbcp/commands/publish.py
@@ -75,7 +75,7 @@ def load_parquet_files_to_bigquery(

# Get the BigQuery dataset
# the "production" bigquery datasets do not have a suffix
destination_suffix = "" if build_ref.startswith("v20") else f"_{build_ref}"
destination_suffix = "" if build_ref == "main" else f"_{build_ref}"
dataset_id = f"{destination_blob_prefix}{destination_suffix}"
dataset_ref = client.dataset(dataset_id)

@@ -136,12 +136,12 @@ class OutputMetadata(BaseModel):

@validator("git_ref")
def git_ref_must_be_dev_or_tag(cls, git_ref: str | None) -> str | None:
"""Validate that the git ref is either "dev" or a tag starting with "v20"."""
"""Validate that the git ref is either "dev" or "main"."""
if git_ref:
if (git_ref in ("dev", "sandbox")) or git_ref.startswith("v20"):
if git_ref in ("dev", "sandbox", "main"):
return git_ref
raise ValueError(
f'{git_ref} is not a valid Git rev. Must be "dev" or start with "v20"'
f'{git_ref} is not a valid Git rev. Must be "dev" or "main".'
)
return git_ref

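
The destination_suffix change in publish.py switches the "production" BigQuery dataset from tagged builds (v20*) to the main branch. A hedged sketch of the resulting naming rule, using "data_warehouse" as a placeholder prefix (not necessarily the prefix the repo actually passes in):

def dataset_id_for(build_ref: str, destination_blob_prefix: str = "data_warehouse") -> str:
    """Sketch of the naming rule: main writes to the unsuffixed dataset, other refs get a suffix."""
    suffix = "" if build_ref == "main" else f"_{build_ref}"
    return f"{destination_blob_prefix}{suffix}"

assert dataset_id_for("main") == "data_warehouse"
assert dataset_id_for("dev") == "data_warehouse_dev"
assert dataset_id_for("sandbox") == "data_warehouse_sandbox"
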
8 changes: 4 additions & 4 deletions src/dbcp/settings.yaml
@@ -1,12 +1,12 @@
- generation_num: 1733790060771958
- generation_num: 1734321671975051
metadata:
schema_generation_number: '1733790053549029'
schema_generation_number: '1734321670799123'
table_id: tblu7b4Yj58Iq2xCF
name: airtable/Offshore Wind Locations Synapse Version/Locations.json
pinned: false
- generation_num: 1733790059992574
- generation_num: 1734321671190442
metadata:
schema_generation_number: '1733790053549029'
schema_generation_number: '1734321670799123'
table_id: tblVU9JPtahIwLbny
name: airtable/Offshore Wind Locations Synapse Version/Projects.json
pinned: false
20 changes: 18 additions & 2 deletions src/dbcp/transform/helpers.py
@@ -342,13 +342,29 @@ def add_county_fips_with_backup_geocoding(

# geocode the lookup failures - they are often city/town names (instead of counties) or simply misspelled
nan_fips = with_fips.loc[fips_is_nan, :].copy()
geocoded = _geocode_locality(
nan_fips.loc[:, [state_col, locality_col]],

# Deduplicate on the state and locality columns to minimize API calls
key_cols = [state_col, locality_col]
deduped_nan_fips = nan_fips.loc[:, key_cols].drop_duplicates()
deduped_geocoded = _geocode_locality(
deduped_nan_fips,
# pass subset to _geocode_locality to maximize chance of a cache hit
# (this way other columns can change but caching still works)
state_col=state_col,
locality_col=locality_col,
)
# recombine deduped geocoded data with original nan_fips
geocoded_deduped_nan_fips = pd.concat(
[deduped_nan_fips[key_cols], deduped_geocoded], axis=1
)
index_name = nan_fips.index.name
index_name = index_name if index_name is not None else "index"
geocoded = (
nan_fips.reset_index()
.merge(geocoded_deduped_nan_fips, on=key_cols, how="left", validate="m:1")
.set_index(index_name)[deduped_geocoded.columns]
)

nan_fips = pd.concat([nan_fips, geocoded], axis=1)
# add fips using geocoded names
filled_fips = add_fips_ids(
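
The helpers.py change deduplicates the failed-lookup rows on (state, locality) before geocoding, then merges the results back onto every original row, so repeated localities cost one API call instead of many. A toy sketch of that dedupe-then-merge pattern, with a stub standing in for _geocode_locality and a placeholder output column name:

import pandas as pd

def fake_geocode(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for _geocode_locality; the real function calls the Google Maps API."""
    return pd.DataFrame(
        {"geocoded_locality_name": df["locality"].str.title()}, index=df.index
    )

nan_fips = pd.DataFrame(
    {"state": ["NY", "NY", "CA"], "locality": ["brooklyn", "brooklyn", "oakland"]}
)
key_cols = ["state", "locality"]

deduped = nan_fips[key_cols].drop_duplicates()                 # 2 unique pairs, not 3 rows
geocoded = pd.concat([deduped, fake_geocode(deduped)], axis=1)

# Merge the geocoded columns back onto every original row, preserving the original index.
result = (
    nan_fips.reset_index()
    .merge(geocoded, on=key_cols, how="left", validate="m:1")
    .set_index("index")
)
print(result)
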