Use geocodio instead of google maps #394

Merged: 13 commits, Feb 5, 2025
Changes from 5 commits
1 change: 1 addition & 0 deletions .github/workflows/test-full-build.yml
@@ -13,6 +13,7 @@ jobs:
runs-on: ubuntu-latest

env:
GEOCODIO_API_KEY: ${{ secrets.GEOCODIO_API_KEY }}
API_KEY_GOOGLE_MAPS: ${{ secrets.API_KEY_GOOGLE_MAPS }}

steps:
1 change: 1 addition & 0 deletions .github/workflows/update-data.yml
@@ -93,6 +93,7 @@ jobs:
matrix: ${{ fromJSON(needs.matrix_prep.outputs.matrix) }}
fail-fast: false
env:
GEOCODIO_API_KEY: ${{ secrets.GEOCODIO_API_KEY }}
API_KEY_GOOGLE_MAPS: ${{ secrets.API_KEY_GOOGLE_MAPS }}
GITHUB_REF: ${{ github.ref_name }} # This is changed to dev if running on a schedule
steps:
5 changes: 3 additions & 2 deletions README.md
@@ -77,10 +77,11 @@ export GOOGLE_GHA_CREDS_PATH=<path/to/your_credentials.json>
`GOOGLE_GHA_CREDS_PATH` will be mounted into the container so
the GCP APIs in the container can access the data stored in GCP.

You'll also need to set an environment variable for the Google Maps API Key:
You'll also need to set an environment variable for the Geocodio API key. This API key is stored
in the GCP project's Secret Manager as `geocodio-api-key`.

```
export API_KEY_GOOGLE_MAPS={Google Maps API key for GCP project dbcp-dev-350818}
export GEOCODIO_API_KEY={geocodio api key}
```

## Git Pre-commit Hooks
1 change: 1 addition & 0 deletions docker-compose.yaml
@@ -6,6 +6,7 @@ services:
environment:
- API_KEY_GOOGLE_MAPS=${API_KEY_GOOGLE_MAPS} # get this value from our google account: https://console.cloud.google.com/google/maps-apis/credentials?project=dbcp-dev&supportedpurview=project
- AIRTABLE_API_KEY=${AIRTABLE_API_KEY}
- GEOCODIO_API_KEY=${GEOCODIO_API_KEY} # This API key is stored in the GCP project's Secret Manager as geocodio-api-key
depends_on:
postgres:
condition: service_healthy
1 change: 1 addition & 0 deletions requirements.txt
@@ -4,6 +4,7 @@ psycopg2~=2.9.3
pytest~=6.2.5
tqdm>=4.64.1,<5.0.0
python-docx~=0.8.11
pygeocodio~=1.4.0
googlemaps~=4.5.3
pandas-gbq~=0.19.1
pydata-google-auth~=1.7.0
4 changes: 2 additions & 2 deletions src/dbcp/cli.py
@@ -10,7 +10,7 @@
from dbcp.commands.publish import publish_outputs
from dbcp.commands.settings import save_settings
from dbcp.transform.fips_tables import SPATIAL_CACHE
from dbcp.transform.helpers import GEOCODER_CACHE
from dbcp.transform.helpers import GEOCODER_CACHES

logger = logging.getLogger(__name__)

@@ -53,7 +53,7 @@ def cli(loglevel):
def etl(data_mart: bool, data_warehouse: bool, clear_cache: bool):
"""Run the ETL process to produce the data warehouse and mart."""
if clear_cache:
GEOCODER_CACHE.clear()
GEOCODER_CACHES.clear_caches()
SPATIAL_CACHE.clear()

if data_warehouse:
4 changes: 2 additions & 2 deletions src/dbcp/etl.py
@@ -16,7 +16,7 @@
from dbcp.extract.ncsl_state_permitting import NCSLScraper
from dbcp.helpers import enforce_dtypes, psql_insert_copy
from dbcp.transform.fips_tables import SPATIAL_CACHE
from dbcp.transform.helpers import GEOCODER_CACHE
from dbcp.transform.helpers import GEOCODER_CACHES
from dbcp.validation.tests import validate_warehouse

logger = logging.getLogger(__name__)
@@ -244,7 +244,7 @@ def run_etl(funcs: dict[str, Callable], schema_name: str):
def etl():
"""Run dbc ETL."""
# Reduce size of caches if necessary
GEOCODER_CACHE.reduce_size()
GEOCODER_CACHES.reduce_cache_sizes()
SPATIAL_CACHE.reduce_size()

# Run public ETL functions
127 changes: 127 additions & 0 deletions src/dbcp/transform/geocodio.py
@@ -0,0 +1,127 @@
"""Geocodio geocoding functions."""

import os
from pathlib import Path

import pandas as pd
from geocodio import GeocodioClient
from joblib import Memory
bendnorman marked this conversation as resolved.
from pydantic import BaseModel

geocoder_local_cache = Path("/app/data/geocodio_cache")
Collaborator: This sort of path seems ripe for configuration with an env variable instead of hard-coding, e.g. I don't have /app on my computer.

Contributor (author): True! We've been hard-coding these paths because this ETL should only be run in a Docker container, which has a consistent file structure. I think it'd be wise to use an env var anyway. There are lots of locations in the code that reference /app/data, so I'll make this change in a separate PR.
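A minimal sketch of the env-var approach discussed here. The `GEOCODER_CACHE_DIR` variable name is hypothetical; the fallback below uses a temp directory so the snippet runs outside the container, where the real default would be the hard-coded `/app/data/geocodio_cache`:

```python
import os
import tempfile
from pathlib import Path

# Hypothetical env var; outside Docker, fall back to a temp directory
# rather than the container's hard-coded /app/data/geocodio_cache.
default_dir = os.path.join(tempfile.gettempdir(), "geocodio_cache")
cache_dir = Path(os.environ.get("GEOCODER_CACHE_DIR", default_dir))
cache_dir.mkdir(parents=True, exist_ok=True)
```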

# create geocoder_local_cache if it doesn't exist
geocoder_local_cache.mkdir(parents=True, exist_ok=True)
assert geocoder_local_cache.exists()
# cache needs to be accessed outside this module to call .clear()
# limit cache size to 512 KiB (2**19 bytes); evicts least recently accessed entries first
GEOCODER_CACHE = Memory(location=geocoder_local_cache, bytes_limit=2**19)
jdangerx marked this conversation as resolved.
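For context on the joblib `Memory` object above: it persists results to disk keyed on the decorated function's arguments, and `.clear()` wipes the store. A small illustrative sketch (function and variable names invented):

```python
import tempfile
from joblib import Memory

mem = Memory(location=tempfile.mkdtemp(), verbose=0)
calls = []

@mem.cache
def square(x):
    calls.append(x)  # record real invocations so cache hits are visible
    return x * x

assert square(4) == 16
assert square(4) == 16  # second call is served from the on-disk cache
assert calls == [4]
mem.clear()  # analogous to GEOCODER_CACHE.clear() in the CLI
assert square(4) == 16  # recomputed after clearing
assert calls == [4, 4]
```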


class AddressComponents(BaseModel):
"""Address components from Geocodio."""

number: str = ""
Collaborator: Are there any of these fields we would always expect? Country, maybe?

Contributor (author): I don't think the docs say anything about values that will always be present.

predirectional: str = ""
street: str = ""
suffix: str = ""
formatted_street: str = ""
city: str = ""
county: str = ""
state: str = ""
zip: str = "" # noqa: A003
country: str = ""


class Location(BaseModel):
"""Location from Geocodio."""

lat: float = 0.0
Collaborator: Do we ever expect there to be default values here? Seems like we want to enforce that we always get a lat and a long, and never just fill a missing value with 0.

Contributor (author): Good point! Removed the default values for these attributes.
lng: float = 0.0


class AddressData(BaseModel):
"""Address data from Geocodio."""

address_components: AddressComponents
formatted_address: str = ""
location: Location
accuracy: float = 0.0
accuracy_type: str = ""
Collaborator: Looks like Geocodio always gives you an accuracy type and score back, so again these don't need default values. It also seems like accuracy_type is drawn from a limited set of values rather than being an arbitrary string. We could probably tighten up all of these type definitions.
source: str = ""
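As a sketch of how these pydantic models consume a Geocodio-style payload (the payload below is invented for illustration; `parse_obj` fills any field the response omits with the declared default):

```python
from pydantic import BaseModel

class Location(BaseModel):
    lat: float = 0.0
    lng: float = 0.0

class AddressComponents(BaseModel):
    city: str = ""
    county: str = ""
    state: str = ""

class AddressData(BaseModel):
    address_components: AddressComponents
    formatted_address: str = ""
    location: Location
    accuracy: float = 0.0
    accuracy_type: str = ""
    source: str = ""

# Invented payload shaped like one entry of a Geocodio "results" list.
raw = {
    "address_components": {"city": "Austin", "county": "Travis County", "state": "TX"},
    "formatted_address": "Austin, TX",
    "location": {"lat": 30.27, "lng": -97.74},
    "accuracy": 1.0,
    "accuracy_type": "place",
}
ad = AddressData.parse_obj(raw)
assert ad.address_components.county == "Travis County"
assert ad.source == ""  # omitted field falls back to its declared default
```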


def _geocode_batch(
batch: pd.DataFrame, client: GeocodioClient, state_col: str, locality_col: str
) -> pd.DataFrame:
"""Geocode a batch of addresses.

Args:
batch: dataframe with address components
client: GeocodioClient object
state_col: name of the state column
locality_col: name of the locality column

Returns:
dataframe with geocoded locality information
"""
batch["address"] = batch[locality_col] + ", " + batch[state_col]
results = client.geocode(batch["address"].tolist())

results_df = []
for result in results:
if "error" in result:
results_df.append(["", "", ""])
Collaborator: Why include a bunch of rows with empty strings here?

Contributor (author): This is handling the case where the API can't geocode an address. I updated it to be a list of Nones.

elif result["results"]:
ad = AddressData.parse_obj(result["results"][0])
locality_type = ad.accuracy_type
if locality_type == "place":
locality_name = ad.address_components.city
locality_type = "city"
elif locality_type == "county":
locality_name = ad.address_components.county
else:
locality_name = ""
results_df.append(
[locality_name, locality_type, ad.address_components.county]
)
Contributor (author): Might be cleaner to add this logic to the pydantic classes.

Collaborator: Yeah, locality_name could be a computed property, and then results_df could just be:

results_df = pd.DataFrame(
    [
        {"geocoded_locality_name": ad.locality_name, ...}
        for ad in (
            AddressData.parse_obj(res["results"][0])
            for res in results
            if res.get("results")
        )
    ]
)

or the for-loop equivalent, since that's a NASTY comprehension.

else:
results_df.append(["", "", ""])

results_df = pd.DataFrame(
results_df,
columns=[
"geocoded_locality_name",
"geocoded_locality_type",
"geocoded_containing_county",
],
index=batch.index,
)
return results_df


@GEOCODER_CACHE.cache()
def _geocode_locality(
state_locality_df: pd.DataFrame,
state_col: str = "state",
locality_col: str = "county",
batch_size: int = 100,
) -> pd.DataFrame:
"""Geocode locality names in a dataframe.

Args:
state_locality_df: dataframe with state and locality columns
state_col: name of the state column
locality_col: name of the locality column
batch_size: number of rows to geocode at once
Returns:
dataframe with geocoded locality information
"""
GEOCODIO_API_KEY = os.environ["GEOCODIO_API_KEY"]
client = GeocodioClient(GEOCODIO_API_KEY)

geocoded_df = []
Collaborator: Nit: I find it confusing to have variables that aren't pd.DataFrame but end in _df. This happens in the batch-munging code above too.


for start in range(0, len(state_locality_df), batch_size):
batch = state_locality_df.iloc[start : start + batch_size] # noqa: E203
geocoded_df.append(_geocode_batch(batch, client, state_col, locality_col))
return pd.concat(geocoded_df)
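The batching loop at the end of the module follows a common pandas slicing pattern; a self-contained sketch of the same logic (names invented):

```python
import pandas as pd

def iter_batches(df: pd.DataFrame, batch_size: int):
    """Yield consecutive row slices of at most batch_size rows."""
    for start in range(0, len(df), batch_size):
        yield df.iloc[start : start + batch_size]

df = pd.DataFrame({"state": ["TX", "TX", "WA"], "county": ["Travis", "Harris", "King"]})
sizes = [len(batch) for batch in iter_batches(df, 2)]
assert sizes == [2, 1]  # a final partial batch is still emitted
```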
@@ -1,15 +1,27 @@
"""Classes and functions for geocoding address data using Google API."""

import os
from functools import lru_cache
from logging import getLogger
from pathlib import Path
from typing import Dict, List, Optional
from warnings import warn

import googlemaps
import pandas as pd
from joblib import Memory

logger = getLogger(__name__)


geocoder_local_cache = Path("/app/data/google_geocoder_cache")
Collaborator: Same comments in this block as before: this hardcoded path seems ripe for env vars, and 2**19 B is not 100 KB.

geocoder_local_cache.mkdir(parents=True, exist_ok=True)
assert geocoder_local_cache.exists()
# cache needs to be accessed outside this module to call .clear()
# limit cache size to 512 KiB (2**19 bytes); evicts least recently accessed entries first
GEOCODER_CACHE = Memory(location=geocoder_local_cache, bytes_limit=2**19)


class GoogleGeocoder(object):
"""Class to interact with Google's Geocoding API."""

@@ -202,3 +214,61 @@ def _get_geocode_response(
return response[0]
except IndexError: # empty list = not found
return {}


def _geocode_row(
ser: pd.Series, client: GoogleGeocoder, state_col="state", locality_col="county"
) -> List[str]:
"""Function to pass into pandas df.apply() to geocode state/locality pairs.

Args:
ser (pd.Series): a row of a larger dataframe to geocode
client (GoogleGeocoder): client for Google Maps Platform API
state_col (str, optional): name of the column of state names. Defaults to 'state'.
locality_col (str, optional): name of the column of locality names. Defaults to 'county'.

Returns:
List[str]: geocoded_locality_name, geocoded_locality_type, and geocoded_containing_county
"""
client.geocode_request(name=ser[locality_col], state=ser[state_col])
return client.describe()
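The way `df.apply(..., result_type="expand")` turns a list-returning row function into new columns can be sketched with a stub in place of the real geocoder (the stub and its outputs are invented):

```python
import pandas as pd

def fake_geocode_row(ser: pd.Series) -> list:
    # Stub standing in for _geocode_row: [locality_name, locality_type, county]
    return [ser["county"], "county", ser["county"] + " County"]

df = pd.DataFrame({"state": ["TX", "WA"], "county": ["Travis", "King"]})
new_cols = df.apply(fake_geocode_row, axis=1, result_type="expand")
new_cols.columns = [
    "geocoded_locality_name",
    "geocoded_locality_type",
    "geocoded_containing_county",
]
assert list(new_cols.loc[1]) == ["King", "county", "King County"]
```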


@GEOCODER_CACHE.cache()
def _geocode_locality(
state_locality_df: pd.DataFrame, state_col="state", locality_col="county"
) -> pd.DataFrame:
"""Use Google Maps Platform API to look up information about state/locality pairs in a dataframe.

Args:
state_locality_df (pd.DataFrame): dataframe with state and locality columns
state_col (str, optional): name of the column of state names. Defaults to 'state'.
locality_col (str, optional): name of the column of locality names. Defaults to 'county'.

Returns:
pd.DataFrame: new columns 'geocoded_locality_name', 'geocoded_locality_type', 'geocoded_containing_county'
"""
# NOTE: the purpose of the cache decorator is primarily to
# reduce API calls during development. A secondary benefit is to reduce
# execution time due to slow synchronous requests.
# That's why this is persisted to disk with joblib, not in memory with LRU_cache or something.
# Because it is on disk, caching the higher level dataframe function causes less IO overhead
# than caching individual API calls would.
# Because the entire input dataframe must be identical to the cached version, I
# recommend subsetting the dataframe to only state_col and locality_col when calling
# this function. That allows other, unrelated columns to change but still use the geocode cache.
geocoder = GoogleGeocoder()
new_cols = state_locality_df.apply(
_geocode_row,
axis=1,
result_type="expand",
client=geocoder,
state_col=state_col,
locality_col=locality_col,
)
new_cols.columns = [
"geocoded_locality_name",
"geocoded_locality_type",
"geocoded_containing_county",
]
return new_cols
Comment on lines +217 to +274

Contributor (author): I moved this logic from dbcp.transform.helpers because it is specific to the Google Maps API.
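The docstring's advice to subset the input dataframe matters because joblib keys the cache on the entire argument; a small demonstration with a stand-in function in place of the real geocoder (all names invented):

```python
import tempfile
import pandas as pd
from joblib import Memory

memory = Memory(location=tempfile.mkdtemp(), verbose=0)
api_calls = []

@memory.cache
def geocode(df: pd.DataFrame) -> int:
    api_calls.append(len(df))  # stand-in for the expensive API round-trip
    return len(df)

projects = pd.DataFrame({"state": ["TX"], "county": ["Travis"], "capacity_mw": [100.0]})
geocode(projects[["state", "county"]])
projects["capacity_mw"] = 200.0  # an unrelated column changes...
geocode(projects[["state", "county"]])  # ...but the subset is identical: cache hit
assert api_calls == [1]  # the "API" was only called once
```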

4 changes: 2 additions & 2 deletions src/dbcp/transform/gridstatus.py
@@ -478,7 +478,7 @@ def _clean_resource_type(
resource_locations["county_id_fips"].isin(coastal_county_id_fips.keys())
& resource_locations.resource_clean.eq("Onshore Wind")
].project_id
expected_n_coastal_wind_projects = 88
expected_n_coastal_wind_projects = 81
assert (
len(nyiso_coastal_wind_project_project_ids) == expected_n_coastal_wind_projects
), f"Expected {expected_n_coastal_wind_projects} NYISO coastal wind projects but found {len(nyiso_coastal_wind_project_project_ids)}"
@@ -1120,7 +1120,7 @@ def transform(raw_dfs: dict[str, pd.DataFrame]) -> dict[str, pd.DataFrame]:
intermediate_creator=_prep_for_deduplication,
)
dupes = pre_dedupe - len(deduped_projects)
logger.info(f"Deduplicated {dupes} ({dupes/pre_dedupe:.2%}) projects.")
logger.info(f"Deduplicated {dupes} ({dupes / pre_dedupe:.2%}) projects.")

# Normalize data
(