Use geocodio instead of google maps #394
base: dev
Conversation
…eocodio instead of components data
src/dbcp/transform/geocodio.py
Outdated
```python
locality_type = ad.accuracy_type
if locality_type == "place":
    locality_name = ad.address_components.city
    locality_type = "city"
elif locality_type == "county":
    locality_name = ad.address_components.county
else:
    locality_name = ""
results_df.append(
    [locality_name, locality_type, ad.address_components.county]
)
```
Might be cleaner to add this logic to the pydantic classes.
Yeah, `locality_name` could be a computed property, and then `results_df` could just be:
```python
results_df = pd.DataFrame(
    [
        {"geocoded_locality_name": ad.locality_name, ...}
        for res in results
        if "results" in res
        if (ad := AddressData.parse_obj(res["results"][0]))
    ]
)
```

or the for-loop equivalent, since that's a NASTY comprehension.
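A minimal sketch of the computed-property idea, using stdlib dataclasses in place of the real pydantic models (class and field names follow the diff above, but the models here are simplified stand-ins, not the PR's actual code):

```python
from dataclasses import dataclass, field


@dataclass
class AddressComponents:
    city: str = ""
    county: str = ""


@dataclass
class AddressData:
    accuracy_type: str = ""
    address_components: AddressComponents = field(default_factory=AddressComponents)

    @property
    def locality_type(self) -> str:
        # Geocodio's "place" accuracy type corresponds to a city
        return "city" if self.accuracy_type == "place" else self.accuracy_type

    @property
    def locality_name(self) -> str:
        if self.accuracy_type == "place":
            return self.address_components.city
        if self.accuracy_type == "county":
            return self.address_components.county
        return ""


ad = AddressData("place", AddressComponents(city="Chesapeake", county="Chesapeake city"))
print(ad.locality_name, ad.locality_type)  # Chesapeake city
```

With the branching logic on the model, the comprehension body collapses to attribute access.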
```python
def _geocode_row(
    ser: pd.Series, client: GoogleGeocoder, state_col="state", locality_col="county"
) -> List[str]:
    """Function to pass into pandas df.apply() to geocode state/locality pairs.

    Args:
        ser (pd.Series): a row of a larger dataframe to geocode
        client (GoogleGeocoder): client for Google Maps Platform API
        state_col (str, optional): name of the column of state names. Defaults to 'state'.
        locality_col (str, optional): name of the column of locality names. Defaults to 'county'.

    Returns:
        List[str]: geocoded_locality_name, geocoded_locality_type, and geocoded_containing_county
    """
    client.geocode_request(name=ser[locality_col], state=ser[state_col])
    return client.describe()
```
```python
@GEOCODER_CACHE.cache()
def _geocode_locality(
    state_locality_df: pd.DataFrame, state_col="state", locality_col="county"
) -> pd.DataFrame:
    """Use Google Maps Platform API to look up information about state/locality pairs in a dataframe.

    Args:
        state_locality_df (pd.DataFrame): dataframe with state and locality columns
        state_col (str, optional): name of the column of state names. Defaults to 'state'.
        locality_col (str, optional): name of the column of locality names. Defaults to 'county'.

    Returns:
        pd.DataFrame: new columns 'geocoded_locality_name', 'geocoded_locality_type', 'geocoded_containing_county'
    """
    # NOTE: the purpose of the cache decorator is primarily to
    # reduce API calls during development. A secondary benefit is to reduce
    # execution time due to slow synchronous requests.
    # That's why this is persisted to disk with joblib, not in memory with LRU_cache or something.
    # Because it is on disk, caching the higher level dataframe function causes less IO overhead
    # than caching individual API calls would.
    # Because the entire input dataframe must be identical to the cached version, I
    # recommend subsetting the dataframe to only state_col and locality_col when calling
    # this function. That allows other, unrelated columns to change but still use the geocode cache.
    geocoder = GoogleGeocoder()
    new_cols = state_locality_df.apply(
        _geocode_row,
        axis=1,
        result_type="expand",
        client=geocoder,
        state_col=state_col,
        locality_col=locality_col,
    )
    new_cols.columns = [
        "geocoded_locality_name",
        "geocoded_locality_type",
        "geocoded_containing_county",
    ]
    return new_cols
```
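The subsetting advice in that NOTE can be illustrated with a toy stand-in for the joblib-cached function (`fake_geocode_locality` and its dict cache are simplifications invented here, not the real joblib behavior): because the cache key covers the entire input dataframe, dropping unrelated columns before the call keeps the cache warm even as those columns change.

```python
import pandas as pd

calls = []


def fake_geocode_locality(state_locality_df: pd.DataFrame, _cache={}) -> pd.DataFrame:
    # Toy stand-in for the joblib-cached _geocode_locality: like the disk
    # cache, the key is derived from the *entire* input dataframe.
    key = tuple(map(tuple, state_locality_df.itertuples(index=True)))
    if key not in _cache:
        calls.append(key)  # pretend this is a slow batch of API calls
        _cache[key] = pd.DataFrame(
            {"geocoded_containing_county": state_locality_df["county"]},
            index=state_locality_df.index,
        )
    return _cache[key]


df = pd.DataFrame({"state": ["VA"], "county": ["Chesapeake"], "capacity_mw": [1.0]})
out1 = fake_geocode_locality(df[["state", "county"]])
df["capacity_mw"] = 99.0  # an unrelated column changes...
out2 = fake_geocode_locality(df[["state", "county"]])  # ...but the cache still hits
print(len(calls))  # 1 -> only one "API" pass despite two calls
```

Had the caller passed the full `df`, the second call would have missed the cache.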
I moved this logic from `dbcp.transform.helpers` because it is specific to the Google Maps API.
```diff
-    ser: pd.Series, client: GoogleGeocoder, state_col="state", locality_col="county"
-) -> List[str]:
-    """Function to pass into pandas df.apply() to geocode state/locality pairs.
+def _geocode_and_add_fips(
```
I moved some of the geocoding logic around so I could compare the Google and Geocodio results.
src/dbcp/transform/helpers.py
Outdated
```diff
@@ -332,7 +356,8 @@ def add_county_fips_with_backup_geocoding(
         with_fips["geocoded_locality_name"] = with_fips[locality_col]
         with_fips["geocoded_locality_type"] = "county"
         with_fips["geocoded_containing_county"] = with_fips[locality_col]
-    return with_fips
+    # attach to original df
```
This was a bug. We need to add the original df back to the fipsified dataframe.
src/dbcp/transform/helpers.py
Outdated
```python
if debug:
    google_df = _geocode_and_add_fips(
        nan_fips, state_col=state_col, locality_col=locality_col, api="google"
    )

nan_fips = pd.concat([nan_fips, geocoded], axis=1)
# add fips using geocoded names
filled_fips = add_fips_ids(
    nan_fips,
    state_col=state_col,
    county_col="geocoded_containing_county",
    vintage=FIPS_CODE_VINTAGE,
)
# combine the two geocoded dataframes
comp = geocodio_df.merge(
    google_df,
    left_index=True,
    right_index=True,
    how="outer",
    validate="1:1",
    suffixes=("_geocodio", "_google"),
)

county_eq = comp.geocoded_containing_county_geocodio.eq(
    comp.geocoded_containing_county_google
)
logger.info("---------------------")
logger.info(
    f"---- pct of geocoded fip failures that don't match: {(~county_eq).sum() / len(comp)}"
)
logger.info(
    f"---- pct of all records that don't have the same county: {(~county_eq).sum() / len(state_locality_df)}"
)
logger.info("---------------------")

filled_fips = geocodio_df
```
This is logic to compare Google and Geocodio. I figured we could remove the Google logic once the changes have settled.
Sweet, we're getting there! Some non-blocking suggestions about tightening up the type definitions etc.
It would also maybe be nice to add some tests of the higher-level functionality beyond just "does the API return what we expect?" (which is also a valuable test!)
I think it's fine to leave the Google code in, I would probably pull it out once we've satisfied ourselves with the comparisons.
I do have a blocking question about the validation tests - is this the right interpretation of the results?
- Out of all N locations in the gridstatus ETL, there were M locations that didn't have a FIPS code from `addfips`. Of those M, 24% had a different FIPS code when using Geocodio vs. Google. 2% of the N total locations had a mismatched FIPS code. Which also means that M is about 1/12 of N.
- Also, for the offshore wind data, that means that none of the original locations got their FIPS code from `addfips`, and 10% of the geocoded results are different between Geocodio and Google.
If that's all true, the diff does seem kind of high to me, but also causes a pretty small absolute change in codes. I would like to see, for instances where Google and Geocodio disagree, what the actual disagreement is - are they geocoding to counties that are right next to each other? Where is the actual lat/long, is it close to a county edge?
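One way to quantify "close to a county edge" is the great-circle distance between the two geocoders' points. A self-contained sketch (the coordinates below are illustrative, not from the actual validation results):

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Two hypothetical geocoder results ~0.1 degrees of latitude apart:
d = haversine_km(36.7682, -76.2875, 36.8682, -76.2875)
print(round(d, 1))  # 11.1 -> km; easily small enough to straddle a county line
```

Bucketing the disagreeing rows by this distance would show whether the mismatches are adjacent-county near-misses or genuinely different places.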
```python
from joblib import Memory
from pydantic import BaseModel

geocoder_local_cache = Path("/app/data/geocodio_cache")
```
This sort of path seems ripe for configuration with an env variable instead of hard-coding. e.g. I don't have `/app` on my computer.
True! We've been hard coding these paths because this ETL should only be run in a docker container which has a consistent file structure. I think it'd be wise to use an env var anyway! There are lots of locations in the code that reference `/app/data`, so I'll make this change in a separate PR.
src/dbcp/transform/geocodio.py
Outdated
```python
class Location(BaseModel):
    """Location from Geocodio."""

    lat: float = 0.0
```
Do we ever expect there to be default values here? Seems like we want to enforce that we always get a lat and a long, and never just fill a missing value with 0...
Good point! Removed the default value for these attributes.
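For reference, removing the defaults makes pydantic fail loudly instead of silently filling zeros. A minimal sketch (simplified `Location` with just two required fields; works under pydantic v1 or v2):

```python
from pydantic import BaseModel, ValidationError


class Location(BaseModel):
    """Location from Geocodio; no defaults, so missing coordinates raise."""

    lat: float
    lng: float


try:
    Location(lat=36.77)  # lng missing -> ValidationError
    raised = False
except ValidationError:
    raised = True
print(raised)  # True
```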
src/dbcp/transform/geocodio.py
Outdated
```python
class AddressComponents(BaseModel):
    """Address components from Geocodio."""

    number: str = ""
```
Are there any of these fields we would always expect? Country maybe?
I don't think the docs say anything about values that will always be present.
src/dbcp/transform/geocodio.py
Outdated
```python
GEOCODIO_API_KEY = os.environ["GEOCODIO_API_KEY"]
client = GeocodioClient(GEOCODIO_API_KEY)

geocoded_df = []
```
nit: I find it confusing to have variables that aren't `pd.DataFrame` that end in `_df`. This happens in the batch munging code above too.
```python
logger = getLogger("__name__")

geocoder_local_cache = Path("/app/data/google_geocoder_cache")
```
Same comments in this block as before - this hardcoded path seems ripe for env vars, and 2**19 B is not 100 KB.
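For the record, the arithmetic behind that size nit:

```python
print(2**19)          # 524288 -> 2**19 bytes
print(2**19 // 1024)  # 512 -> i.e. 512 KiB, not 100 KB
print(100 * 1000)     # 100000 -> what 100 KB actually is, in bytes
```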
```python
    Args:
        nan_fips: dataframe with state and locality columns
```
nan_fips is a dataframe only containing the state/locality combos that don't have FIPS from `addfips`, right?
```python
        state_col (str, optional): name of the column of state names. Defaults to 'state'.
        locality_col (str, optional): name of the column of locality names. Defaults to 'county'.
    # recombine deduped geocoded data with original nan_fips
    geocoded_deduped_nan_fips = pd.concat(
```
Is the reason we put the empty rows in while geocoding so that this concat can still work?
Yes it's so we can preserve the index. Do you think it'd be cleaner to not add empty rows?
Yeah, I think so, and intuitively seems like it would make the joining a little simpler. But it's not a blocking concern I think - refactoring would require (a) adding tests for this behavior and then (b) figuring out what sequence of joins etc. works for this use case, so it's not like it's a 5-minute dealio. Might be a good follow-up PR though.
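A sketch of what that follow-up refactor might look like: geocode only the rows that resolved, and let a left join restore the failures as NaN instead of pre-padding empty rows (toy data; the real function's columns and joins may differ):

```python
import pandas as pd

# Two state/locality rows that addfips couldn't resolve; only the first geocodes.
nan_fips = pd.DataFrame({"state": ["VA", "ZZ"], "county": ["Chesapeake", "Nowhere"]})
geocoded = pd.DataFrame(
    {"geocoded_containing_county": ["Chesapeake City"]},
    index=[0],  # only the successful row, original index preserved
)
# Left join on the index brings the failed row back as NaN.
recombined = nan_fips.join(geocoded, how="left")
print(recombined["geocoded_containing_county"].tolist())  # ['Chesapeake City', nan]
```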
test/unit/test_geocoding.py
Outdated
Should we also test `add_county_fips_with_backup_geocoding` to make sure that we're combining the addfips results with geocoding results properly?
Your understanding of the metrics is correct. I can do some more digging as to why Google and Geocodio produce different results, but I couldn't find any consistent patterns. What were you thinking for additional tests? Here are some ideas:
OK, if you've already done that digging I'm good to merge - we should probably mention that to DBCP to see if they want us to dig more or not. As for more tests, I do think that testing
… duplicate code in the function
Good news! In the previous validation analysis, I compared county names instead of FIPS codes. Turns out Geocodio and Google are geocoding to the correct entity but they use different spellings. For example, in the GS and LBNL data there were about a hundred locations in Virginia that geocode to independent cities. Google geocodes "City of Chesapeake" to "Chesapeake City" and Geocodio geocodes it to "Chesapeake". `addfips` returns the same FIPS code for both of these. The new results are much more acceptable.
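A toy illustration of the revised check: comparing FIPS codes instead of name spellings treats "Chesapeake" and "Chesapeake City" as a match. The rows and column names below are fabricated for illustration (51550 is the FIPS code for Chesapeake city, VA):

```python
import pandas as pd

comp = pd.DataFrame(
    {
        "county_geocodio": ["Chesapeake", "Fairfax"],
        "county_google": ["Chesapeake City", "Fairfax"],
        "county_id_fips_geocodio": ["51550", "51059"],
        "county_id_fips_google": ["51550", "51059"],
    }
)
name_mismatch = comp["county_geocodio"].ne(comp["county_google"])
fips_mismatch = comp["county_id_fips_geocodio"].ne(comp["county_id_fips_google"])
print(name_mismatch.sum(), fips_mismatch.sum())  # 1 0
```

The name comparison flags a spurious disagreement that the FIPS comparison correctly ignores.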
This PR replaces the Google Maps geocoder with the Geocodio API. Geocodio is cheaper and allows us to cache API calls.
Validation checks

- `counties_wide_format` have the same `renewable_and_battery_proposed_capacity_mw` when geocoded with Google and Geocodio.

These all feel like acceptable differences to me. Typically, a small percentage of locations need to be geocoded because `addFips` does most of the work. The offshore wind locations might be worth looking into but we're about to update this dataset so I figured we could do it then.

Questions