Use geocodio instead of google maps #394

Open · wants to merge 11 commits into base: dev
Conversation

bendnorman (Contributor) commented Jan 22, 2025:

This PR replaces the Google Maps geocoder with the Geocodio API. Geocodio is cheaper and allows us to cache API calls.

Validation checks

  • 97% of counties in counties_wide_format have the same renewable_and_battery_proposed_capacity_mw when geocoded with Google and Geocodio.
  • Here are some additional metrics for ETLs that require geocoding. This table compares the share of locations whose FIPS codes differ after being geocoded by Google vs. Geocodio:

| dataset | % of locations that don't have matching FIPS codes after addfips + geocoding | % of locations that needed geocoding that don't have matching FIPS codes |
| --- | --- | --- |
| offshore_wind | 10% | 10% |
| gridstatus | 2% | 24% |
| eip_infrastructure | 1% | 15% |
| columbia_local_opp | 4% | 10% |
| nrel_wind_solar_ordinances | 1% | 18% |
| lbnl_iso_queue | 1% | 36% |

These all feel like acceptable differences to me. Typically, a small percentage of locations need to be geocoded because addFips does most of the work. The offshore wind locations might be worth looking into but we're about to update this dataset so I figured we could do it then.
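For context on how the two percentages relate, here is a toy sketch of the comparison (column names and data are illustrative, not the actual validation code):

```python
import pandas as pd

# Toy comparison frame: one row per location, with the FIPS code each
# geocoder produced. "needed_geocoding" marks rows addfips could not resolve.
comp = pd.DataFrame(
    {
        "fips_google": ["17031", "51550", "06037", "36061", "06037"],
        "fips_geocodio": ["17031", "51550", "06037", "36061", "06075"],
        "needed_geocoding": [True, True, False, False, True],
    }
)

geocoded = comp[comp["needed_geocoding"]]
mismatch = geocoded["fips_google"].ne(geocoded["fips_geocodio"])

pct_of_all = mismatch.sum() / len(comp)           # mismatches over every location
pct_of_geocoded = mismatch.sum() / len(geocoded)  # mismatches over geocoded locations
print(pct_of_all, pct_of_geocoded)  # 0.2 0.333...
```

The first column of the table above corresponds to `pct_of_all`, the second to `pct_of_geocoded`.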

Questions

  • Do you think these validation checks are reasonable? Anything you think I should dig into a little more?
  • Do you think it's worth keeping the Google Maps API logic around?

Comment on lines 76 to 86

```python
locality_type = ad.accuracy_type
if locality_type == "place":
    locality_name = ad.address_components.city
    locality_type = "city"
elif locality_type == "county":
    locality_name = ad.address_components.county
else:
    locality_name = ""
results_df.append(
    [locality_name, locality_type, ad.address_components.county]
)
```
Contributor Author:
Might be cleaner to add this logic to the pydantic classes.

Collaborator:

Yeah, locality_name could be a computed property, and then results_df could just be:

```python
results_df = pd.DataFrame(
    [
        {"geocoded_locality_name": ad.locality_name, ...}
        for res in results
        if "results" in res
        if (ad := AddressData.parse_obj(res["results"][0]))
    ]
)
```

or the for-loop equivalent since that's a NASTY comprehension.
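A sketch of the computed-property idea plus the for-loop equivalent (the class and field names here are illustrative toy versions, not the PR's actual models):

```python
import pandas as pd
from pydantic import BaseModel


class AddressComponents(BaseModel):
    city: str = ""
    county: str = ""


class AddressData(BaseModel):
    accuracy_type: str = ""
    address_components: AddressComponents

    @property
    def locality_type(self) -> str:
        # "place" accuracy maps to a city; other types pass through unchanged
        return "city" if self.accuracy_type == "place" else self.accuracy_type

    @property
    def locality_name(self) -> str:
        if self.accuracy_type == "place":
            return self.address_components.city
        if self.accuracy_type == "county":
            return self.address_components.county
        return ""


# toy payloads standing in for raw Geocodio responses
results = [
    {"results": [{"accuracy_type": "place",
                  "address_components": {"city": "Chicago",
                                         "county": "Cook County"}}]},
    {"error": "no match"},  # skipped: no "results" key
]

rows = []
for res in results:
    if not res.get("results"):
        continue
    ad = AddressData.parse_obj(res["results"][0])
    rows.append(
        {
            "geocoded_locality_name": ad.locality_name,
            "geocoded_locality_type": ad.locality_type,
            "geocoded_containing_county": ad.address_components.county,
        }
    )
results_df = pd.DataFrame(rows)
```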

Comment on lines +217 to +274


```python
def _geocode_row(
    ser: pd.Series, client: GoogleGeocoder, state_col="state", locality_col="county"
) -> List[str]:
    """Function to pass into pandas df.apply() to geocode state/locality pairs.

    Args:
        ser (pd.Series): a row of a larger dataframe to geocode
        client (GoogleGeocoder): client for Google Maps Platform API
        state_col (str, optional): name of the column of state names. Defaults to 'state'.
        locality_col (str, optional): name of the column of locality names. Defaults to 'county'.

    Returns:
        List[str]: geocoded_locality_name, geocoded_locality_type, and geocoded_containing_county
    """
    client.geocode_request(name=ser[locality_col], state=ser[state_col])
    return client.describe()


@GEOCODER_CACHE.cache()
def _geocode_locality(
    state_locality_df: pd.DataFrame, state_col="state", locality_col="county"
) -> pd.DataFrame:
    """Use Google Maps Platform API to look up information about state/locality pairs in a dataframe.

    Args:
        state_locality_df (pd.DataFrame): dataframe with state and locality columns
        state_col (str, optional): name of the column of state names. Defaults to 'state'.
        locality_col (str, optional): name of the column of locality names. Defaults to 'county'.

    Returns:
        pd.DataFrame: new columns 'geocoded_locality_name', 'geocoded_locality_type', 'geocoded_containing_county'
    """
    # NOTE: the purpose of the cache decorator is primarily to
    # reduce API calls during development. A secondary benefit is to reduce
    # execution time due to slow synchronous requests.
    # That's why this is persisted to disk with joblib, not in memory with LRU_cache or something.
    # Because it is on disk, caching the higher level dataframe function causes less IO overhead
    # than caching individual API calls would.
    # Because the entire input dataframe must be identical to the cached version, I
    # recommend subsetting the dataframe to only state_col and locality_col when calling
    # this function. That allows other, unrelated columns to change but still use the geocode cache.
    geocoder = GoogleGeocoder()
    new_cols = state_locality_df.apply(
        _geocode_row,
        axis=1,
        result_type="expand",
        client=geocoder,
        state_col=state_col,
        locality_col=locality_col,
    )
    new_cols.columns = [
        "geocoded_locality_name",
        "geocoded_locality_type",
        "geocoded_containing_county",
    ]
    return new_cols
```
Contributor Author:

I moved this logic from dbcp.transform.helpers because it is specific to the Google Maps API.
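The caching NOTE in the moved code is worth illustrating: joblib hashes the entire input dataframe, so subsetting to just the geocoding columns keeps the cache warm when unrelated columns change. A runnable toy sketch with a stand-in for the geocoder (not the real `_geocode_locality`):

```python
import tempfile

import pandas as pd
from joblib import Memory

memory = Memory(tempfile.mkdtemp(), verbose=0)
calls = {"n": 0}


@memory.cache
def fake_geocode(state_locality_df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for _geocode_locality that counts real invocations."""
    calls["n"] += 1
    return state_locality_df.assign(
        geocoded_containing_county=state_locality_df["county"]
    )


df = pd.DataFrame({"state": ["VA"], "county": ["Chesapeake"], "capacity_mw": [1.0]})

fake_geocode(df[["state", "county"]])
df["capacity_mw"] = 99.0  # an unrelated column changes...
fake_geocode(df[["state", "county"]])  # ...but the subset hashes the same: cache hit

print(calls["n"])  # 1 -- the second call never reached the (fake) API
```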

```python
ser: pd.Series, client: GoogleGeocoder, state_col="state", locality_col="county"
) -> List[str]:
    """Function to pass into pandas df.apply() to geocode state/locality pairs.
def _geocode_and_add_fips(
```
Contributor Author:

I moved some of the geocoding logic around so I could compare the Google and Geocodio results.

```
@@ -332,7 +356,8 @@ def add_county_fips_with_backup_geocoding(
    with_fips["geocoded_locality_name"] = with_fips[locality_col]
    with_fips["geocoded_locality_type"] = "county"
    with_fips["geocoded_containing_county"] = with_fips[locality_col]
    return with_fips
# attach to original df
```
Contributor Author:

This was a bug. We need to add the original df back to the fipsified dataframe.

Comment on lines 375 to 402
```python
if debug:
    google_df = _geocode_and_add_fips(
        nan_fips, state_col=state_col, locality_col=locality_col, api="google"
    )

nan_fips = pd.concat([nan_fips, geocoded], axis=1)
# add fips using geocoded names
filled_fips = add_fips_ids(
    nan_fips,
    state_col=state_col,
    county_col="geocoded_containing_county",
    vintage=FIPS_CODE_VINTAGE,
)
# combine the two geocoded dataframes
comp = geocodio_df.merge(
    google_df,
    left_index=True,
    right_index=True,
    how="outer",
    validate="1:1",
    suffixes=("_geocodio", "_google"),
)

county_eq = comp.geocoded_containing_county_geocodio.eq(
    comp.geocoded_containing_county_google
)
logger.info("---------------------")
logger.info(
    f"---- pct of geocoded fip failures that don't match: {(~county_eq).sum() / len(comp)}"
)
logger.info(
    f"---- pct of all records that don't have the same county: {(~county_eq).sum() / len(state_locality_df)}"
)
logger.info("---------------------")

filled_fips = geocodio_df
```
Contributor Author:

This is logic to compare Google and Geocodio. I figured we could remove the Google logic once the changes have settled.

@bendnorman bendnorman requested a review from jdangerx January 22, 2025 20:48
@bendnorman bendnorman marked this pull request as ready for review January 22, 2025 21:00
jdangerx (Collaborator) left a comment:

Sweet, we're getting there! Some non-blocking suggestions about tightening up the type definitions etc.

It would also maybe be nice to add some tests of the higher-level functionality beyond just "does the API return what we expect?" (which is also a valuable test!)

I think it's fine to leave the Google code in, I would probably pull it out once we've satisfied ourselves with the comparisons.

I do have a blocking question about the validation tests - is this the right interpretation of the results?

  • Out of all N locations in the gridstatus ETL, there were M locations that didn't get a FIPS code from addfips. Of those M, 24% had a different FIPS code from Geocodio vs. Google, and those mismatches amount to 2% of the N total locations. Which also means M is about 1/12 of N.

  • Also, for the offshore wind data, that means that none of the original locations got their FIPS code from addfips - and 10% of the geocoded results are different between Geocodio and Google.

If that's all true, the diff does seem kind of high to me, but also causes a pretty small absolute change in codes. I would like to see, for instances where Google and Geocodio disagree, what the actual disagreement is - are they geocoding to counties that are right next to each other? Where is the actual lat/long, is it close to a county edge?

src/dbcp/transform/geocodio.py (2 resolved review threads)
```python
from joblib import Memory
from pydantic import BaseModel

geocoder_local_cache = Path("/app/data/geocodio_cache")
```
Collaborator:

This sort of path seems ripe for configuration with an env variable instead of hard-coding. e.g. I don't have /app on my computer.

Contributor Author:

True! We've been hard coding these paths because this ETL should only be run in a docker container which has a consistent file structure. I think it'd be wise to use an env var anyway! There are lots of locations in the code that reference /app/data so I'll make this change in a separate PR.
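A sketch of the env-var fallback being discussed (`GEOCODIO_CACHE_DIR` is a hypothetical variable name; the Docker path stays as the default so container runs behave as before):

```python
import os
from pathlib import Path

# Hypothetical env var; falls back to the hard-coded Docker path.
geocoder_local_cache = Path(
    os.environ.get("GEOCODIO_CACHE_DIR", "/app/data/geocodio_cache")
)
```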

```python
class Location(BaseModel):
    """Location from Geocodio."""

    lat: float = 0.0
```
Collaborator:

Do we ever expect there to be default values here? Seems like we want to enforce that we always get a lat and a long, and never just fill a missing value with 0...

Contributor Author:

Good point! Removed the default value for these attributes.
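With the defaults removed, pydantic enforces the invariant for free; a minimal sketch (assuming Geocodio's response keys are `lat`/`lng`):

```python
from pydantic import BaseModel, ValidationError


class Location(BaseModel):
    """Location from Geocodio; coordinates are now required."""

    lat: float
    lng: float  # "lng" matches Geocodio's response key


try:
    Location(lat=36.77)  # missing lng now raises instead of defaulting to 0.0
    raised = False
except ValidationError:
    raised = True
```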

```python
class AddressComponents(BaseModel):
    """Address components from Geocodio."""

    number: str = ""
```
Collaborator:

Are there any of these fields we would always expect? Country maybe?

Contributor Author:

I don't think the docs say anything about values that will always be present.

```python
GEOCODIO_API_KEY = os.environ["GEOCODIO_API_KEY"]
client = GeocodioClient(GEOCODIO_API_KEY)

geocoded_df = []
```
Collaborator:

nit: I find it confusing to have variables that aren't pd.DataFrame that end in _df. This happens in the batch munging code above too.


```python
logger = getLogger("__name__")


geocoder_local_cache = Path("/app/data/google_geocoder_cache")
```
Collaborator:

Same comments in this block as before - this hardcoded path seems ripe for env vars, and 2**19 B is not 100 KB.
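To spell out that size nit (assuming the constant is a byte limit on the cache):

```python
# 2**19 bytes is 512 KiB (about 524 kB), not 100 KB.
bytes_limit = 2**19
print(bytes_limit)         # 524288
print(bytes_limit / 1024)  # 512.0
```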


```python
Args:
    nan_fips: dataframe with state and locality columns
```
Collaborator:

nan_fips is a dataframe only containing the state/locality combos that don't have FIPS from addfips, right?

```python
state_col (str, optional): name of the column of state names. Defaults to 'state'.
locality_col (str, optional): name of the column of locality names. Defaults to 'county'.
# recombine deduped geocoded data with original nan_fips
geocoded_deduped_nan_fips = pd.concat(
```
Collaborator:

Is the reason we put the empty rows in while geocoding so that this concat can still work?

Contributor Author:

Yes it's so we can preserve the index. Do you think it'd be cleaner to not add empty rows?

Collaborator:

Yeah, I think so, and intuitively seems like it would make the joining a little simpler. But it's not a blocking concern I think - refactoring would require (a) adding tests for this behavior and then (b) figuring out what sequence of joins etc. works for this use case, so it's not like it's a 5-minute dealio. Might be a good follow-up PR though.
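If the placeholder rows were dropped, the recombination could lean on index alignment instead; a toy sketch (names and data are illustrative):

```python
import pandas as pd

# nan_fips: rows that addfips could not resolve (index preserved from parent df)
nan_fips = pd.DataFrame(
    {"state": ["VA", "VA"], "county": ["Chesapeake", "Nowhere"]}, index=[3, 7]
)

# Geocoded output for only the rows that returned results -- no empty padding,
# just the original index values.
geocoded = pd.DataFrame({"geocoded_containing_county": ["Chesapeake"]}, index=[3])

# A left join by index re-aligns partial results without placeholder rows.
recombined = nan_fips.join(geocoded, how="left")
print(recombined)
```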

Collaborator:

Should we also test add_county_fips_with_backup_geocoding to make sure that we're combining the addfips results with geocoding results properly?

bendnorman (Contributor Author) commented Jan 24, 2025:

Your understanding of the metrics is correct. I did some more digging into why Google and Geocodio produce different results but couldn't find any consistent patterns.

What were you thinking for additional tests? Here are some ideas:

  • Make sure we're catching geocodio authentication errors properly
  • The geocodio pydantic class types are working as expected
  • Your suggestion: test that we're combining the addfips and geocoded results correctly
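For that last bullet, the property such a test would pin down is roughly "addfips wins where it produced a code; geocoding fills the gaps." A toy sketch of that combination logic (not the actual `add_county_fips_with_backup_geocoding` implementation):

```python
import pandas as pd


def combine_fips(addfips_result: pd.Series, geocoded_result: pd.Series) -> pd.Series:
    # Prefer the addfips code; fall back to the geocoded code where missing.
    return addfips_result.combine_first(geocoded_result)


addfips_result = pd.Series(["17031", None, "51550"])
geocoded_result = pd.Series([None, "06075", "99999"])  # geocoder disagrees on row 2

combined = combine_fips(addfips_result, geocoded_result)
print(combined.tolist())  # ['17031', '06075', '51550']
```

A real test would call the library function on a small dataframe and assert this precedence holds end to end.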

jdangerx (Collaborator):
OK, if you've already done that digging I'm good to merge - we should probably mention that to DBCP to see if they want us to dig more or not.

As for more tests, I do think that testing add_county_fips_with_backup_geocoding is a good thing to do since it pulls a bunch of different functionality together and is a commonly used internal library function. I'm good with that being in this PR or as a follow-up PR.

@jdangerx jdangerx self-requested a review January 24, 2025 21:25
bendnorman (Contributor Author):
Good news! In the previous validation analysis, I compared county names instead of FIPS codes. It turns out Geocodio and Google geocode to the correct entity but use different spellings. For example, in the GS and LBNL data there were about a hundred locations in Virginia that geocode to independent cities: Google geocodes "City of Chesapeake" to "Chesapeake City" while Geocodio geocodes it to "Chesapeake". addfips returns the same FIPS code for both of these. The new results are much more acceptable.

| dataset | % of locations that don't have matching FIPS codes after addfips + geocoding | % of locations that needed geocoding that don't have matching FIPS codes |
| --- | --- | --- |
| offshore_wind | 3% | 3% |
| gridstatus | <1% | 4% |
| eip_infrastructure | <1% | 8% |
| columbia_local_opp | 4% | 10% |
| nrel_wind_solar_ordinances | <1% | <1% |
| lbnl_iso_queue | <1% | 7% |
