Use geocodio instead of google maps #394
base: dev
Conversation
…eocodio instead of components data
src/dbcp/transform/geocodio.py
Outdated
```python
locality_type = ad.accuracy_type
if locality_type == "place":
    locality_name = ad.address_components.city
    locality_type = "city"
elif locality_type == "county":
    locality_name = ad.address_components.county
else:
    locality_name = ""
results_df.append(
    [locality_name, locality_type, ad.address_components.county]
)
```
Might be cleaner to add this logic to the pydantic classes.
Yeah, `locality_name` could be a computed property, and then `results_df` could just be:
```python
results_df = pd.DataFrame(
    [
        {"geocoded_locality_name": ad.locality_name, ...}
        for res in results
        if "results" in res
        if (ad := AddressData.parse_obj(res["results"][0]))
    ]
)
```

or the for-loop equivalent, since that's a NASTY comprehension.
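A minimal sketch of the computed-property idea, using stdlib dataclasses in place of the real pydantic models (class and field names follow the diff above, but the models here are simplified stand-ins, not the PR's actual code):

```python
from dataclasses import dataclass, field


@dataclass
class AddressComponents:
    city: str = ""
    county: str = ""


@dataclass
class AddressData:
    accuracy_type: str = ""
    address_components: AddressComponents = field(default_factory=AddressComponents)

    @property
    def locality_type(self) -> str:
        # Geocodio's "place" accuracy type corresponds to a city
        return "city" if self.accuracy_type == "place" else self.accuracy_type

    @property
    def locality_name(self) -> str:
        if self.accuracy_type == "place":
            return self.address_components.city
        if self.accuracy_type == "county":
            return self.address_components.county
        return ""


ad = AddressData("place", AddressComponents(city="Chesapeake", county="Chesapeake city"))
print(ad.locality_name, ad.locality_type)  # Chesapeake city
```

With the branching logic on the model, the comprehension body collapses to attribute access.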
```python
def _geocode_row(
    ser: pd.Series, client: GoogleGeocoder, state_col="state", locality_col="county"
) -> List[str]:
    """Function to pass into pandas df.apply() to geocode state/locality pairs.

    Args:
        ser (pd.Series): a row of a larger dataframe to geocode
        client (GoogleGeocoder): client for Google Maps Platform API
        state_col (str, optional): name of the column of state names. Defaults to 'state'.
        locality_col (str, optional): name of the column of locality names. Defaults to 'county'.

    Returns:
        List[str]: geocoded_locality_name, geocoded_locality_type, and geocoded_containing_county
    """
    client.geocode_request(name=ser[locality_col], state=ser[state_col])
    return client.describe()
```
```python
@GEOCODER_CACHE.cache()
def _geocode_locality(
    state_locality_df: pd.DataFrame, state_col="state", locality_col="county"
) -> pd.DataFrame:
    """Use Google Maps Platform API to look up information about state/locality pairs in a dataframe.

    Args:
        state_locality_df (pd.DataFrame): dataframe with state and locality columns
        state_col (str, optional): name of the column of state names. Defaults to 'state'.
        locality_col (str, optional): name of the column of locality names. Defaults to 'county'.

    Returns:
        pd.DataFrame: new columns 'geocoded_locality_name', 'geocoded_locality_type', 'geocoded_containing_county'
    """
    # NOTE: the purpose of the cache decorator is primarily to
    # reduce API calls during development. A secondary benefit is to reduce
    # execution time due to slow synchronous requests.
    # That's why this is persisted to disk with joblib, not in memory with LRU_cache or something.
    # Because it is on disk, caching the higher level dataframe function causes less IO overhead
    # than caching individual API calls would.
    # Because the entire input dataframe must be identical to the cached version, I
    # recommend subsetting the dataframe to only state_col and locality_col when calling
    # this function. That allows other, unrelated columns to change but still use the geocode cache.
    geocoder = GoogleGeocoder()
    new_cols = state_locality_df.apply(
        _geocode_row,
        axis=1,
        result_type="expand",
        client=geocoder,
        state_col=state_col,
        locality_col=locality_col,
    )
    new_cols.columns = [
        "geocoded_locality_name",
        "geocoded_locality_type",
        "geocoded_containing_county",
    ]
    return new_cols
```
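The subsetting advice in that NOTE can be illustrated with a toy stand-in for the joblib-cached function (`fake_geocode_locality` and its dict cache are simplifications invented here, not the real joblib behavior): because the cache key covers the entire input dataframe, dropping unrelated columns before the call keeps the cache warm even as those columns change.

```python
import pandas as pd

calls = []


def fake_geocode_locality(state_locality_df: pd.DataFrame, _cache={}) -> pd.DataFrame:
    # Toy stand-in for the joblib-cached _geocode_locality: like the disk
    # cache, the key is derived from the *entire* input dataframe.
    key = tuple(map(tuple, state_locality_df.itertuples(index=True)))
    if key not in _cache:
        calls.append(key)  # pretend this is a slow batch of API calls
        _cache[key] = pd.DataFrame(
            {"geocoded_containing_county": state_locality_df["county"]},
            index=state_locality_df.index,
        )
    return _cache[key]


df = pd.DataFrame({"state": ["VA"], "county": ["Chesapeake"], "capacity_mw": [1.0]})
out1 = fake_geocode_locality(df[["state", "county"]])
df["capacity_mw"] = 99.0  # an unrelated column changes...
out2 = fake_geocode_locality(df[["state", "county"]])  # ...but the cache still hits
print(len(calls))  # 1 -> only one "API" pass despite two calls
```

Had the caller passed the full `df`, the second call would have missed the cache.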
I moved this logic from `dbcp.transform.helpers` because it is specific to the Google Maps API.
```diff
-    ser: pd.Series, client: GoogleGeocoder, state_col="state", locality_col="county"
-) -> List[str]:
-    """Function to pass into pandas df.apply() to geocode state/locality pairs.
+def _geocode_and_add_fips(
```
I moved some of the geocoding logic around so I could compare the Google and Geocodio results.
src/dbcp/transform/helpers.py
Outdated
```diff
@@ -332,7 +356,8 @@ def add_county_fips_with_backup_geocoding(
         with_fips["geocoded_locality_name"] = with_fips[locality_col]
         with_fips["geocoded_locality_type"] = "county"
         with_fips["geocoded_containing_county"] = with_fips[locality_col]
-    return with_fips
+    # attach to original df
```
This was a bug. We need to add the original df back to the fipsified dataframe.
src/dbcp/transform/helpers.py
Outdated
```python
if debug:
    google_df = _geocode_and_add_fips(
        nan_fips, state_col=state_col, locality_col=locality_col, api="google"
    )

nan_fips = pd.concat([nan_fips, geocoded], axis=1)
# add fips using geocoded names
filled_fips = add_fips_ids(
    nan_fips,
    state_col=state_col,
    county_col="geocoded_containing_county",
    vintage=FIPS_CODE_VINTAGE,
)
# combine the two geocoded dataframes
comp = geocodio_df.merge(
    google_df,
    left_index=True,
    right_index=True,
    how="outer",
    validate="1:1",
    suffixes=("_geocodio", "_google"),
)

county_eq = comp.geocoded_containing_county_geocodio.eq(
    comp.geocoded_containing_county_google
)
logger.info("---------------------")
logger.info(
    f"---- pct of geocoded fip failures that don't match: {(~county_eq).sum() / len(comp)}"
)
logger.info(
    f"---- pct of all records that don't have the same county: {(~county_eq).sum() / len(state_locality_df)}"
)
logger.info("---------------------")

filled_fips = geocodio_df
```
This is logic to compare Google and Geocodio. I figured we could remove the Google logic once the changes have settled.
Sweet, we're getting there! Some non-blocking suggestions about tightening up the type definitions etc.
It would also maybe be nice to add some tests of the higher-level functionality beyond just "does the API return what we expect?" (which is also a valuable test!)
I think it's fine to leave the Google code in, I would probably pull it out once we've satisfied ourselves with the comparisons.
I do have a blocking question about the validation tests - is this the right interpretation of the results?
- Out of all N locations in the gridstatus ETL, there were M locations that didn't have a FIPS code from `addfips`. Of those M, 24% had a different FIPS code when using Geocodio vs. Google. 2% of the N total locations had a mismatched FIPS code. Which also means that M is about 1/12 of N.
- Also, for the offshore wind data, that means that none of the original locations got their FIPS code from `addfips`, and 10% of the geocoded results are different between Geocodio and Google.
If that's all true, the diff does seem kind of high to me, but also causes a pretty small absolute change in codes. I would like to see, for instances where Google and Geocodio disagree, what the actual disagreement is - are they geocoding to counties that are right next to each other? Where is the actual lat/long, is it close to a county edge?
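One way to quantify "close to a county edge" is the great-circle distance between the two geocoders' points. A self-contained sketch (the coordinates below are illustrative, not from the actual validation results):

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Two hypothetical geocoder results ~0.1 degrees of latitude apart:
d = haversine_km(36.7682, -76.2875, 36.8682, -76.2875)
print(round(d, 1))  # 11.1 -> km; easily small enough to straddle a county line
```

Bucketing the disagreeing rows by this distance would show whether the mismatches are adjacent-county near-misses or genuinely different places.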
```python
from joblib import Memory
from pydantic import BaseModel

geocoder_local_cache = Path("/app/data/geocodio_cache")
```
This sort of path seems ripe for configuration with an env variable instead of hard-coding. e.g. I don't have `/app` on my computer.
True! We've been hard coding these paths because this ETL should only be run in a docker container which has a consistent file structure. I think it'd be wise to use an env var anyway! There are lots of locations in the code that reference `/app/data`, so I'll make this change in a separate PR.
src/dbcp/transform/geocodio.py
Outdated
```python
class Location(BaseModel):
    """Location from Geocodio."""

    lat: float = 0.0
```
Do we ever expect there to be default values here? Seems like we want to enforce that we always get a lat and a long, and never just fill a missing value with 0...
Good point! Removed the default value for these attributes.
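For reference, removing the defaults makes pydantic fail loudly instead of silently filling zeros. A minimal sketch (simplified `Location` with just two required fields; works under pydantic v1 or v2):

```python
from pydantic import BaseModel, ValidationError


class Location(BaseModel):
    """Location from Geocodio; no defaults, so missing coordinates raise."""

    lat: float
    lng: float


try:
    Location(lat=36.77)  # lng missing -> ValidationError
    raised = False
except ValidationError:
    raised = True
print(raised)  # True
```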
src/dbcp/transform/geocodio.py
Outdated
```python
class AddressComponents(BaseModel):
    """Address components from Geocodio."""

    number: str = ""
```
Are there any of these fields we would always expect? Country maybe?
I don't think the docs say anything about values that will always be present.
src/dbcp/transform/geocodio.py
Outdated
```python
GEOCODIO_API_KEY = os.environ["GEOCODIO_API_KEY"]
client = GeocodioClient(GEOCODIO_API_KEY)

geocoded_df = []
```
nit: I find it confusing to have variables that aren't `pd.DataFrame` that end in `_df`. This happens in the batch munging code above too.
```python
logger = getLogger("__name__")

geocoder_local_cache = Path("/app/data/google_geocoder_cache")
```
Same comments in this block as before - this hardcoded path seems ripe for env vars, and 2**19 B is not 100 KB.
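For the record, the arithmetic behind that size nit:

```python
print(2**19)          # 524288 -> 2**19 bytes
print(2**19 // 1024)  # 512 -> i.e. 512 KiB, not 100 KB
print(100 * 1000)     # 100000 -> what 100 KB actually is, in bytes
```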
```python
    Args:
        nan_fips: dataframe with state and locality columns
```
nan_fips is a dataframe only containing the state/locality combos that don't have FIPS from `addfips`, right?
```python
        state_col (str, optional): name of the column of state names. Defaults to 'state'.
        locality_col (str, optional): name of the column of locality names. Defaults to 'county'.
    # recombine deduped geocoded data with original nan_fips
    geocoded_deduped_nan_fips = pd.concat(
```
Is the reason we put the empty rows in while geocoding so that this concat can still work?
Yes it's so we can preserve the index. Do you think it'd be cleaner to not add empty rows?
Yeah, I think so, and intuitively seems like it would make the joining a little simpler. But it's not a blocking concern I think - refactoring would require (a) adding tests for this behavior and then (b) figuring out what sequence of joins etc. works for this use case, so it's not like it's a 5-minute dealio. Might be a good follow-up PR though.
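A sketch of what that follow-up refactor might look like: geocode only the rows that resolved, and let a left join restore the failures as NaN instead of pre-padding empty rows (toy data; the real function's columns and joins may differ):

```python
import pandas as pd

# Two state/locality rows that addfips couldn't resolve; only the first geocodes.
nan_fips = pd.DataFrame({"state": ["VA", "ZZ"], "county": ["Chesapeake", "Nowhere"]})
geocoded = pd.DataFrame(
    {"geocoded_containing_county": ["Chesapeake City"]},
    index=[0],  # only the successful row, original index preserved
)
# Left join on the index brings the failed row back as NaN.
recombined = nan_fips.join(geocoded, how="left")
print(recombined["geocoded_containing_county"].tolist())  # ['Chesapeake City', nan]
```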
test/unit/test_geocoding.py
Outdated
Should we also test `add_county_fips_with_backup_geocoding` to make sure that we're combining the addfips results with geocoding results properly?
Your understanding of the metrics is correct. I can do some more digging as to why Google and Geocodio produce different results, but I couldn't find any consistent patterns. What were you thinking for additional tests? Here are some ideas:
OK, if you've already done that digging I'm good to merge - we should probably mention that to DBCP to see if they want us to dig more or not. As for more tests, I do think that testing
… duplicate code in the function
Good news! In the previous validation analysis, I compared county names instead of FIPS codes. Turns out Geocodio and Google are geocoding to the correct entity but they use different spellings. For example, in the GS and LBNL data there were about a hundred locations in Virginia that geocode to independent cities. Google geocodes "City of Chesapeake" to "Chesapeake City" and Geocodio geocodes it to "Chesapeake". `addfips` returns the same FIPS code for both of these. The new results are much more acceptable.
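A toy illustration of the revised check: comparing FIPS codes instead of name spellings treats "Chesapeake" and "Chesapeake City" as a match. The rows and column names below are fabricated for illustration (51550 is the FIPS code for Chesapeake city, VA):

```python
import pandas as pd

comp = pd.DataFrame(
    {
        "county_geocodio": ["Chesapeake", "Fairfax"],
        "county_google": ["Chesapeake City", "Fairfax"],
        "county_id_fips_geocodio": ["51550", "51059"],
        "county_id_fips_google": ["51550", "51059"],
    }
)
name_mismatch = comp["county_geocodio"].ne(comp["county_google"])
fips_mismatch = comp["county_id_fips_geocodio"].ne(comp["county_id_fips_google"])
print(name_mismatch.sum(), fips_mismatch.sum())  # 1 0
```

The name comparison flags a spurious disagreement that the FIPS comparison correctly ignores.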
This PR replaces the Google Maps geocoder with the Geocodio API. Geocodio is cheaper and allows us to cache API calls.
Validation checks

- `counties_wide_format` have the same `renewable_and_battery_proposed_capacity_mw` when geocoded with Google and Geocodio.

These all feel like acceptable differences to me. Typically, a small percentage of locations need to be geocoded because `addFips` does most of the work. The offshore wind locations might be worth looking into but we're about to update this dataset so I figured we could do it then.

Questions