About 100 duplicate `(project_id, county_id_fips)` entries are produced in the `gridstatus_locations` table due to non-standard formatting of raw place names. These are not whole-row duplicates. They occur for several reasons:
- The raw county name contains a city, county pair like `Roswell, Chaves County`. The processing code treats delimiters as separating two distinct locations, so a city, county entry gets erroneously split in two. Usually both pieces get geocoded back to the same county FIPS code, but this is not always the case for degenerate place names.
- The raw county name contains a delimited list of cities that are in the same county. These get split and geocoded to the same FIPS code.
- The raw county name contains two versions of the same county name (`'Pointe Coupee, Pointe Coupee Parish'`). All the current instances of this are in Louisiana for some reason.
- A few projects seem to cross the NY/NJ state line, but have the raw county name `'NJ, NY'` with the raw state `'NY'`, so they all get mapped to New York County, NY.
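The first case can be reproduced with a minimal sketch. Everything here is hypothetical: `TOY_GEOCODER` and `split_and_geocode` are stand-ins invented for illustration, not the actual processing code, and the comma is assumed to be the delimiter.

```python
import pandas as pd

# Toy stand-in for the geocoder (an assumption, not the real pipeline).
# Both the city and the county resolve to the same county FIPS code.
TOY_GEOCODER = {
    ("NM", "Roswell"): "35005",
    ("NM", "Chaves County"): "35005",
}


def split_and_geocode(project_id, raw_state, raw_county):
    """Split a raw county string on commas and geocode each piece separately."""
    pieces = [piece.strip() for piece in raw_county.split(",")]
    return [
        {
            "project_id": project_id,
            "raw_piece": piece,
            "county_id_fips": TOY_GEOCODER.get((raw_state, piece)),
        }
        for piece in pieces
    ]


locations = pd.DataFrame(split_and_geocode(1, "NM", "Roswell, Chaves County"))

# The rows differ in raw_piece, so they are not whole-row duplicates,
# but they share the same (project_id, county_id_fips) key.
key = ["project_id", "county_id_fips"]
is_duplicate_key = locations.duplicated(subset=key, keep=False)

# One possible downstream fix: keep a single row per key.
deduped = locations.drop_duplicates(subset=key)
```

Note that dropping rows like this would also drop any capacity allocated to the removed row, so it is only safe for counting projects, not for summing MW.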
The impact of this duplication is fairly minor. Thanks to capacity allocation, the total MW are unchanged. However, the duplicate `county_id_fips` values double count the number of projects within a county in the wide-format data mart table. I think either the duplicates should be removed in downstream queries, or the aggregation function in `dbcp/data_mart/counties.py:407` needs to be changed from `"project_id": "count"` to `"project_id": "nunique"`.
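A small sketch of the difference between the two aggregations, using made-up data rather than the real table. The `capacity_mw` column name and the even 50/50 split of project 1's capacity are assumptions for illustration.

```python
import pandas as pd

# Project 1 appears twice in the same county, with its capacity
# split across the two rows. Project 2 appears once.
locations = pd.DataFrame(
    {
        "project_id": [1, 1, 2],
        "county_id_fips": ["35005", "35005", "35005"],
        "capacity_mw": [50.0, 50.0, 30.0],
    }
)

grouped = locations.groupby("county_id_fips")

# "count" counts rows, so the duplicated project is counted twice.
with_count = grouped.agg({"project_id": "count", "capacity_mw": "sum"})

# "nunique" counts distinct project IDs, so each project is counted once.
with_nunique = grouped.agg({"project_id": "nunique", "capacity_mw": "sum"})
```

In this toy data, `count` reports 3 projects while `nunique` reports 2, and the summed capacity is 130 MW either way.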