Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve duplicate queue projects due to delimiters in raw county names #335

Open
TrentonBush opened this issue May 20, 2024 · 1 comment

Comments

@TrentonBush
Copy link
Collaborator

TrentonBush commented May 20, 2024

About 100 duplicate (project_id, county_id_fips) entries are produced in the gridstatus_locations table due to non-standard formatting of raw place names. These are not whole row duplicates. They occur for a couple of reasons:

  • the raw county name contains a city, county pair like Roswell, Chaves County. The processing code is designed to treat delimiters as separating two separate locations, so a city, county entry gets erroneously split in two. Usually both pieces get geocoded back to the same county FIPS code, but this is not always the case for degenerate place names.
  • the raw county name contains a delimited list of cities that are in the same county. These get split and geocoded to the same FIPS code
  • the raw county name contains two versions of the same county name ('Pointe Coupee, Pointe Coupee Parish'). All the current instances of this are in Louisiana for some reason.
  • a few projects seem to cross the NY/NJ state line, but have raw county names 'NJ, NY' with the raw state 'NY'. So they all get mapped to New York County, NY.

The impact of this duplication is fairly minor. Thanks to capacity allocation, the total MW are unchanged. But the duplicate county_id_fips will double count the number of projects within a county in the wide format data mart table. I think either the duplicates should be removed in downstream queries or the agg func in dbcp/data_mart/counties.py:407 needs to be changed from "project_id": "count" to "project_id": "nunique"

@TrentonBush
Copy link
Collaborator Author

duplicate_locations = geocoded_locations[
        geocoded_locations[["county_id_fips", "project_id"]].duplicated(keep=False)
    ]
duplicate_locations.groupby(['project_id', 'county_id_fips'])['geocoded_locality_type'].agg(lambda x: set(x)).value_counts()

outputs:

{county, city}    17
{county}          10
{city}             5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant