
GTC-3081: Add political/id-lookup endpoint #616

Merged Jan 29, 2025 (48 commits; changes shown from 26 commits)

Commits
b94f3fa
Initial attempt at stub route for geoencoder
dmannarino Dec 17, 2024
505cc16
Turn URL params into query params; add boundary version, source;
dmannarino Dec 18, 2024
0514d42
Initial test, not expected to pass
dmannarino Dec 18, 2024
9d2ae13
Disable geoencoder tests to get it to deploy
dmannarino Dec 19, 2024
44a253f
WIP: Run a query instead of canned response
dmannarino Dec 19, 2024
9c2e0ca
WIP: refactor to make better use of helper fcns
dmannarino Dec 20, 2024
899e772
More error handling and tests
dmannarino Dec 20, 2024
d633596
Correct region/subregion->name fields
dmannarino Dec 21, 2024
7f5e9f3
Correct case of multiple WHEREs
dmannarino Dec 21, 2024
2f2facd
Update pre-commit packages, disable docformatter until it's fixed
dmannarino Dec 21, 2024
15de81a
Too much, TBH: Add limiting query to specified admin level; enforce s…
dmannarino Dec 23, 2024
40f7772
pipenv finally updated packages; add unidecode for geoencode endpoint
dmannarino Dec 26, 2024
2cee550
Optionally unaccent names in request to geoencode endpoint
dmannarino Dec 26, 2024
e07c4f4
Update lockfile for new raterio/numpy
dmannarino Dec 26, 2024
bb69f18
Don't pass Nones to unidecode
dmannarino Dec 26, 2024
1b95ca2
Actually search the unaccented columns
dmannarino Dec 27, 2024
79ae7c3
Add output example from ticket as a test, and adjust code to pass
dmannarino Dec 27, 2024
09e628e
Get regular fields, not unaccented ones
dmannarino Dec 27, 2024
2934fd6
Fix bug introduced in last commit: GET name fields, MATCH on (potenti…
dmannarino Dec 27, 2024
e09cf01
Add a test for getting from unaccented fields
dmannarino Dec 27, 2024
2d979a2
Hide extraneous fields
dmannarino Dec 27, 2024
0c6d541
Add docstrings, add a test, and slightly improve error message on non…
dmannarino Dec 27, 2024
ed5f2cd
Decapitalize as part of normalization; add tests
dmannarino Dec 27, 2024
34c41f8
Return GIDs as relative, not complete values. eg: GID_1=1 instead of …
dmannarino Dec 28, 2024
1952fc0
Minor doc addition
dmannarino Dec 28, 2024
c6384fd
Merge branch 'develop' into gtc-3081_geoencoder_endpoint
dmannarino Dec 28, 2024
4aef63c
Merge branch 'develop' into gtc-3081_geoencoder_endpoint
dmannarino Jan 7, 2025
0f80b9e
WIP: Move geoencode query params into a model in order to implement a…
dmannarino Jan 12, 2025
b23bf7f
Fix resolving version to string
dmannarino Jan 12, 2025
275ff6e
WIP: Temporarily include geoencode route in docs
dmannarino Jan 12, 2025
53ddba5
Fix for last commit: PRepend 'v' to version string, again.
dmannarino Jan 13, 2025
3bc92f5
WIP: Add lookup_admin_source_version helper, duplicating some code
dmannarino Jan 13, 2025
fbacd70
Raise ValueErrors instead of AssertionErrors on bad params
dmannarino Jan 13, 2025
9c5fc87
After much pain and gnashing of teeth, get validator working again
dmannarino Jan 14, 2025
849d68a
Add models for Geoencoder responses and children
dmannarino Jan 14, 2025
4668790
Use dolar sign quoting to avoid PostgreSQL complaining about apostrop…
dmannarino Jan 14, 2025
68f6590
Add type hint per Dan's suggestion
dmannarino Jan 21, 2025
d35bb02
Re-enable docformatter precommit hook
dmannarino Jan 24, 2025
0bdb3e5
Merge branch 'develop' into gtc-3081_geoencoder_endpoint
dmannarino Jan 24, 2025
a9cf2df
Improve error messages
dmannarino Jan 24, 2025
83aa0a0
Move geoencoder to /political/geoencoder
dmannarino Jan 24, 2025
0effc5b
Break forming Geoencoder response out into a helper
dmannarino Jan 24, 2025
84bf869
Rename geoencoder endpoint to id-lookup
dmannarino Jan 24, 2025
af3c244
Set version of GADM 4.1 in various environments
dmannarino Jan 27, 2025
498ecc6
Implement Gary's suggestions to rename from geoencoder -> admin id lo…
dmannarino Jan 27, 2025
d9e0cda
Use AdminIDLookupResponseData properly and add a few type hints
dmannarino Jan 28, 2025
55e1a2f
Merge branch 'develop' into gtc-3081_geoencoder_endpoint
dmannarino Jan 29, 2025
ffe9b6b
Use this branch's Pipfile, I need unidecode
dmannarino Jan 29, 2025
2 changes: 1 addition & 1 deletion .isort.cfg
@@ -2,4 +2,4 @@
line_length = 88
multi_line_output = 3
include_trailing_comma = True
known_third_party = _pytest,aenum,affine,alembic,asgi_lifespan,async_lru,asyncpg,aws_utils,boto3,botocore,click,docker,ee,errors,fastapi,fiona,gdal_utils,geoalchemy2,geojson,gfw_pixetl,gino,gino_starlette,google,httpx,httpx_auth,logger,logging_utils,moto,numpy,orjson,osgeo,pandas,pendulum,pglast,psutil,psycopg2,pydantic,pyproj,pytest,pytest_asyncio,rasterio,shapely,sqlalchemy,sqlalchemy_utils,starlette,tileputty,tiles_geojson,typer
known_third_party = _pytest,aenum,affine,alembic,asgi_lifespan,async_lru,asyncpg,aws_utils,boto3,botocore,click,docker,ee,errors,fastapi,fiona,gdal_utils,geoalchemy2,geojson,gfw_pixetl,gino,gino_starlette,google,httpx,httpx_auth,logger,logging_utils,moto,numpy,orjson,osgeo,pandas,pendulum,pglast,psutil,psycopg2,pydantic,pyproj,pytest,pytest_asyncio,rasterio,shapely,sqlalchemy,sqlalchemy_utils,starlette,tileputty,tiles_geojson,typer,unidecode
23 changes: 13 additions & 10 deletions .pre-commit-config.yaml
@@ -9,32 +9,35 @@ repos:
rev: v5.10.1
hooks:
- id: isort
- repo: https://github.com/myint/docformatter
rev: v1.4
hooks:
- id: docformatter
args: [--in-place]
# Disable docformatter until it is updated to work with newer pre-commit
# The fix is committed, but no new release has been made yet. See
# https://github.com/PyCQA/docformatter/commit/06907d0267368b49b9180eed423fae5697c1e909
#- repo: https://github.com/myint/docformatter
# rev: v1.7.6
# hooks:
# - id: docformatter
# args: [--in-place]
- repo: https://github.com/ambv/black
rev: 22.12.0
rev: 24.10.0
hooks:
- id: black
language_version: python3.10
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.3.0
rev: v5.0.0
hooks:
- id: detect-aws-credentials
- id: detect-private-key
- id: trailing-whitespace
- repo: https://github.com/pycqa/flake8
rev: 6.0.0
rev: 7.1.1
hooks:
- id: flake8
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.971
rev: v1.14.0
hooks:
- id: mypy
- repo: https://github.com/Yelp/detect-secrets
rev: v1.3.0
rev: v1.5.0
hooks:
- id: detect-secrets
args: ['--baseline', '.secrets.baseline'] # run: `pip install detect-secrets` to establish baseline
2 changes: 1 addition & 1 deletion .secrets.baseline
@@ -82,7 +82,7 @@
}
]
},
"version": "1.3.0",
"version": "1.5.0",
"filters_used": [
{
"path": "detect_secrets.filters.allowlist.is_line_allowlisted"
1 change: 1 addition & 0 deletions Pipfile
@@ -60,6 +60,7 @@ sqlalchemy = "<1.4"
sqlalchemy-utils = "*"
starlette = "*"
typer = "*"
unidecode = "*"
uvicorn = {version = "*", extras = ["standard"]}

[requires]
4,349 changes: 2,211 additions & 2,138 deletions Pipfile.lock

Large diffs are not rendered by default.

8 changes: 8 additions & 0 deletions app/main.py
@@ -20,6 +20,7 @@
from .routes.analysis import analysis
from .routes.assets import asset, assets
from .routes.authentication import authentication
from .routes.thematic import geoencoder
from .routes.datasets import asset as version_asset
from .routes.datasets import (
dataset,
@@ -128,6 +129,13 @@ async def rve_error_handler(
app.include_router(r, prefix="/dataset")


################
# THEMATIC API #
################

app.include_router(geoencoder.router, prefix="/thematic")


###############
# ASSET API
###############
2 changes: 2 additions & 0 deletions app/routes/thematic/__init__.py
@@ -0,0 +1,2 @@
"""Here live a number of endpoints which provide higher-level services
mostly intended to make life easier for consumers of the Data API."""
216 changes: 216 additions & 0 deletions app/routes/thematic/geoencoder.py
@@ -0,0 +1,216 @@
import re

Member: @dmannarino I know you said you didn't like "thematic". I was thinking of other ideas: since these are admin-area-specific endpoints, maybe we just put it under something like "political"? E.g. /political/geoencoder. Then the future can include additional GADM endpoints, WDPA, concessions, etc. Not sure if we need to distinguish it from the rest of the API, or we can just have it all under a header in the docs.

Member: But I do think "geoencoder" is a little vague if it only works for admin areas; geocoding implies converting any place text into coordinates: https://en.wikipedia.org/wiki/Address_geocoding

Member Author: Changed to /political/id-lookup

from typing import Any, Dict, List, Optional

from fastapi import APIRouter, HTTPException, Query
from unidecode import unidecode

from app.crud.versions import get_version, get_version_names
from app.errors import RecordNotFoundError
from app.models.pydantic.responses import Response
from app.routes import VERSION_REGEX
from app.routes.datasets.queries import _query_dataset_json

router = APIRouter()


@router.get(
"/geoencode",
tags=["Geoencoder"],
status_code=200,
)
async def geoencode(
*,
admin_source: str = Query(
"GADM", description="The source of administrative boundaries to use."

Member: Should we document that right now the only option is "GADM"?

Member Author: Done!

),
admin_version: str = Query(
...,
description="Version of the administrative boundaries dataset to use.",

Member: Similarly, is there a way to document the choices available? I guess this may get more confusing if we ever have multiple providers.

Member: Wild idea: should we consolidate admin version and admin dataset into one field, and have the options be like "GADM 3.6", "GADM 4.1", "geoBoundaries 1.0", "middleEarth 3.2"? Then it'll be clear what you're getting from a set of options.

Contributor: Combining the provider and version has merit ("middleEarth 3.2"); it has excellent cohesion, and this is such a simple scenario that it's hard to justify breaking it up. However, we'd be encoding two pieces of information into one string (i.e., provider, space, version). It's generally good practice to split information up even though it makes for a bit more complexity; it's more flexible in the long run (although I'm starting to violate YAGNI with this statement). Finally, we are splitting this information up in the area/geostore microservice interface for all of the reasons mentioned above. That's my two cents.

),
country: str = Query(
description="Name of the country to match.",
),
region: Optional[str] = Query(
None,
description="Name of the region to match.",
),
subregion: Optional[str] = Query(
None,
description="Name of the subregion to match.",
),
normalize_search: bool = Query(
True,
description="Whether or not to perform a case- and accent-insensitive search.",

Collaborator: Add "Default is to perform case- and accent-insensitive search".

Member Author: Do I need to, even with the default argument set to True?

Collaborator: OK, no change needed; I see now that the default value (if not None) is displayed in the documentation.

),
):
"""Look up administrative boundary IDs matching a specified country name
(and region name and subregion names, if specified).
"""
admin_source_to_dataset: Dict[str, str] = {"GADM": "gadm_administrative_boundaries"}

try:
dataset: str = admin_source_to_dataset[admin_source.upper()]
except KeyError:
raise HTTPException(
status_code=400,
detail=(
"Invalid admin boundary source. Valid sources:"
f" {[source for source in admin_source_to_dataset.keys()]}"
),
)

version_str: str = "v" + str(admin_version).lstrip("v")

await version_is_valid(dataset, version_str)

Member: As mentioned above about the documentation, should the versions we support per provider be well known in advance, rather than just trying and throwing back an error?

Member Author: I fundamentally agree that your suggestion is a good one, but it turns out to be difficult to do in practice, especially considering the limited use of this endpoint and the fact that we will be omitting it from the docs.


names: List[str | None] = sanitize_names(
normalize_search, country, region, subregion
)

adm_level: int = determine_admin_level(*names)

sql: str = _admin_boundary_lookup_sql(
adm_level, normalize_search, admin_source, *names
)

json_data: List[Dict[str, Any]] = await _query_dataset_json(
dataset, version_str, sql, None
)

return Response(
data={

Member: This might be cleaner as a pydantic model instead of a dict.

Member Author: Made a model, and a helper function to construct the response in hopefully a clear way.

"adminSource": admin_source,
"adminVersion": admin_version,
"matches": [
{
"country": {
"id": match["gid_0"].rsplit("_")[0],

Member: A function for getting the ID here would be easier to read.

Member Author: Done!

"name": match["country"],
},
"region": {
"id": (
(match["gid_1"].rsplit("_")[0]).split(".")[1]
if adm_level >= 1

Member: Nitpick: I personally find ternary operators declared inside dicts confusing to read.

Member Author: I think they're all gone!

else None
),
"name": match["name_1"] if adm_level >= 1 else None,
},
"subregion": {
"id": (
(match["gid_2"].rsplit("_")[0]).split(".")[2]
if adm_level >= 2
else None
),
"name": match["name_2"] if adm_level >= 2 else None,
},
}
for match in json_data

Member: Same with the nested list comprehensions as above.

Member Author: Banished!

],
}
)


def sanitize_names(
normalize_search: bool,
country: str | None,
region: str | None,
subregion: str | None,
) -> List[str | None]:
"""Turn any empty strings into Nones, enforces the admin level hierarchy,
and optionally unaccents and decapitalizes names.
"""
names = []

Collaborator: The type checker is happier if you change this to: names: List[str | None] = []

Member Author: Done, thanks!


if subregion and not region:
raise HTTPException(
status_code=400,
detail="If subregion is specified, region must be specified as well.",
)

for name in (country, region, subregion):
if name and normalize_search:
names.append(unidecode(name).lower())
elif name:
names.append(name)
else:
names.append(None)
return names
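
The normalization above relies on the third-party unidecode package. As a standalone sketch (not the PR's code), the same case- and accent-insensitive behavior can be approximated with only the standard library:

```python
import unicodedata

def normalize(name: str) -> str:
    # Approximate unidecode: NFKD-decompose, drop combining marks,
    # then lowercase, mirroring the endpoint's normalize_search step.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

print(normalize("Côte d'Ivoire"))  # cote d'ivoire
print(normalize("São Paulo"))      # sao paulo
```

Note that unidecode handles a much wider range of scripts; this NFKD approximation only covers characters that decompose into a base letter plus combining marks.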


def determine_admin_level(
country: str | None, region: str | None, subregion: str | None
) -> int:
"""Infer the native admin level of a request based on the presence of
non-empty fields
"""
if subregion:
return 2
elif region:
return 1
elif country:
return 0
else: # Shouldn't get here if FastAPI route definition worked
raise HTTPException(status_code=400, detail="Country MUST be specified.")
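
Restated outside the route (a standalone copy for illustration, raising ValueError instead of HTTPException), the level inference simply takes the deepest non-empty field:

```python
from typing import Optional

def determine_admin_level(
    country: Optional[str], region: Optional[str], subregion: Optional[str]
) -> int:
    # Deepest non-empty field wins: subregion -> 2, region -> 1, country -> 0
    if subregion:
        return 2
    elif region:
        return 1
    elif country:
        return 0
    raise ValueError("Country must be specified.")

print(determine_admin_level("france", "normandie", None))  # 1
print(determine_admin_level("france", None, None))         # 0
```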


def _admin_boundary_lookup_sql(
adm_level: int,
normalize_search: bool,
dataset: str,
country_name: str,
region_name: str | None,
subregion_name: str | None,
) -> str:
"""Generate the SQL required to look up administrative boundary
IDs by name.
"""
name_fields: List[str] = ["country", "name_1", "name_2"]
if normalize_search:
match_name_fields = [name_field + "_normalized" for name_field in name_fields]
else:
match_name_fields = name_fields

sql = (
f"SELECT gid_0, gid_1, gid_2, {name_fields[0]}, {name_fields[1]}, {name_fields[2]}"
f" FROM {dataset} WHERE {match_name_fields[0]}='{country_name}'"
)
if region_name is not None:
sql += f" AND {match_name_fields[1]}='{region_name}'"
if subregion_name is not None:
sql += f" AND {match_name_fields[2]}='{subregion_name}'"

sql += f" AND adm_level='{adm_level}'"

return sql


async def version_is_valid(
dataset: str,
version: str,
) -> None:
"""Validate a version string for a given dataset."""
# Note: At some point I intend to change the version validator to
# use messaging like this. However, that is out of scope for the
# current ticket, and I want to see what people think, so I'll
# keep it here for now.
if re.match(VERSION_REGEX, version) is None:

Member: Per above, maybe just have pre-validated versions rather than checking the version.

Member Author: Now specified on a per-env basis!

raise HTTPException(
status_code=400,
detail=(
"Invalid version name. Version names begin with a 'v' and "
"consist of one to three integers separated by periods. "
"eg. 'v1', 'v7.1', 'v4.1.0', 'v20240801'"
),
)

try:
_ = await get_version(dataset, version)
except RecordNotFoundError:
raise HTTPException(
status_code=400,
detail=(
"Version not found. Existing versions for this dataset "
f"include {[v[0] for v in await get_version_names(dataset)]}"
# FIXME: Maybe change get_version_names to do unpacking? ^
),
)
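
VERSION_REGEX itself is imported from app.routes and not shown in this diff. A hypothetical pattern consistent with the error message's wording (a leading 'v' followed by one to three period-separated integers) would be:

```python
import re

# Hypothetical pattern for illustration only; the real VERSION_REGEX
# lives in app.routes and may differ.
VERSION_REGEX = r"^v\d+(\.\d+){0,2}$"

for candidate in ("v1", "v7.1", "v4.1.0", "v20240801", "4.1", "v1.2.3.4"):
    print(candidate, bool(re.match(VERSION_REGEX, candidate)))
```

The first four candidates match (including 'v20240801', since a date-like integer is still one integer); '4.1' fails for the missing 'v', and 'v1.2.3.4' fails for having four components.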
Empty file.
Empty file.