(Updated) Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations #45459

paleolimbot · 2025-02-07T04:58:40Z

Rationale for this change

The GEOMETRY and GEOGRAPHY logical types are being proposed as an addition to the Parquet format.

What changes are included in this PR?

This is a continuation of @Kontinuation's initial PR (#43977), which included:

Added geometry logical types (printing, serialization, deserialization)
Added geometry column statistics (serialization, deserialization, writing)
Added support for reading/writing Parquet files containing geometry columns

Changes after this included:

Rebased on the latest apache/arrow
Split the geography and geometry types
Synchronized the final parameter names (e.g., no more "encoding", "edges" -> "algorithm")

I think we still need a few more changes before this can be merged (pending the format change vote):

Update the bounding box logic to implement the "wraparound" bounding boxes where xmin > xmax (and generally make sure the stats for geography are implemented for trivial cases)
Handle propagation of the parameters to Arrow (I think we can do this via GeoArrow if that's desired even without a canonical extension type)
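
To illustrate the "wraparound" case mentioned above, here is a minimal sketch of how a reader might interpret an x interval whose minimum is greater than its maximum (a box that crosses the antimeridian). The function names are hypothetical and are not part of this PR or of the pyarrow API; the sketch assumes longitudes in degrees and only covers the x dimension.

```python
def x_interval_contains(xmin, xmax, x):
    """Return True if longitude x falls inside the interval, allowing wraparound."""
    if xmin <= xmax:
        # Ordinary interval
        return xmin <= x <= xmax
    # Wraparound interval: covers everything east of xmin or west of xmax
    return x >= xmin or x <= xmax


def x_intervals_intersect(a, b):
    """Return True if two (xmin, xmax) intervals overlap, allowing wraparound."""
    a_min, a_max = a
    b_min, b_max = b
    a_wraps = a_min > a_max
    b_wraps = b_min > b_max

    if not a_wraps and not b_wraps:
        return a_min <= b_max and b_min <= a_max
    if a_wraps and b_wraps:
        # Both intervals contain the antimeridian, so they always overlap
        return True
    if a_wraps:
        # b is ordinary: it overlaps if it reaches either piece of a
        return b_max >= a_min or b_min <= a_max
    # b wraps and a is ordinary
    return a_max >= b_min or a_min <= b_max


# A box from 170 degrees east across the antimeridian to 170 degrees west
print(x_interval_contains(170, -170, 179))            # True
print(x_interval_contains(170, -170, 0))              # False
print(x_intervals_intersect((170, -170), (0, 10)))    # False
```

A filter that ignores the xmin > xmax case would treat such a box as empty and could wrongly skip row groups that contain matching features.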

Are these changes tested?

Yes!

Are there any user-facing changes?

Yes!

Example from the included Python bindings:

import pyarrow as pa
from pyarrow import parquet
import geoarrow.pyarrow as ga  # For registering the extension type
import geopandas

path = "/Users/dewey/gh/parquet-testing/data/geospatial/example-crs_vermont-4326.parquet"
file = parquet.ParquetFile(path)
file.schema
#> <pyarrow._parquet.ParquetSchema object at 0x1136ee600>
#> required group field_id=-1 schema {
#>   optional binary field_id=-1 geometry (Geometry(crs=));
#> }
file.metadata.metadata
#> (eventually should contain any CRSes that were dumped there)
geometry_index = len(file.schema.names) - 1
file.metadata.row_group(0).column(geometry_index).geospatial_statistics
#> <pyarrow._parquet.GeospatialStatistics object at 0x117b07f40>
#>   geospatial_types: [3]
#>   xmin: -73.4296726142165
#>   xmax: -71.50351111518535
#>   ymin: 42.72708222103286
#>   ymax: 45.00831248634144
#>   zmin: None
#>   zmax: None
#>   mmin: None
#>   mmax: None

# Type and CRS should propagate through
file.schema_arrow.field("geometry").type
#> WkbType(geoarrow.wkb <OGC:CRS84>)

# GeoPandas should be able to take the result of this and ensure
# the CRS is not lost (and that the geometry column is picked up)
df = geopandas.GeoDataFrame.from_arrow(file.read())
df.geometry.crs.name
#> 'WGS 84 (CRS84)'
df.geometry.head(5)
#> 0    POLYGON ((-72.45707 42.72708, -73.28203 42.743...
#> Name: geometry, dtype: geometry
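
The `geospatial_types: [3]` entry in the statistics output above is an integer code. As a rough sketch, assuming these codes follow the ISO WKB convention (base geometry type plus 1000 for Z, 2000 for M, 3000 for ZM), they could be decoded like this. The helper name `describe_geospatial_type` is hypothetical and is not part of pyarrow; check the format specification before relying on the mapping.

```python
# Assumed mapping of ISO WKB base geometry type codes to names
_BASE_TYPES = {
    1: "Point",
    2: "LineString",
    3: "Polygon",
    4: "MultiPoint",
    5: "MultiLineString",
    6: "MultiPolygon",
    7: "GeometryCollection",
}

# Assumed mapping of the thousands digit to coordinate dimensions
_DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def describe_geospatial_type(code):
    """Turn an integer geometry type code into a readable label."""
    dim, base = divmod(code, 1000)
    if base not in _BASE_TYPES or dim not in _DIMENSIONS:
        raise ValueError(f"Unrecognized geometry type code: {code}")
    return f"{_BASE_TYPES[base]} {_DIMENSIONS[dim]}"


print(describe_geospatial_type(3))     # Polygon XY
print(describe_geospatial_type(1001))  # Point XYZ
```

Under that assumption, the `[3]` above means the column contains only XY polygons.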

Co-authored-by: Gang Wu <[email protected]>

github-actions · 2025-02-07T04:59:04Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}
