Explore GeoParquet file format as an input to staging #40

Open
julietcohen opened this issue Feb 7, 2024 · 2 comments
Labels
good first issue Good for newcomers

Comments

@julietcohen
Collaborator

julietcohen commented Feb 7, 2024

Apache Parquet is a columnar file format often described as a modern alternative to CSV files, and GeoParquet adds interoperable geospatial types (Point, Line, Polygon) to Parquet (source). Initial exploration is needed to determine if and how we can stage vector data in GeoParquet format. The format should be well suited to processing large quantities of data, since it is more efficient for analytical use cases.

Suggested by Ingmar Nitze. A good initial step would be to either find a small GeoParquet file or receive one from Ingmar. This should be uploaded to /var/data/submission/pdg/...

@julietcohen
Collaborator Author

Ingmar provided 2 parquet files for data in adjacent UTM zones 32617 & 32618. These have been uploaded to a new directory: /var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/
This directory also contains 2 geopackage files of the same data.

@julietcohen
Collaborator Author

In order to import a parquet file:

  • we can use geopandas.read_parquet(), which imports the file as a GeoDataFrame (it identifies the geometry column without the user needing to specify it)
  • the pyarrow package needs to be installed
  • I had also installed the geoparquet package into my Python env, but we can test whether that is necessary

import geopandas as gpd
import geoparquet  # installed in my env; may not be required (to be tested)
import pyarrow  # Parquet engine that geopandas.read_parquet() relies on

# Returns a GeoDataFrame; the geometry column is detected automatically
data = gpd.read_parquet("/var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/32617_river.parquet")
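
To follow up on which packages are actually required, a small check like the one below could report what is importable in a given environment. This is only a sketch: the helper name `installed` is made up for illustration, and it does not prove that a package is needed. The real test would be to run the read in an environment without geoparquet and see whether it still succeeds.

```python
import importlib.util


def installed(names):
    """Map each top-level package name to True if it can be imported here."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report the packages discussed above for the current environment.
for name, ok in installed(["geopandas", "pyarrow", "geoparquet"]).items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Running this in a fresh environment before and after installing each package gives a quick record of what was present when the read succeeded.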

In the config, we ask the user to specify the extension of the input file as ext_input. ext_input is also used when we pair footprints with their vector files (if we are visualizing data that has footprints).

When we read in vectors to be staged, we use geopandas.read_file(). Just before this, we can insert a check for the value of ext_input in the config like this:

ext_input = config.get('ext_input')

Then we can use geopandas.read_parquet() instead of read_file() if ext_input is ".parquet", and use read_file() for any other extension.
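
A minimal sketch of that check is below. The function names `pick_reader_name` and `read_vector` are hypothetical and not part of the staging code. The sketch also assumes the config object exposes `.get('ext_input')`, as in the line above, and that tolerating a missing leading dot or different letter case is wanted.

```python
def pick_reader_name(ext_input):
    """Return the name of the geopandas reader to use for this extension."""
    # Normalise so "parquet", ".parquet" and ".PARQUET" are treated alike.
    ext = (ext_input or "").strip().lower()
    if ext and not ext.startswith("."):
        ext = "." + ext
    return "read_parquet" if ext == ".parquet" else "read_file"


def read_vector(path, config):
    """Read one vector file into a GeoDataFrame based on config's ext_input."""
    # Imported here so pick_reader_name() can be used without geopandas.
    import geopandas as gpd

    reader = getattr(gpd, pick_reader_name(config.get("ext_input")))
    return reader(path)
```

With this in place, a call such as read_vector(path, config) would go through read_parquet() for the sample files above and fall back to read_file() for GeoPackage or any other input.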

@julietcohen julietcohen added the good first issue Good for newcomers label Jul 10, 2024