Explore GeoParquet file format as an input to staging #40

julietcohen · 2024-02-07T23:34:18Z

Apache Parquet is described as a modern alternative to CSV files, and GeoParquet adds interoperable geospatial types (Point, Line, Polygon) to Parquet (source). Initial exploration is needed to determine if and how we can stage vector data in GeoParquet format. Because Parquet is a columnar format, it should be well suited to processing large quantities of data in analytical use cases.

Suggested by Ingmar Nitze. A good initial step would be to either find a small GeoParquet file or receive one from Ingmar. This should be uploaded to /var/data/submission/pdg/...


julietcohen · 2024-02-20T23:14:41Z

Ingmar provided 2 parquet files for data in adjacent UTM zones (EPSG codes 32617 and 32618). These have been uploaded to a new directory: /var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/
This directory also contains 2 geopackage files of the same data.

julietcohen · 2024-04-25T22:49:08Z

In order to import a parquet file:

we can use geopandas.read_parquet(), which imports the file as a GeoDataFrame (it identifies the geometry column without the user needing to specify it)
the pyarrow package needs to be installed
I had also installed the geoparquet package into my Python env, but we can test whether that is necessary

import geopandas as gpd
import geoparquet
import pyarrow

data = gpd.read_parquet("/var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/32617_river.parquet")
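One way to start testing whether the geoparquet package is actually needed is to record which of the packages mentioned above are present in a given environment, then retry the import in an environment without geoparquet. The snippet below is only a minimal sketch of that check; the helper name `installed` is hypothetical and not part of any of these libraries.

```python
from importlib.util import find_spec


def installed(packages):
    """Map each top-level package name to True if it can be imported here.

    Hypothetical helper for illustration; it only checks importability,
    not versions.
    """
    return {name: find_spec(name) is not None for name in packages}


# Report which of the packages discussed in this thread are available.
print(installed(["geopandas", "pyarrow", "geoparquet"]))
```

If `geoparquet` shows as missing and `gpd.read_parquet()` still succeeds on the sample files, that would suggest the extra package is not required for reading.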

In the config, we ask the user to specify the extension of the input file here. ext_input is used here when we pair footprints to their vector files (if we are visualizing data that has footprints).

When we read in vectors to be staged, we use geopandas.read_file(). Just before this, we can insert a check for the value of ext_input in the config like this:

ext_input = config.get('ext_input')

We can then use geopandas.read_parquet() instead of read_file() if ext_input is ".parquet", and use read_file() for any other extension.
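The check described above could look roughly like the following. This is a sketch under assumptions, not the actual staging code: the function names `reader_name_for` and `read_vector` are hypothetical, `config` is assumed to be any object with a dict-style `.get()`, and the extension is normalized (lowercased, leading dot added) as a precaution that the original proposal does not mention.

```python
def reader_name_for(ext_input):
    """Return the name of the geopandas reader to use for an extension.

    Hypothetical helper: ".parquet" selects read_parquet, anything else
    falls back to read_file.
    """
    ext = str(ext_input).lower()
    if not ext.startswith("."):
        ext = "." + ext
    return "read_parquet" if ext == ".parquet" else "read_file"


def read_vector(path, config):
    """Read a vector file with the reader chosen from config['ext_input']."""
    # Imported here so the extension check above can be used without geopandas.
    import geopandas as gpd

    ext_input = config.get("ext_input")
    reader = getattr(gpd, reader_name_for(ext_input))
    return reader(path)
```

With a config of `{"ext_input": ".parquet"}`, `read_vector` would call `geopandas.read_parquet(path)`; with `".gpkg"` or `".shp"` it would call `geopandas.read_file(path)` as it does today.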

julietcohen added the "good first issue" (Good for newcomers) label Jul 10, 2024
