Feat: add big data workflow capabilities #396

RaczeQ · 2023-11-02T08:30:02Z

Currently, the library relies entirely on GeoPandas GeoDataFrames and requires the whole dataset to fit in memory from start to finish. This isn't ideal, since working with bigger areas requires more RAM. In this issue, we should decide which framework/library to use in the final pipeline.

Any insights or tips from people who have used these tools would be very helpful 😄

Currently available options:

dask-geopandas - GeoPandas extension for Dask
Apache Sedona - dedicated wrapper over Apache Spark and Flink for spatial operations
duckdb-spatial - fast in-memory db with spatial extension
geoarrow-python - Python tooling for GeoArrow, a standard under development for storing spatial objects in Apache Arrow
GeoPolars - geospatial extension for Polars, written in Rust

We should also decide whether the library will depend on a single framework or be open to extension with multiple backends, similar to the ibis project. Since we write our code against an abstract API, we should be able to implement multiple backends, but we will have to make sure that all results are consistent (high-quality tests). With different backends, outputs will either differ (Dask dataframe, DuckDB relation, Sedona object, GeoDataFrame, GeoParquet/GeoFeather file path), or we will have to write an abstraction around each object to make it consistent and backend-agnostic.
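To make the multi-backend idea more concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `RegionBackend`, `LoopBackend`, `CounterBackend` and `check_backends_agree` are illustrative names, not part of this library or of ibis. It assumes each backend normalizes its result to a plain dict, which is one possible way to get backend-agnostic outputs that can be compared in tests.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Protocol


class RegionBackend(Protocol):
    """Hypothetical minimal interface that every backend would implement."""

    name: str

    def count_by_region(self, rows: list[tuple[str, str]]) -> dict[str, int]:
        """Count features per region_id and return a plain dict."""
        ...


@dataclass
class LoopBackend:
    """Stand-in for one engine: counts with an explicit loop."""

    name: str = "loop"

    def count_by_region(self, rows: list[tuple[str, str]]) -> dict[str, int]:
        counts: dict[str, int] = {}
        for region_id, _feature in rows:
            counts[region_id] = counts.get(region_id, 0) + 1
        return counts


@dataclass
class CounterBackend:
    """Stand-in for a second engine: counts with collections.Counter."""

    name: str = "counter"

    def count_by_region(self, rows: list[tuple[str, str]]) -> dict[str, int]:
        return dict(Counter(region_id for region_id, _feature in rows))


def check_backends_agree(backends, rows):
    """Run every backend on the same input and list those that disagree
    with the first backend's result."""
    results = {b.name: b.count_by_region(rows) for b in backends}
    reference = next(iter(results.values()))
    mismatched = [name for name, res in results.items() if res != reference]
    return reference, mismatched
```

A consistency test would then call `check_backends_agree([LoopBackend(), CounterBackend()], rows)` and assert that the list of mismatched backends is empty. With real engines, the normalization step is where most of the work would go.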

We could also end each operation by writing its result to a GeoParquet/Arrow/Feather file and pass files between steps instead of keeping the data in memory.
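The file-based flow could look roughly like the sketch below. It is only an illustration under loud assumptions: JSON Lines files stand in for GeoParquet/Feather so the example needs nothing beyond the standard library, and `write_features`, `filter_step`, `count_step` and `run_pipeline` are hypothetical names. The point is that each step takes a path and returns a path, and streams records instead of loading the whole dataset.

```python
import json
import tempfile
from pathlib import Path


def write_features(features, path: Path) -> Path:
    """Write one JSON object per line (stand-in for a GeoParquet file)."""
    with path.open("w", encoding="utf-8") as f:
        for feature in features:
            f.write(json.dumps(feature) + "\n")
    return path


def filter_step(src: Path, dst: Path, tag: str) -> Path:
    """Stream src line by line, keep features with the given tag, write dst."""
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if json.loads(line)["tag"] == tag:
                fout.write(line)
    return dst


def count_step(src: Path) -> int:
    """Count the records in a file without holding them in memory."""
    with src.open(encoding="utf-8") as f:
        return sum(1 for _ in f)


def run_pipeline(features, tag: str) -> int:
    """Chain the steps through intermediate files in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = write_features(features, Path(tmp) / "raw.jsonl")
        filtered = filter_step(raw, Path(tmp) / "filtered.jsonl", tag)
        return count_step(filtered)
```

Only one record is in memory at a time in each step, so peak RAM no longer depends on the size of the area being processed.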

Additional tools worth mentioning:

https://github.com/rapidsai/cuspatial - CUDA accelerated spatial operations

RaczeQ · 2023-11-02T08:51:14Z

dask-geopandas

Pros:

Very similar to GeoPandas API
Can be written in Python
No need for external tools

Cons:

Dataframes are lazy and have to be computed explicitly by the user, which makes them harder to interact with in Jupyter
Existing GeoPandas and plain-Python logic has to be rewritten as vectorized operations that build an expression tree to be executed
Calculating a spatial index for faster operations (based on the Hilbert curve) replaces the original index (region_id)
Have to learn how to partition data into chunks
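To illustrate the Hilbert curve point, here is a small plain-Python sketch. It does not use the dask-geopandas API; `hilbert_distance` and `sort_by_hilbert` are hypothetical helpers, and the input is assumed to be integer grid cells rather than real geometries. It shows one possible workaround for the index problem: keep region_id as an ordinary field, so it survives when the data is reordered by Hilbert distance.

```python
def hilbert_distance(x: int, y: int, order: int = 4) -> int:
    """Map grid cell (x, y) on a 2**order x 2**order grid to its position
    along the Hilbert curve (standard iterative xy-to-d conversion)."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve orientation stays consistent.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def sort_by_hilbert(records, order: int = 4):
    """Sort records (dicts with region_id, x, y) by Hilbert distance.
    region_id stays in each record as a regular field."""
    with_distance = (
        {**r, "hilbert": hilbert_distance(r["x"], r["y"], order)} for r in records
    )
    return sorted(with_distance, key=lambda r: r["hilbert"])
```

Cells that are close on the curve tend to be close in space, which is why sorting by this value helps when splitting data into spatially coherent partitions.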

RaczeQ · 2023-11-02T08:52:27Z

Apache Sedona

Pros:

Has many ST_ functions and H3 related functions
Can be written in Python

Cons:

Dataframes are lazy and have to be computed explicitly by the user, which makes them harder to interact with in Jupyter
Requires a Spark cluster, so it is hard to use out of the box without setup

RaczeQ · 2023-11-02T08:57:42Z

DuckDB with spatial extension

Pros:

Very fast operations
No need for external tools, can be easily imported into Python
Dedicated OSM PBF native reader without GDAL dependency

Cons:

Relations can be difficult for end users to grasp and are harder to interact with in Jupyter
Currently has no spatial index, so it is sometimes slower than GeoPandas in spatial operations
All operations have to be rewritten in SQL

RaczeQ · 2023-11-02T09:00:29Z

GeoPolars

Pros:

Fast operations (implemented in Rust)
No need for external tools, can be easily imported into Python
Can be written in Python
Will implement GeoArrow format in the future

Cons:

Not many functions are implemented
Still in development and isn't production-ready

RaczeQ · 2023-11-02T09:05:52Z

GeoArrow for Python

Pros:

Can read GeoArrow format into Pandas and PyArrow
Should be able to work on files and vectorized operations using Arrow
Can be written in Python

Cons:

No spatial operations are currently available (for now you have to convert to Shapely objects to run a function)

RaczeQ pinned this issue Nov 2, 2023

RaczeQ added the enhancement and question labels Nov 2, 2023
