Feat: add big data workflow capabilities #396

Open
RaczeQ opened this issue Nov 2, 2023 · 5 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

Collaborator

RaczeQ commented Nov 2, 2023

Currently, the library relies entirely on GeoPandas GeoDataFrames and requires the whole dataset to fit in memory from start to finish. This isn't ideal, since working with bigger areas requires more RAM. In this issue, we should decide which framework/library to use in the final pipeline.

Insights and tips from anyone who has used these tools would be very helpful 😄

Currently available options:

  • dask-geopandas - GeoPandas extension for Dask
  • Apache Sedona - dedicated wrapper over Apache Spark and Flink for spatial operations
  • duckdb-spatial - spatial extension for DuckDB, a fast in-process database
  • geoarrow-python - Python bindings for GeoArrow, a specification under development for storing spatial objects in Apache Arrow
  • GeoPolars - geospatial extension for Polars, written in Rust

We should also decide whether the library will depend on a single framework or stay open to extension and implement multiple backends, similar to the ibis project. Since our code is written against an abstract API, multiple backends should be feasible, but we would need high-quality tests to make sure the results are consistent across them. The outputs would also differ between backends (Dask DataFrame, DuckDB relation, Sedona object, GeoDataFrame, GeoParquet/GeoFeather file path) unless we write an abstraction around each object to make it consistent and backend-agnostic.
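To make the multi-backend idea more concrete, here is a minimal, dependency-free sketch. Everything in it is hypothetical: the `Backend` protocol, its method names, and the two toy backends (an eager list and a lazy generator standing in for engines such as GeoPandas and Dask/DuckDB) are assumptions for illustration, not an existing API of this library or of any of the tools listed above.

```python
from typing import Any, Protocol


class Backend(Protocol):
    """Hypothetical interface that every engine adapter would implement."""

    name: str

    def load(self, records: list[dict[str, Any]]) -> Any:
        """Turn plain records into the engine's native object."""

    def filter_by_tag(self, data: Any, key: str, value: str) -> Any:
        """Keep only the features whose tag `key` equals `value`."""

    def to_records(self, data: Any) -> list[dict[str, Any]]:
        """Convert the native object back to a common, comparable form."""


class EagerListBackend:
    """Toy stand-in for an eager engine: everything is a materialized list."""

    name = "eager-list"

    def load(self, records):
        return list(records)

    def filter_by_tag(self, data, key, value):
        return [r for r in data if r.get(key) == value]

    def to_records(self, data):
        return data


class LazyGeneratorBackend:
    """Toy stand-in for a lazy engine: nothing runs until to_records()."""

    name = "lazy-generator"

    def load(self, records):
        return iter(records)

    def filter_by_tag(self, data, key, value):
        return (r for r in data if r.get(key) == value)

    def to_records(self, data):
        return list(data)


def run_pipeline(backend: Backend, records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Library code talks only to the protocol, never to a concrete engine."""
    data = backend.load(records)
    filtered = backend.filter_by_tag(data, "amenity", "cafe")
    return backend.to_records(filtered)


FEATURES = [
    {"feature_id": "node/1", "amenity": "cafe"},
    {"feature_id": "node/2", "amenity": "bank"},
    {"feature_id": "node/3", "amenity": "cafe"},
]

# Run the same pipeline on every backend and collect results in the common form.
results = {
    backend.name: run_pipeline(backend, FEATURES)
    for backend in (EagerListBackend(), LazyGeneratorBackend())
}
```

Comparing the entries of `results` against each other is the kind of cross-backend consistency test mentioned above; the price is that every backend needs a `to_records`-style conversion into one shared output form.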

We could also end each operation by writing the result to a GeoParquet/Arrow/Feather file and work on files instead of loading the data into memory.
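A rough sketch of that file-in, file-out style, assuming each step takes a source path and returns the path it wrote. To keep it runnable without extra dependencies, JSON Lines files stand in for GeoParquet/Feather; the helper names (`read_features`, `write_features`, `filter_step`, `count_step`) are made up for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def read_features(path: Path) -> Iterator[dict]:
    """Stream features one at a time instead of loading the whole file."""
    with path.open() as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def write_features(features: Iterable[dict], path: Path) -> Path:
    """Write features one per line and return the path for the next step."""
    with path.open("w") as f:
        for feature in features:
            f.write(json.dumps(feature) + "\n")
    return path


def filter_step(src: Path, dst: Path, key: str, value: str) -> Path:
    """One pipeline operation: read a file, write a file, return its path."""
    matching = (f for f in read_features(src) if f.get(key) == value)
    return write_features(matching, dst)


def count_step(src: Path) -> int:
    """Terminal operation that only needs a single pass over the file."""
    return sum(1 for _ in read_features(src))


workdir = Path(tempfile.mkdtemp())

# Ten synthetic features; every third one is a cafe.
raw = write_features(
    (
        {"feature_id": f"node/{i}", "amenity": "cafe" if i % 3 == 0 else "bank"}
        for i in range(10)
    ),
    workdir / "raw.jsonl",
)

# Steps are chained by passing paths around, not in-memory objects.
cafes = filter_step(raw, workdir / "cafes.jsonl", "amenity", "cafe")
cafe_count = count_step(cafes)
```

Because every step returns a path, the output type stays the same regardless of which engine produced the file, which is one way around the differing outputs per backend.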

Additional tools worth mentioning:

RaczeQ pinned this issue Nov 2, 2023
RaczeQ added the enhancement and question labels Nov 2, 2023
Collaborator Author

RaczeQ commented Nov 2, 2023

dask-geopandas

Pros:

  • Very similar to GeoPandas API
  • Can be written in Python
  • No need for external tools

Cons:

  • DataFrames are lazy and have to be computed explicitly by the user, which makes them harder to interact with in Jupyter
  • The logic has to be rewritten from GeoPandas and plain Python operations into vectorized ones that build an expression tree for deferred execution
  • Calculating the spatial index for faster operations (based on the Hilbert curve) replaces the original index (region_id)
  • Have to learn how to partition the data into chunks
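To illustrate the index problem, here is a plain-Python sketch of ordering regions by Hilbert curve distance while keeping `region_id` as a regular field so it is not lost. This is not how dask-geopandas implements it internally; the grid cells, the `ORDER` constant, and the sample regions are assumptions made up for the example.

```python
def hilbert_distance(order: int, x: int, y: int) -> int:
    """Distance of cell (x, y) along a Hilbert curve on a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


ORDER = 2  # hypothetical 4 x 4 grid

# Hypothetical regions, each already assigned to a grid cell.
regions = [
    {"region_id": "a", "cell": (3, 0)},
    {"region_id": "b", "cell": (0, 0)},
    {"region_id": "c", "cell": (1, 1)},
    {"region_id": "d", "cell": (0, 3)},
]

# Store the Hilbert distance next to region_id instead of replacing it.
for region in regions:
    region["hilbert"] = hilbert_distance(ORDER, *region["cell"])

# Spatial ordering: neighbours on the curve tend to be neighbours in space.
by_hilbert = sorted(regions, key=lambda r: r["hilbert"])

# The original identifier is still available for lookups after sorting.
lookup = {r["region_id"]: r for r in by_hilbert}
```

In a dataframe setting, the equivalent would be to copy `region_id` into a column before the spatial index is calculated.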

Collaborator Author

RaczeQ commented Nov 2, 2023

Apache Sedona

Pros:

  • Has many ST_ functions and H3-related functions
  • Can be written in Python

Cons:

  • DataFrames are lazy and have to be computed explicitly by the user, which makes them harder to interact with in Jupyter
  • Requires a Spark cluster, so it is hard to use out of the box without setup

Collaborator Author

RaczeQ commented Nov 2, 2023

DuckDB with spatial extension

Pros:

  • Very fast operations
  • No need for external tools, can be easily imported into Python
  • Dedicated OSM PBF native reader without GDAL dependency

Cons:

  • Relations can be difficult for end users to grasp and are harder to interact with in Jupyter
  • Currently no spatial index, so spatial operations are sometimes slower than in GeoPandas
  • All operations have to be rewritten in SQL

Collaborator Author

RaczeQ commented Nov 2, 2023

GeoPolars

Pros:

  • Fast operations - Rust
  • No need for external tools, can be easily imported into Python
  • Can be written in Python
  • Will implement GeoArrow format in the future

Cons:

  • Not many functions are implemented
  • Still in development and isn't production-ready

Collaborator Author

RaczeQ commented Nov 2, 2023

GeoArrow for Python

Pros:

  • Can read GeoArrow format into Pandas and PyArrow
  • Should be able to work on files and vectorized operations using Arrow
  • Can be written in Python

Cons:

  • No spatial operations are currently available (for now, geometries have to be converted to Shapely objects before a function can be executed)
