Feat: add big data workflow capabilities #396
RaczeQ added the enhancement (New feature or request) and question (Further information is requested) labels on Nov 2, 2023
dask-geopandas
Pros:
Cons:

Apache Sedona
Pros:
Cons:

DuckDB with spatial extension
Pros:
Cons:

GeoPolars
Pros:
Cons:

GeoArrow for Python
Pros:
Cons:
Currently, the library focuses entirely on GeoPandas GeoDataFrames and requires the whole dataset to fit in the machine's memory from start to finish. This isn't ideal, since working with bigger areas requires more RAM. In this issue, we should decide which framework/library to use in the final pipeline. Any insight from people who have used those tools, along with any tips, will be very helpful 😄

Currently available options:

dask-geopandas - GeoPandas extension for Dask
Apache Sedona - dedicated wrapper over Apache Spark and Flink for spatial operations
duckdb-spatial - fast in-memory DB with a spatial extension
geoarrow-python - a standard currently in development for storing spatial objects in Apache Arrow
GeoPolars - geospatial extension for Polars, written in Rust

We should also decide whether the library will depend on a single framework only, or whether it will be open for extensions and implement multiple backends - similar to the ibis project. Since we write our code against an abstract API, we should be able to implement multiple backends, but we will have to make sure that all results are consistent (high-quality tests). With different backends, the outputs will either differ (Dask DataFrame, DuckDB relation, Sedona object, GeoDataFrame, GeoParquet/GeoFeather file path), or we will have to write an abstraction around each object to make it consistent and backend-agnostic.

We could also finish each operation with a calculated GeoParquet/Arrow/Feather file and work on files instead of loading them into memory.
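The multi-backend idea above could be sketched roughly as follows. This is only an illustration of the abstraction-plus-consistency-tests approach, not the library's actual API: the `Backend` class, the `area_sum` operation, and both backend implementations are hypothetical stand-ins for real GeoPandas/Dask/Sedona/DuckDB wrappers.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical common interface each big-data backend would implement."""

    @abstractmethod
    def area_sum(self, rectangles: list[tuple[float, float]]) -> float:
        """Sum the areas of (width, height) rectangles."""


class InMemoryBackend(Backend):
    # Baseline backend: plain Python, everything held in memory at once
    # (analogous to the current GeoPandas-only path).
    def area_sum(self, rectangles):
        return sum(w * h for w, h in rectangles)


class ChunkedBackend(Backend):
    # Out-of-core-style backend: processes fixed-size chunks one at a time,
    # mimicking how Dask or Sedona would partition the data.
    def __init__(self, chunk_size: int = 2):
        self.chunk_size = chunk_size

    def area_sum(self, rectangles):
        total = 0.0
        for i in range(0, len(rectangles), self.chunk_size):
            chunk = rectangles[i : i + self.chunk_size]
            total += sum(w * h for w, h in chunk)
        return total


# The "high-quality tests" requirement: every backend must agree.
data = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
results = [backend.area_sum(data) for backend in (InMemoryBackend(), ChunkedBackend())]
assert len(set(results)) == 1
```

Callers would only ever see the `Backend` interface, so the differing native outputs (Dask DataFrame, DuckDB relation, Sedona object, ...) stay hidden behind each implementation.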
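The file-based alternative mentioned above could look roughly like this. It is a minimal sketch of the pattern only: each step reads a file path, writes its result to a new file, and returns that path, so only one step's data is ever in memory. JSON stands in here for GeoParquet/Arrow/Feather, and the step names (`write_step`, `filter_step`) are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_step(records: list[dict], out_dir: Path, name: str) -> Path:
    """Persist one step's output and hand back the file path, not the data."""
    path = out_dir / f"{name}.json"
    path.write_text(json.dumps(records))
    return path


def filter_step(in_path: Path, out_dir: Path, min_area: float) -> Path:
    """Load the previous step's file, filter it, and write a new file."""
    records = json.loads(in_path.read_text())
    kept = [r for r in records if r["area"] >= min_area]
    return write_step(kept, out_dir, "filtered")


# Chain the pipeline on paths instead of in-memory objects.
out_dir = Path(tempfile.mkdtemp())
source = write_step([{"id": 1, "area": 5.0}, {"id": 2, "area": 0.5}], out_dir, "source")
result_path = filter_step(source, out_dir, min_area=1.0)
```

With real GeoParquet files the intermediate results would also be directly reusable as caches between runs, which is a side benefit of this design.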
Additional tools worth mentioning: