xvec support #1405

Draft
wants to merge 3 commits into
base: main

ahuang11
Collaborator

@ahuang11 ahuang11 commented Aug 30, 2024

Closes holoviz/geoviews#737

I'm not entirely sure what level of support should be implemented for xvec; should this live in geoviews or just in hvplot?

image image

In attempt 2 (current), I convert the xarray dataset into a geopandas GeoDataFrame. Extra geometry dimensions are "flattened", i.e. converted into integer indices via drop_vars, and all remaining xarray dims are gathered as groupby, so they still show up as slider widgets.
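As a rough illustration of the flattening step described above, here is a minimal sketch. It is not the code in this PR: `flatten_for_plotting` is a hypothetical helper, and WKT strings stand in for the shapely geometries that xvec would normally hold, so that only xarray, pandas and numpy are needed. It relies on the fact that dropping a dimension coordinate with `drop_vars` leaves the dimension in place, so xarray labels it with positions 0..n-1 when building a table.

```python
import numpy as np
import xarray as xr


def flatten_for_plotting(da, geometry_dim, extra_geometry_dims):
    """Hypothetical helper: flatten a cube into a long table.

    `geometry_dim` is the geometry dimension to draw; every dimension in
    `extra_geometry_dims` is reduced to integer positions.
    """
    # Dropping the coordinate keeps the dimension, which xarray then
    # numbers 0..n-1 when converting to a DataFrame.
    flat = da.drop_vars(extra_geometry_dims).to_dataframe().reset_index()
    # Every dimension other than the plotted geometry becomes a groupby key,
    # which is what would end up as a slider widget.
    groupby = [dim for dim in da.dims if dim != geometry_dim]
    return flat, groupby


# WKT strings are an assumption standing in for real shapely geometries.
places = ["POINT (0 0)", "POINT (1 1)", "POINT (2 2)"]
cube = xr.DataArray(
    np.arange(2 * 3 * 3).reshape(2, 3, 3),
    coords={"mode": ["car", "bike"], "origin": places, "destination": places},
    dims=["mode", "origin", "destination"],
    name="traffic_counts",
)

flat, groupby = flatten_for_plotting(cube, "origin", ["destination"])
print(groupby)
print(flat.head())
```

With real geometries, the resulting table could presumably be wrapped in a `geopandas.GeoDataFrame` with `geometry="origin"` before plotting; the integer `destination` column is exactly the "not meaningful" index discussed in this comment.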

Screen.Recording.2024-08-29.at.5.08.31.PM.mov

However, the integers aren't meaningful, so I was wondering: if there are extra geometries, should I overlay the centroid points of the other geometries? If so, how do I even do that in hvplot?

In attempt 1, I tried to keep the data in its xarray structure, but that requires many more changes in hvplot to plot geometries nested in xarray.

cc: @hoxbro


import geopandas as gpd
import pandas as pd
import hvplot.pandas
import hvplot.xarray  # needed for the .hvplot() call on the xarray object below

import xarray as xr

uri = "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2"
era5_ds_sub = (
    # Open the dataset
    xr.open_zarr(uri, chunks={"time": 48}, consolidated=True)
    # Select the near-surface level
    .isel(level=0, drop=True)
    # subset in time
    .sel(time=slice("2017-01", "2018-01"))
    # reduce to two arrays
    [["2m_temperature", "u_component_of_wind"]]
)
era5_ds_sub


cities_df = pd.read_json(
    "hf://datasets/jamescalam/world-cities-geo/train.jsonl", lines=True
)
cities_eur = cities_df.loc[cities_df["continent"] == "Europe"]
cities_eur = gpd.GeoDataFrame(
    cities_eur,
    geometry=gpd.points_from_xy(cities_eur.longitude, cities_eur.latitude),
    crs="EPSG:4326",
).drop(["latitude", "longitude", "x", "y", "z"], axis=1)
import xvec

era5_europe_cities = era5_ds_sub.xvec.extract_points(
    cities_eur.geometry, x_coords="longitude", y_coords="latitude"
).drop_vars("index")

era5_europe_cities["2m_temperature"].isel(time=slice(0, 2)).hvplot()

import geopandas as gpd
import numpy as np
import pandas as pd
import xarray as xr
import xvec
import hvplot.xarray

from geodatasets import get_path
chicago = gpd.read_file(get_path("geoda.chicago health"))

origin = destination = chicago.geometry.array
mode = ["car", "bike", "foot"]
date = pd.date_range("2023-01-01", periods=100)
hours = range(24)
rng = np.random.default_rng(1)
data = rng.integers(1, 100, size=(3, 100, 24, len(chicago), len(chicago)))
traffic_counts = xr.DataArray(
    data,
    coords=(mode, date, hours, origin, destination),
    dims=["mode", "date", "time", "origin", "destination"],
    name="traffic_counts",
).xvec.set_geom_indexes(["origin", "destination"], crs=chicago.crs)
traffic_counts.sel(date="2023-02-28", time=12, mode="bike").hvplot("traffic_counts", hover_cols=["date", "time"])


codecov bot commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 26.66667% with 11 lines in your changes missing coverage. Please review.

Project coverage is 88.68%. Comparing base (6c96c7e) to head (3243cbb).
Report is 19 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| hvplot/converter.py | 10.00% | 9 Missing ⚠️ |
| hvplot/util.py | 60.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1405      +/-   ##
==========================================
+ Coverage   87.39%   88.68%   +1.28%     
==========================================
  Files          50       51       +1     
  Lines        7490     7509      +19     
==========================================
+ Hits         6546     6659     +113     
+ Misses        944      850      -94     


Member

@philippjfr philippjfr left a comment


This seems simple enough to me and I'm okay with merging, BUT as we keep adding data backends that simply convert to an already-supported type, I'd suggest a plugin mechanism where you can register a converter function rather than adding a bunch of if/else cases.

Member

@maximlt maximlt left a comment


When we add support for a new data backend I'd like to see us explain more in depth why we want to add it. Things like what motivated the PR, how well established and maintained the backend is, how popular it is, etc. I personally don't know xvec and neither this PR nor the GeoViews issue provides much information about it. It may have been mentioned in a HoloViz meeting but I also saw no trace of that looking at the notes.

@ahuang11
Collaborator Author

ahuang11 commented Oct 18, 2024

FWIW it's contributed by the maintainers of xarray and geopandas
https://github.com/xarray-contrib/xvec/graphs/contributors

It's quite new, so not very established, but because it's by maintainers of xarray/geopandas and part of earthmover (whose cofounders started the Pangeo movement), I imagine it'll establish its name over time.

The motivation can be found in the blog post
https://earthmover.io/blog/vector-datacube-pt1

Some data is more naturally represented as a multi-dimensional cube. Consider a collection of weather stations that record temperature and windspeed. These measurements are stored in the columns of a geopandas.GeoDataFrame, while the coordinates of each weather station are stored as Shapely Point geometries in a geometry column. We can quickly access a lot of information and ask questions such as “how do temperatures vary across the elevation range covered by the weather stations”, and “where are windspeeds highest?” But, each time the weather station records a measurement, we get a new set of data for each variable. How should that new data be incorporated into the GeoDataFrame? While there are ways of representing such multi-dimensional data in tabular form (see Pebesma, 2022), the column structure is still fundamentally one-dimensional, and these strategies all involve duplicating data along either the row or column dimension.

In the weather station example, the data are fundamentally two-dimensional ([location, time]) and must be flattened to fit into a dataframe. Contrast this to raster data cubes, where data is explicitly represented as multi-dimensional. In this data model, adding new dimensions is easy, and popular tools reflect this fundamental concept. What would it look like, and how would our workflows change, if vector data were also represented as a cube?

Also, unsure whether this should be a part of geoviews first before hvplot

@maximlt
Member

maximlt commented Oct 19, 2024

Thanks for the details. It does indeed look like it's at a very early stage in terms of adoption:
image

I found this issue (xarray-contrib/xvec#82) on their repo where they're discussing plotting capabilities. I'd encourage you to chime in and see with them if we could easily provide solutions. If a collaboration gets established, there's a better chance we'll be successful, i.e. that the interface we build ends up actually being used by real users.

> Also, unsure whether this should be a part of geoviews first before hvplot

Also unsure, I imagine in hvPlot it should be integrated as a simple conversion layer while in GeoViews it'd be more involved.

Successfully merging this pull request may close these issues.

Support vector data cubes in xarray
3 participants