sharing dimensions across dataarrays in a dataset #8177

cchrysostomou · 2017-07-07T14:58:18Z

I have two questions regarding proper implementation of an xarray dataset when defining dimensions. First, I am wondering whether I can share the same dimension across multiple arrays in a dataset without storing NaN values for coordinates not present in each respective array.

As a simple example, I am interested in creating two data arrays that share the dimensions x and y. However, in the first data array I only care about x-coordinates from 0 to 5, whereas in the second I only care about x-coordinates from 10 to 12:

import numpy as np
import pandas as pd
import xarray as xr

vals1 = np.random.normal(size=(6,2))
vals2 = np.random.normal(size=(3,3))
x1 = xr.Dataset(
    {'table1': (['x', 'y'], vals1)},
    coords={
        'x': np.arange(6),
        'y': np.arange(2)
    }
)
x2 = xr.Dataset(
    {'table2': (['x', 'y'], vals2)},
    coords={
        'x': np.arange(10, 10+3),
        'y': np.arange(8, 8+3)
    }
)

If I naively merge the two datasets, the dimensions and coordinates are merged correctly, but each data variable in the dataset becomes much larger than it needs to be because it stores unnecessary NaN values:

merged = xr.merge([x1,x2])
merged['table1']

<xarray.DataArray 'table1' (x: 9, y: 5)>
array([[ 0.553098, -1.157813,       nan,       nan,       nan],
       [-0.259999, -0.476526,       nan,       nan,       nan],
       [ 1.650893, -0.364517,       nan,       nan,       nan],
       [ 0.16149 , -0.037587,       nan,       nan,       nan],
       [ 0.799689, -0.128728,       nan,       nan,       nan],
       [-0.613603, -1.410235,       nan,       nan,       nan],
       [      nan,       nan,       nan,       nan,       nan],
       [      nan,       nan,       nan,       nan,       nan],
       [      nan,       nan,       nan,       nan,       nan]])
Coordinates:
  * y        (y) int64 0 1 8 9 10
  * x        (x) int64 0 1 2 3 4 5 10 11 12

In my second question, I want to add an extra layer of complexity to this and add a third variable that uses multi-indexing. Again naively, I would have wanted the multi-index in the third table to share dimensions (x and y) from the previous data variables

# I would have preferred to do this
index = pd.MultiIndex.from_tuples([(0, 0, 1), (1, 1, 1), (2, 2, 1)], names=('x', 'y', 'z'))
vals3 = np.random.normal(size=(3,3))
x3 = xr.Dataset(
    {'table3-multiindex': (['multi-index', 'cols'], vals3)},
    coords={'multi-index': index}
)

# Except, merging with previous dataset raises an error due to name conflicts
xr.merge([x1, x2, x3])

ValueError: conflicting MultiIndex level name(s):
'y' (multi-index), (y)
'x' (multi-index), (x)

Currently my solution is to rename the dimensions in each data array so that they do not overlap. This is not ideal, but I can probably get away with it. Since I would prefer to share dimensions without adding NaN values, is there another way to achieve this? (I'm also assuming that I can still do joins later on using values stored under different dimension names.)

# current solution, merge data arrays but have each dimension be unique
vals1 = np.random.normal(size=(6,2))
vals2 = np.random.normal(size=(3,3))
x1 = xr.Dataset(
    {'table1': (['x1', 'y1'], vals1)},
    coords={
        'x1': np.arange(6),
        'y1': np.arange(2)
    }
)
x2 = xr.Dataset(
    {'table2': (['x2', 'y2'], vals2)},
    coords={
        'x2': np.arange(10, 10+3),
        'y2': np.arange(8, 8+3)
    }
)

index = pd.MultiIndex.from_tuples([(0, 0, 1), (1, 1, 1), (2, 2, 1)], names=('x3', 'y3', 'z3'))
vals3 = np.random.normal(size=(3,3))
x3 = xr.Dataset(
    {'table3-multiindex': (['multi-index', 'cols'], vals3)},
    coords={'multi-index': index}
)

xr.merge([x1, x2, x3])
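
A minimal sketch of the renaming workaround and of the "join later" assumption, restricted to the first two tables. The variable names (`t1`, `t2`, `a_full`, `b_full`) and the seeded random generator are illustrative assumptions, not part of the original post. It suggests that variables with unique dimension names are merged without padding, and that they can still be brought onto a common grid on demand by renaming the dimensions back and calling `xr.align`.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Each table gets its own dimension names, so merging does not pad anything.
t1 = xr.DataArray(
    rng.normal(size=(6, 2)),
    dims=["x1", "y1"],
    coords={"x1": np.arange(6), "y1": np.arange(2)},
    name="table1",
)
t2 = xr.DataArray(
    rng.normal(size=(3, 3)),
    dims=["x2", "y2"],
    coords={"x2": np.arange(10, 13), "y2": np.arange(8, 11)},
    name="table2",
)
merged = xr.merge([t1, t2], join="outer")

# Both variables keep their original shapes inside the merged Dataset.
print(merged["table1"].shape, merged["table2"].shape)

# Later, to compare on a common grid: rename back to shared names and align.
a = merged["table1"].rename({"x1": "x", "y1": "y"})
b = merged["table2"].rename({"x2": "x", "y2": "y"})
a_full, b_full = xr.align(a, b, join="outer")
print(a_full.shape)  # the NaN padding only appears here, when it is asked for
```

The padding cost is then paid only for the intermediate aligned arrays rather than for everything stored in the dataset.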

shoyer · 2017-07-07T15:48:05Z

Maintainer

I'm afraid this isn't possible, by design. Every variable in a Dataset sharing the same coordinate system is enforced as part of the xarray data model. This makes data analysis and comparison with a Dataset quite straightforward, since everything is already on the same grid.

For cases where you need different coordinate values and/or dimension sizes, your options are to either rename dimensions for different variables or use multiple Dataset/DataArray objects (Python has nice built-in data structures).

In theory, we could add something like an "UnalignedDataset" that supports most of the Dataset methods without requiring alignment but I'm not sure it's worth the trouble.
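
As a rough sketch of the second option above (several DataArray objects held in a built-in container), the two tables from the question could live in a plain `dict`. The container name `tables` and the size bookkeeping are assumptions made for illustration; the comparison with `xr.merge` is only there to show how much padding the shared grid would add.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# A plain dict of DataArrays: each array keeps its own x/y coordinates.
tables = {
    "table1": xr.DataArray(
        rng.normal(size=(6, 2)),
        dims=["x", "y"],
        coords={"x": np.arange(6), "y": np.arange(2)},
        name="table1",
    ),
    "table2": xr.DataArray(
        rng.normal(size=(3, 3)),
        dims=["x", "y"],
        coords={"x": np.arange(10, 13), "y": np.arange(8, 11)},
        name="table2",
    ),
}

# Number of values actually stored when the arrays stay separate.
stored = sum(da.size for da in tables.values())

# Number of values stored after merging onto one shared (x, y) grid.
merged = xr.merge(list(tables.values()), join="outer")
padded = sum(merged[name].size for name in tables)

print(stored, padded)

# Label-based selection still works on each array individually.
row = tables["table2"].sel(x=11)
```

The trade-off is that Dataset-level conveniences (a single `sel` across all variables, saving to one file) have to be replaced by loops over the dict.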

smartass101 · 2018-10-16T17:24:42Z

I've hit this design limitation quite often as well, with several use-cases, both in experiment and simulation. It detracts from xarray's power of conveniently and transparently handling coordinate meta-data. From the Why xarray? page:

with xarray, you don’t need to keep track of the order of arrays dimensions or insert dummy dimensions

Adding effectively dummy dimensions or coordinates is essentially what this alignment design is forcing us to do.

A possible solution would be to let (some) coordinate arrays in an (Unaligned)Dataset be a "reducible" MultiIndex, i.e. one that reduces to a plain Index for each DataArray. A workaround is to use MultiIndex coordinates directly, but then alignment cannot be done easily, because levels do not behave like real dimensions.

Use-case examples:

1. coordinate "metadata"

I often have measurements on related axes, but also with additional coordinates (different positions, etc.). Consider:

import numpy as np
import xarray as xr
n1 = np.arange(1, 22)
m1 = xr.DataArray(n1*0.5, coords={'num': n1, 'B': 'r', 'ar' :'A'}, dims=['num'], name='MA')
n2 = np.arange(2, 21)
m2 = xr.DataArray(n2*0.5, coords={'num': n2, 'B': 'r', 'ar' :'B'}, dims=['num'], name='MB')
ds = xr.merge([m1, m2])
print(ds)

What I would like to get (pseudocode):

<xarray.Dataset>
Dimensions:  (num: 21, ar:2)   # <-- note that MB is still of dims {'num': 19} only
Coordinates:              # <-- mostly unions as done by concat
  * num      (num) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
    B        <U1 'r'
  * ar       <U1 'A' 'B'    # <-- this is now a dim of the dataset, but not of MA or MB
Data variables:
    MA       (num) float64 0.5 1.0 1.5 2.0 2.5 3.0 ... 8.0 8.5 9.0 9.5 10.0 10.5
    MB       (num) float64 1.0 1.5 2.0 2.5 3.0 3.5 ... 7.5 8.0 8.5 9.0 9.5 10.0

Instead I get

MergeError: conflicting values for variable 'ar' on objects to be combined:
first value: <xarray.Variable ()>
array('A', dtype='<U1')
second value: <xarray.Variable ()>
array('B', dtype='<U1')

While it is possible to concat into something with dimensions (num, ar, B), it often results in huge arrays where most values are nan.
I could also store the "position" metadata as attrs, but that pretty much defeats the point of using xarray, which is to have coordinates transparently attached to the data. Also, sometimes I would like to select arrays from the dataset by location, e.g. Dataset.sel(ar='B').

2. unaligned time domains

This is a large problem, especially when different time bases are involved. A difference in sampling intervals blows up the storage with a huge number of NaN values, which of course greatly complicates further calculations, e.g. filtering in the time domain. Even time intervals that merely do not overlap require at least double the storage.
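
The storage blow-up can be made concrete with a small sketch. The two signals, their names (`a`, `b`) and the integer sampling intervals of 2 and 3 time units are hypothetical values chosen so the counts are easy to check by hand.

```python
import numpy as np
import xarray as xr

# Hypothetical signals: one sampled every 2 time units, the other every 3.
t_a = np.arange(0, 60, 2)  # 30 samples
t_b = np.arange(0, 60, 3)  # 20 samples
a = xr.DataArray(np.sin(t_a / 10), dims=["time"], coords={"time": t_a}, name="a")
b = xr.DataArray(np.cos(t_b / 10), dims=["time"], coords={"time": t_b}, name="b")

# Merging puts both variables on the union of the two time axes.
merged = xr.merge([a, b], join="outer")

n_time = merged.sizes["time"]            # length of the shared time axis
nan_a = int(merged["a"].isnull().sum())  # padding added to "a"
nan_b = int(merged["b"].isnull().sum())  # padding added to "b"
print(n_time, nan_a, nan_b)
```

Here 50 original samples end up stored as 80 values, and any filter applied along `time` afterwards has to deal with the interleaved NaN entries.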

I often find myself resorting rather to pandas.MultiIndex which gladly manages such non-aligned coordinates while still enabling slicing and selection on various levels. So it can be done and the pandas code and functionality already exists.
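
For comparison, a minimal pandas-only sketch of the MA/MB example from use case 1. The series names (`ma`, `mb`, `s`) are illustrative assumptions. The point is that the combined object stores exactly the original values and still supports selection on either level.

```python
import numpy as np
import pandas as pd

n1 = np.arange(1, 22)
n2 = np.arange(2, 21)
ma = pd.Series(n1 * 0.5, index=pd.Index(n1, name="num"))
mb = pd.Series(n2 * 0.5, index=pd.Index(n2, name="num"))

# The dict keys become the outer "ar" level; nothing is padded with NaN.
s = pd.concat({"A": ma, "B": mb}, names=["ar", "num"])

only_b = s.loc["B"]              # all 19 values measured at position B
at_num_1 = s.xs(1, level="num")  # num=1 exists only for position A
print(len(s), len(only_b), list(at_num_1.index))
```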

shoyer · 2018-10-16T19:00:16Z

Maintainer

You can use a pandas.MultiIndex with xarray. The interface/abstraction could be improved and has some rough edges (e.g., see especially #1603), but I think this is the preferred way to support these use cases. It does already work for indexing.
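
One way this might look for the MA/MB example earlier in the thread is sketched below. The flat dimension name `sample` and the variable name `M` are assumptions introduced for illustration. The two measurements are laid end to end along one dimension, and `set_index` then builds a MultiIndex from the `ar` and `num` coordinates.

```python
import numpy as np
import xarray as xr

n1 = np.arange(1, 22)
n2 = np.arange(2, 21)

# Lay both measurements end to end along one flat, hypothetical "sample" dim.
da = xr.DataArray(
    np.concatenate([n1 * 0.5, n2 * 0.5]),
    dims=["sample"],
    coords={
        "ar": ("sample", np.array(["A"] * n1.size + ["B"] * n2.size)),
        "num": ("sample", np.concatenate([n1, n2])),
        "B": "r",  # scalar coordinate shared by both measurements
    },
    name="M",
)

# Combine the two per-sample coordinates into a MultiIndex on "sample".
stacked = da.set_index(sample=["ar", "num"])

# Level-based selection, as asked for in use case 1.
mb = stacked.sel(ar="B")
print(stacked.size, mb.size)
```

No NaN padding is stored, but the rough edges mentioned above still apply: how the selected level and the dimension name are reported after `sel` may differ between xarray versions.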

smartass101 · 2018-10-18T09:48:20Z

I indeed often resort to using a pandas.MultiIndex, but especially the dropping of the selected coordinate value (#1408) makes it quite inconvenient.

shoyer · 2018-10-18T15:21:24Z

Maintainer

I'm marking #1408 as a bug so we won't forget about it. Hopefully it should be fixed automatically as part of the "explicit indexes" refactor.

tommylees112 · 2018-10-29T15:21:34Z

@smartass101 & @shoyer what would be the code for working with a pandas.MultiIndex object in this use case? Could you show how it would work for your example above?

<xarray.Dataset>
Dimensions:  (num: 21, ar:2)   # <-- note that MB is still of dims {'num': 19} only
Coordinates:              # <-- mostly unions as done by concat
  * num      (num) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
    B        <U1 'r'
  * ar       <U1 'A' 'B'    # <-- this is now a dim of the dataset, but not of MA or MB
Data variables:
    MA       (num) float64 0.5 1.0 1.5 2.0 2.5 3.0 ... 8.0 8.5 9.0 9.5 10.0 10.5
    MB       (num) float64 1.0 1.5 2.0 2.5 3.0 3.5 ... 7.5 8.0 8.5 9.0 9.5 10.0

I am working with land surface model outputs. I have lots of one-dimensional data for different lat/lon points, at different times. I want to join them all into one dataset to make plotting easier. E.g. plot the evapotranspiration estimates for all the stations at their x,y coordinates.

Thanks very much!

zbarry · 2019-08-26T15:00:35Z

I just wanted to chime in on how useful it would be to do something like this without the extra mental overhead required by the proposed workaround. My use case parallels @smartass101's very closely. Have there been any updates to xarray since last year that might make this use case easier to handle, by any chance? :)

marcel-goldschen-ohm · 2023-08-15T20:15:42Z

I also would love this feature. Consider a dataset with many 1-D time series recordings across repeated sweeps and stimulus series all stored in one N-D array (e.g., dimensions ['series', 'sweep', 'time']). For one or a small number of these time series there is an artifactual wobble in the baseline that needs removing. It would be great to be able to store these detrended time series in the same dataset as a new array that shares the same time coords, but has a subset of the series and sweep coords. This would require that each DataArray in a Dataset can have its own coords which would supersede the Dataset's coords when defined.

I imagine there are likely many use cases for working with a subset of a dataset along some, but not all dimensions. In all those cases, it would be convenient to house that data within the same dataset rather than having a bunch of datasets with repeated coord arrays for shared dimensions.

Given how stale this discussion is, I imagine this would probably be too much of a headache. But it would be very nice. I am in favor of the idea of a new UnalignedDataset as suggested by @shoyer.
