sharing dimensions across dataarrays in a dataset #8177
Replies: 8 comments
-
I'm afraid this isn't possible, by design. Every variable in a Dataset sharing the same coordinate system is enforced as part of the xarray data model. This makes data analysis and comparison with a Dataset quite straightforward, since everything is already on the same grid. For cases where you need different coordinate values and/or dimension sizes, your options are to either rename dimensions for different variables or use multiple Dataset/DataArray objects (Python has nice built-in data structures). In theory, we could add something like an "UnalignedDataset" that supports most of the Dataset methods without requiring alignment, but I'm not sure it's worth the trouble.
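The "multiple objects" option can be sketched as follows. This is only a minimal illustration, assuming two hypothetical arrays `a` and `b` that use the same dimension name `x` with different coordinate values; the names and sizes are made up for the example.

```python
import numpy as np
import xarray as xr

# Two hypothetical arrays that share the dimension name "x" but not its values.
a = xr.DataArray(np.arange(6.0), coords={"x": np.arange(0, 6)}, dims="x", name="a")
b = xr.DataArray(np.arange(3.0), coords={"x": np.arange(10, 13)}, dims="x", name="b")

# A plain dict does not enforce alignment, so each array keeps its own size.
arrays = {"a": a, "b": b}

# Operations are then applied per array, e.g. a mean over "x" for every entry.
means = {name: float(arr.mean("x")) for name, arr in arrays.items()}
```

The trade-off is that Dataset-level conveniences (a single `sel`, a single `to_netcdf`) have to be applied entry by entry.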
-
I've hit this design limitation quite often as well, with several use-cases, both in experiment and simulation. It detracts from xarray's power of conveniently and transparently handling coordinate meta-data. From the Why xarray? page:
Adding effectively dummy dimensions or coordinates is essentially what this alignment design is forcing us to do.
A possible solution would be to have (some) coordinate arrays in an (Unaligned)Dataset be a "reducible" MultiIndex (it would reduce to an Index for each DataArray). A workaround is to use MultiIndex coordinates directly, but then alignment cannot be done easily, as levels do not behave as real dimensions.
Use-case examples:
1. coordinate "metadata"
I often have measurements on related axes, but also with additional coordinates (different positions, etc.). Consider:
import numpy as np
import xarray as xr
n1 = np.arange(1, 22)
m1 = xr.DataArray(n1*0.5, coords={'num': n1, 'B': 'r', 'ar' :'A'}, dims=['num'], name='MA')
n2 = np.arange(2, 21)
m2 = xr.DataArray(n2*0.5, coords={'num': n2, 'B': 'r', 'ar' :'B'}, dims=['num'], name='MB')
ds = xr.merge([m1, m2])
print(ds)
What I would like to get (pseudocode):
<xarray.Dataset>
Dimensions: (num: 21, ar:2) # <-- note that MB is still of dims {'num': 19} only
Coordinates: # <-- mostly unions as done by concat
* num (num) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
B <U1 'r'
* ar <U1 'A' 'B' # <-- this is now a dim of the dataset, but not of MA or MB
Data variables:
MA (num) float64 0.5 1.0 1.5 2.0 2.5 3.0 ... 8.0 8.5 9.0 9.5 10.0 10.5
MB (num) float64 1.0 1.5 2.0 2.5 3.0 3.5 ... 7.5 8.0 8.5 9.0 9.5 10.0
Instead I get:
MergeError: conflicting values for variable 'ar' on objects to be combined:
first value: <xarray.Variable ()>
array('A', dtype='<U1')
second value: <xarray.Variable ()>
array('B', dtype='<U1')
While it is possible to
2. unaligned time domains
This is a large problem, especially when different time bases are involved. A difference in sampling intervals will blow up the storage with a huge number of NaN values, which of course greatly complicates further calculations, e.g. filtering in the time domain. Even merely non-overlapping time intervals will require at least double the storage. I often find myself resorting rather to
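The storage cost of unaligned time bases can be made concrete with a small sketch. It assumes two hypothetical signals, `fast` and `slow`, whose sample times never coincide; the sampling intervals and lengths are invented for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical signals: "fast" is sampled every time unit, "slow" every 10 units
# with a 0.5 offset, so no time stamps coincide.
t_fast = np.arange(0.0, 100.0, 1.0)
t_slow = np.arange(0.5, 100.0, 10.0)
fast = xr.DataArray(np.ones(t_fast.size), coords={"time": t_fast}, dims="time", name="fast")
slow = xr.DataArray(np.ones(t_slow.size), coords={"time": t_slow}, dims="time", name="slow")

# An outer join places both variables on the union of the two time axes.
merged = xr.merge([fast, slow], join="outer")

n_time = merged.sizes["time"]                   # length of the shared time axis
nan_fast = int(merged["fast"].isnull().sum())   # padding added to the fast signal
nan_slow = int(merged["slow"].isnull().sum())   # padding added to the slow signal
print(n_time, nan_fast, nan_slow)
```

In this setup the slow signal holds 10 real samples but is stored on an axis of 110 points, so any time-domain filter would first have to drop or interpolate the padding.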
-
You can use a |
-
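I indeed often resort to using a pandas.MultiIndex, but especially the dropping of the selected coordinate value (#1408) makes it quite inconvenient.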
-
I'm marking #1408 as a bug so we won't forget about it. Hopefully it should
be fixed automatically as part of the "explicit indexes" refactor.
On Thu, Oct 18, 2018 at 2:48 AM Ondrej Grover wrote:
> I indeed often resort to using a pandas.MultiIndex, but especially the dropping of the selected coordinate value (#1408) makes it quite inconvenient.
-
@smartass101 & @shoyer what would be the code for working with a
I am working with land surface model outputs. I have lots of one-dimensional data for different lat/lon points, at different times. I want to join them all into one dataset to make plotting easier. E.g. plot the evapotranspiration estimates for all the stations at their x,y coordinates. Thanks very much!
-
I just wanted to chime in on the usefulness of being able to do something like this without the extra mental overhead required by the proposed workaround. My use case parallels @smartass101's very closely. Have there been any updates to xarray since last year that might make streamlining this use case a bit more feasible, by any chance? :)
-
I also would love this feature. Consider a dataset with many 1-D time series recordings across repeated sweeps and stimulus series all stored in one N-D array (e.g., dimensions ['series', 'sweep', 'time']). For one or a small number of these time series there is an artifactual wobble in the baseline that needs removing. It would be great to be able to store these detrended time series in the same dataset as a new array that shares the same time coords, but has a subset of the series and sweep coords. This would require that each DataArray in a Dataset can have its own coords which would supersede the Dataset's coords when defined. I imagine there are likely many use cases for working with a subset of a dataset along some, but not all dimensions. In all those cases, it would be convenient to house that data within the same dataset rather than having a bunch of datasets with repeated coord arrays for shared dimensions. Given how stale this discussion is, I imagine this would probably be too much of a headache. But it would be very nice. I am in favor of the idea of a new UnalignedDataset as suggested by @shoyer.
-
I have two questions regarding proper implementation of an xarray dataset when defining dimensions. First, I am wondering whether I can share the same dimension across multiple arrays in a dataset without storing NaN values for coordinates not present in each respective array.
As a simple example, I am interested in creating two data arrays that involve the shared dimensions x and y; however, in the first data array I only care about x-coordinates from 0 to 5, whereas in the second data array I only care about x-coordinates from 10 to 12.
If I naively merge the two data arrays, the dimensions and coordinates get merged correctly, but now each of the data variables within the dataset is much larger than it needs to be (it stores unnecessary NaN values).
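A minimal sketch of this situation, assuming hypothetical arrays named `first` and `second` filled with ones and a shared `y` axis of length 3 (the original arrays are not shown in the post):

```python
import numpy as np
import xarray as xr

y = np.arange(3)

# "first" covers x = 0..5, "second" covers x = 10..12; both share the same y values.
first = xr.DataArray(
    np.ones((6, 3)), coords={"x": np.arange(0, 6), "y": y}, dims=("x", "y"), name="first"
)
second = xr.DataArray(
    np.ones((3, 3)), coords={"x": np.arange(10, 13), "y": y}, dims=("x", "y"), name="second"
)

# Merging aligns both variables on the union of the x values (outer join).
merged = xr.merge([first, second], join="outer")

print(merged["first"].shape)
print(int(merged["first"].isnull().sum()), int(merged["second"].isnull().sum()))
```

Both variables end up on an `x` axis of length 9, so `first` gains 3 padded rows and `second` gains 6, all filled with NaN.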
In my second question, I want to add an extra layer of complexity to this and add a third variable that uses multi-indexing. Again naively, I would have wanted the multi-index in the third table to share dimensions (x and y) from the previous data variables
Currently my solution is just to rename the dimensions in each respective data array so that they do not overlap. While this is not ideal, I can probably get away with it, but since I would prefer the ability to share dimensions without adding in NaN values, is there another way to achieve this? (I'm also assuming that I can still do joins later on using values within different dimension names.)
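The renaming workaround might look like the sketch below. The dimension names `x_first` and `x_second` and the array contents are assumptions made for illustration; the later "join" is done here by renaming back to a common name and calling `xr.align` on demand.

```python
import numpy as np
import xarray as xr

y = np.arange(3)
first = xr.DataArray(np.ones((6, 3)), coords={"x": np.arange(0, 6), "y": y}, dims=("x", "y"))
second = xr.DataArray(np.ones((3, 3)), coords={"x": np.arange(10, 13), "y": y}, dims=("x", "y"))

# Give each array its own x dimension; y stays shared, so nothing is padded.
ds = xr.Dataset(
    {"first": first.rename(x="x_first"), "second": second.rename(x="x_second")}
)

# Later, rename back to a common dimension name and align on coordinate values.
f, s = xr.align(
    ds["first"].rename(x_first="x"),
    ds["second"].rename(x_second="x"),
    join="outer",
)
```

The NaN padding then only exists in the temporary aligned copies `f` and `s`, not in the stored dataset.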