Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Look at using new xarray.unify_chunks() function rather than the DataArray method we currently use #184

Open
agstephens opened this issue Sep 6, 2021 · 4 comments
Assignees

Comments

@agstephens
Copy link
Collaborator

agstephens commented Sep 6, 2021

  • clisops version: *
  • Python version: *
  • Operating System: *

Description

There is a new global level unify_chunks() function in xarray that we should check out. It should improve the logic.

http://xarray.pydata.org/en/stable/generated/xarray.unify_chunks.html#xarray.unify_chunks

@ellesmith88: please can you do some testing to confirm that you get the same output from: ds.unify_chunks() and
da = ds[var_id]; da.unify_chunks()

We use the latter in our clisops code, so it would be great if we can switch to the former. Please add one or two unit tests to confirm that they will be functionally equivalent. And if so, then please make the change. Thanks

@agstephens agstephens self-assigned this Sep 6, 2021
@agstephens agstephens assigned ellesmith88 and unassigned agstephens Nov 10, 2021
@ellesmith88
Copy link
Collaborator

I've just started looking at this and I'm a bit confused.

Currently we have:

def get_chunked_dataset(ds):
    da = get_da(ds)
    chunk_length = get_chunk_length(da)
    chunked_ds = ds.chunk({"time": chunk_length})
    da.unify_chunks()
    return chunked_ds

I don't think the line da.unify_chunks() is actually doing anything. It returns a <class 'xarray.core.dataarray.DataArray'> object, which we don't assign to a variable. Nothing changes with chunked_ds.

So I think we should change this anyway to:

def get_chunked_dataset(ds):
    chunk_length = get_chunk_length(ds)
    chunked_ds = ds.chunk({"time": chunk_length})
    chunked_ds = chunked_ds.unify_chunks()
    return chunked_ds

I've compared the methods in some tests

def test_unify_chunks_cmip5():
# DataArray unify chunks method
ds1 = _open(CMIP5_TAS)
da = get_da(ds1)
chunk_length = get_chunk_length(da)
chunked_ds1 = ds1.chunk({"time": chunk_length})
da.unify_chunks()
# Dataset unify chunks method
ds2 = _open(CMIP5_TAS)
chunk_length = get_chunk_length(ds2)
chunked_ds2 = ds1.chunk({"time": chunk_length})
chunked_ds2_unified = chunked_ds2.unify_chunks()
assert chunked_ds1.chunks == chunked_ds2.chunks
assert chunked_ds2.chunks == chunked_ds2_unified.chunks
def test_unify_chunks_cmip6():
# DataArray unify chunks method
ds1 = _open(CMIP6_TOS)
da = get_da(ds1)
chunk_length = get_chunk_length(da)
chunked_ds1 = ds1.chunk({"time": chunk_length})
da.unify_chunks()
# Dataset unify chunks method
ds2 = _open(CMIP6_TOS)
chunk_length = get_chunk_length(ds2)
chunked_ds2 = ds1.chunk({"time": chunk_length})
chunked_ds2_unified = chunked_ds2.unify_chunks()
assert chunked_ds1.chunks == chunked_ds2.chunks
assert chunked_ds2.chunks == chunked_ds2_unified.chunks

but have found that chunked_ds = chunked_ds.unify_chunks() doesn't change anything in these examples.

The definition of the function is 'Unify chunk size along all chunked dimensions of this Dataset.', so is it not doing anything because we're only chunking time?

@agstephens
Copy link
Collaborator Author

Thanks @ellesmith88, when I was using this and experimenting last year, I got the impression that calling ds[var_id].unify_chunks() would change the value of this property: ds[var_id].chunks - so it changed a property of the existing DataArray. Although I may have just been really confused. Please double-check whether this .chunks property does get changed by the old and new method discussed above.

@ellesmith88
Copy link
Collaborator

Hi @agstephens I've just been having another look at this. ds[var_id].chunks is changed by ds.chunk({"time": chunk_length}) but not changed by ds[var_id].unify_chunks() which returns a DataArray or chunked_ds2.unify_chunks() which returns a Dataset.

Also,chunk_length = get_chunk_length(da) returns different chunk lengths depending on whether it is calculated using da or ds, because it uses size = da.nbytes or size = ds.nbytes to calculate the chunk length. Which does it make most sense to use?

@agstephens
Copy link
Collaborator Author

Thanks @ellesmith88,
So it sounds like we should not use unify_chunks() at all - is that correct? If so, please just remove it.
Regarding calculating the chunk size, given that we are working with 1 variable (i.e. DataArray) at a time, I think we should probably be using size = da.nbytes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants