Look at using new `xarray.unify_chunks()` function rather than the DataArray method we currently use #184

agstephens · 2021-09-06T08:01:15Z

clisops version: *
Python version: *
Operating System: *

Description

There is a new global level unify_chunks() function in xarray that we should check out. It should improve the logic.

http://xarray.pydata.org/en/stable/generated/xarray.unify_chunks.html#xarray.unify_chunks

@ellesmith88: please can you do some testing to confirm that you get the same output from: ds.unify_chunks() and
da = ds[var_id]; da.unify_chunks()

We use the latter in our clisops code, so it would be great if we can switch to the former. Please add one or two unit tests to confirm that they will be functionally equivalent. And if so, then please make the change. Thanks

The text was updated successfully, but these errors were encountered:

ellesmith88 · 2021-11-23T12:26:20Z

I've just started looking at this and I'm a bit confused.

Currently we have:

def get_chunked_dataset(ds):
    da = get_da(ds)
    chunk_length = get_chunk_length(da)
    chunked_ds = ds.chunk({"time": chunk_length})
    da.unify_chunks()
    return chunked_ds

I don't think the line da.unify_chunks() is actually doing anything. It returns a <class 'xarray.core.dataarray.DataArray'> object, which we don't assign to a variable. Nothing changes with chunked_ds.

So I think we should change this anyway to:

def get_chunked_dataset(ds):
    chunk_length = get_chunk_length(ds)
    chunked_ds = ds.chunk({"time": chunk_length})
    chunked_ds = chunked_ds.unify_chunks()
    return chunked_ds

I've compared the methods in some tests

clisops/tests/test_output_utils.py

Lines 214 to 247 in d2e1929

    
           def test_unify_chunks_cmip5(): 
        
               # DataArray unify chunks method 
        
               ds1 = _open(CMIP5_TAS) 
        
               da = get_da(ds1) 
        
               chunk_length = get_chunk_length(da) 
        
               chunked_ds1 = ds1.chunk({"time": chunk_length}) 
        
               da.unify_chunks() 
        
               # Dataset unify chunks method 
        
               ds2 = _open(CMIP5_TAS) 
        
               chunk_length = get_chunk_length(ds2) 
        
               chunked_ds2 = ds1.chunk({"time": chunk_length}) 
        
               chunked_ds2_unified = chunked_ds2.unify_chunks() 
        
               assert chunked_ds1.chunks == chunked_ds2.chunks 
        
               assert chunked_ds2.chunks == chunked_ds2_unified.chunks 
        
           def test_unify_chunks_cmip6(): 
        
               # DataArray unify chunks method 
        
               ds1 = _open(CMIP6_TOS) 
        
               da = get_da(ds1) 
        
               chunk_length = get_chunk_length(da) 
        
               chunked_ds1 = ds1.chunk({"time": chunk_length}) 
        
               da.unify_chunks() 
        
               # Dataset unify chunks method 
        
               ds2 = _open(CMIP6_TOS) 
        
               chunk_length = get_chunk_length(ds2) 
        
               chunked_ds2 = ds1.chunk({"time": chunk_length}) 
        
               chunked_ds2_unified = chunked_ds2.unify_chunks() 
        
               assert chunked_ds1.chunks == chunked_ds2.chunks 
        
               assert chunked_ds2.chunks == chunked_ds2_unified.chunks

but have found that chunked_ds = chunked_ds.unify_chunks() doesn't change anything in these examples.

The definition of the function is 'Unify chunk size along all chunked dimensions of this Dataset.', so is it not doing anything because we're only chunking time?

agstephens · 2021-11-24T14:13:03Z

Thanks @ellesmith88, when I was using this and experimenting last year, I got the impression that calling ds[var_id].unify_chunks() would change the value of this property: ds[var_id].chunks - so it changed a property of the existing DataArray. Although I may have just been really confused. Please double-check whether this .chunks property does get changed by the old and new method discussed above.

ellesmith88 · 2021-11-30T11:40:47Z

Hi @agstephens I've just been having another look at this. ds[var_id].chunks is changed by ds.chunk({"time": chunk_length}) but not changed by ds[var_id].unify_chunks() which returns a DataArray or chunked_ds2.unify_chunks() which returns a Dataset.

Also,chunk_length = get_chunk_length(da) returns different chunk lengths depending on whether it is calculated using da or ds, because it uses size = da.nbytes or size = ds.nbytes to calculate the chunk length. Which does it make most sense to use?

agstephens · 2021-11-30T17:14:38Z

Thanks @ellesmith88,
So it sounds like we should not use unify_chunks() at all - is that correct? If so, please just remove it.
Regarding calculating the chunk size, given that we are working with 1 variable (i.e. DataArray) at a time, I think we should probably be using size = da.nbytes

agstephens self-assigned this Sep 6, 2021

agstephens assigned ellesmith88 and unassigned agstephens Nov 10, 2021

ellesmith88 mentioned this issue Dec 2, 2021

Remove da.unify_chunks() #209

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Look at using new `xarray.unify_chunks()` function rather than the DataArray method we currently use #184

Look at using new `xarray.unify_chunks()` function rather than the DataArray method we currently use #184

agstephens commented Sep 6, 2021 •

edited

Loading

ellesmith88 commented Nov 23, 2021

agstephens commented Nov 24, 2021

ellesmith88 commented Nov 30, 2021

agstephens commented Nov 30, 2021

Look at using new xarray.unify_chunks() function rather than the DataArray method we currently use #184

Look at using new xarray.unify_chunks() function rather than the DataArray method we currently use #184

Comments

agstephens commented Sep 6, 2021 • edited Loading

Description

ellesmith88 commented Nov 23, 2021

agstephens commented Nov 24, 2021

ellesmith88 commented Nov 30, 2021

agstephens commented Nov 30, 2021

Look at using new `xarray.unify_chunks()` function rather than the DataArray method we currently use #184

Look at using new `xarray.unify_chunks()` function rather than the DataArray method we currently use #184

agstephens commented Sep 6, 2021 •

edited

Loading