I am trying to use xarray with TensorFlow to train a NN. At the end of each epoch, the data is shuffled along the time dimension. The netCDF file has (time, lat, lon, level) dimensions, and I use a chunk size of 1024 on the time dimension, which has about 400k values. The problem I am having is that after the DataArray is shuffled, performance of the data pipeline drops significantly, so I thought that rechunking the data after shuffling would solve the problem. I say this because if I write the shuffled data to disk and read it back again (commented-out code below) with a chunk size of 1024, it goes as fast as the un-shuffled dataset. So my question is: why is rechunking slow compared to writing the data to disk and reading it back?
Here is the code I am using:
```python
# assumes "import dask.array as da" and "import xarray as xr" at module level
def on_epoch_end(self):
    """Shuffle the dataset at the end of each epoch."""
    if self.shuffle:
        # Get the Dask array containing the data values
        dask_data = self.data.data
        # Create a shuffled index array along the 'time' dimension
        shuffled_indices = da.random.permutation(dask_data.shape[0])
        # Use Dask's lazy indexing to perform the shuffling
        shuffled_data = dask_data[shuffled_indices, :, :, :]
        # shuffled_data = da.rechunk(shuffled_data, chunks={0: 1024})
        # Create a new DataArray with the shuffled data
        self.data = xr.DataArray(
            shuffled_data, coords=self.data.coords, dims=self.data.dims
        )
        self.data = self.data.chunk({"time": 1024})
        ## save data to file and read it again
        # self.data.to_netcdf("save.nc")
        # ds = xr.open_dataset(
        #     "save.nc",
        #     chunks={"time": 1024},
        # )
        # first_variable_name = list(ds.variables)[4]
        # self.data = ds[first_variable_name]
```
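To make the slowdown visible outside my training code, here is a small self-contained repro (the shapes are made up; only the 1024 chunking along time mirrors my setup). As far as I understand, the out-of-order fancy index makes every output chunk depend on nearly every input chunk, so the task graph balloons, and a later rechunk inherits that many-to-many dependency structure rather than removing it:

```python
import numpy as np
import dask.array as da

# Stand-in for the real (time, lat, lon, level) array, chunked by 1024 in time
x = da.zeros((8192, 4, 4, 2), chunks=(1024, 4, 4, 2))
idx = np.random.permutation(x.shape[0])

shuffled = x[idx]  # out-of-order fancy indexing (may emit a PerformanceWarning)
rechunked = shuffled.rechunk({0: 1024})

# The graph grows from one task per chunk to roughly nchunks**2 along time,
# because each shuffled chunk gathers rows from (almost) every source chunk.
print(len(x.__dask_graph__()), "tasks originally")
print(len(shuffled.__dask_graph__()), "tasks after shuffling")
print(len(rechunked.__dask_graph__()), "tasks after shuffling + rechunking")
```

The disk round-trip presumably avoids this because the write materializes the permutation once, and the subsequent `open_dataset` produces chunks that each read one contiguous block from the file.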
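In case it matters for the discussion, the only chunk-friendly alternative I can think of is an approximate shuffle that permutes whole chunks and then rows within each chunk, so every output chunk still comes from exactly one input chunk and no rechunk is needed. A rough sketch (not a full permutation, and untested against my real pipeline):

```python
import numpy as np
import dask.array as da

def blockwise_shuffle(x: da.Array, seed: int = 0) -> da.Array:
    """Approximate shuffle along axis 0 that preserves the chunk structure."""
    rng = np.random.default_rng(seed)
    sizes = x.chunks[0]                    # e.g. (1024, 1024, ..., remainder)
    starts = np.cumsum((0,) + sizes[:-1])  # start offset of each chunk
    pieces = []
    for i in rng.permutation(len(sizes)):  # shuffle the order of the chunks
        block = x[starts[i] : starts[i] + sizes[i]]
        # shuffle rows within the chunk; the index stays inside one chunk
        pieces.append(block[rng.permutation(sizes[i])])
    return da.concatenate(pieces, axis=0)
```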