I am trying to use xarray with TensorFlow to train a NN. At the end of each epoch, the data is shuffled along the time dimension. The netCDF file has (time, lat, lon, level) dimensions, and I use a chunk size of 1024 on the time dimension, which has about 400k values. The problem I am having is that after the DataArray is shuffled, performance of the data pipeline drops significantly, so I thought that rechunking the data after shuffling would solve the problem. I say this because if I write the shuffled data to disk and read it back again (commented-out code below) with a chunk size of 1024, it goes as fast as the un-shuffled dataset. So my question is: why is rechunking slow compared to writing the data to disk and reading it back?
Here is the code I am using:
```python
# assumes "import dask.array as da" and "import xarray as xr" at module level
def on_epoch_end(self):
    """Shuffle the dataset at the end of each epoch."""
    if self.shuffle:
        # Get the Dask array containing the data values
        dask_data = self.data.data
        # Create a shuffled index array along the 'time' dimension
        shuffled_indices = da.random.permutation(dask_data.shape[0])
        # Use Dask's lazy indexing to perform the shuffling
        shuffled_data = dask_data[shuffled_indices, :, :, :]
        # shuffled_data = da.rechunk(shuffled_data, chunks={0: 1024})
        # Create a new DataArray with the shuffled data
        self.data = xr.DataArray(
            shuffled_data, coords=self.data.coords, dims=self.data.dims
        )
        self.data = self.data.chunk({"time": 1024})
        ## save data to file and read it again
        # self.data.to_netcdf("save.nc")
        # ds = xr.open_dataset(
        #     "save.nc",
        #     chunks={"time": 1024},
        # )
        # first_variable_name = list(ds.variables)[4]
        # self.data = ds[first_variable_name]
```
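To make the slowdown visible outside my training code, here is a small self-contained repro (the shapes are made up; only the 1024 chunking along time mirrors my setup). As far as I understand, the out-of-order fancy index makes every output chunk depend on nearly every input chunk, so the task graph balloons, and a later rechunk inherits that many-to-many dependency structure rather than removing it:

```python
import numpy as np
import dask.array as da

# Stand-in for the real (time, lat, lon, level) array, chunked by 1024 in time
x = da.zeros((8192, 4, 4, 2), chunks=(1024, 4, 4, 2))
idx = np.random.permutation(x.shape[0])

shuffled = x[idx]  # out-of-order fancy indexing (may emit a PerformanceWarning)
rechunked = shuffled.rechunk({0: 1024})

# The graph grows from one task per chunk to roughly nchunks**2 along time,
# because each shuffled chunk gathers rows from (almost) every source chunk.
print(len(x.__dask_graph__()), "tasks originally")
print(len(shuffled.__dask_graph__()), "tasks after shuffling")
print(len(rechunked.__dask_graph__()), "tasks after shuffling + rechunking")
```

The disk round-trip presumably avoids this because the write materializes the permutation once, and the subsequent `open_dataset` produces chunks that each read one contiguous block from the file.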
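In case it matters for the discussion, the only chunk-friendly alternative I can think of is an approximate shuffle that permutes whole chunks and then rows within each chunk, so every output chunk still comes from exactly one input chunk and no rechunk is needed. A rough sketch (not a full permutation, and untested against my real pipeline):

```python
import numpy as np
import dask.array as da

def blockwise_shuffle(x: da.Array, seed: int = 0) -> da.Array:
    """Approximate shuffle along axis 0 that preserves the chunk structure."""
    rng = np.random.default_rng(seed)
    sizes = x.chunks[0]                    # e.g. (1024, 1024, ..., remainder)
    starts = np.cumsum((0,) + sizes[:-1])  # start offset of each chunk
    pieces = []
    for i in rng.permutation(len(sizes)):  # shuffle the order of the chunks
        block = x[starts[i] : starts[i] + sizes[i]]
        # shuffle rows within the chunk; the index stays inside one chunk
        pieces.append(block[rng.permutation(sizes[i])])
    return da.concatenate(pieces, axis=0)
```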