Replies: 7 comments 21 replies
-
@rabernat Great sleuthing. 👍 💯 This is another example of how badly needed metadata sanity checks are.
-
These are Matlab datetimes... (hi Tom!) Ryan, does it look like creating the CFTimeIndex is really slow?
-
Creating the CFTimeIndex is insanely slow. I did not even have the patience to let it finish, and interrupted after a few minutes.
-
My colleague @bpgreenwood is fixing the time variable. We were wondering: is there an optimal or 'native' time format that would give the best performance, or do we just need to stay within the Timestamp-valid range?
-
For posterity, here are some notes on performance when cftime is used (though I think we've established that ideally cftime would not be used at all for this example). The cftime slowness is not xarray-related. It comes down to two things: how far the encoded times sit from their reference date, and a change I made a while back to […]. The upshot is that it further emphasizes the benefits of choosing a proximal reference date, at least when using cftime. In this particular example, choosing a proximal reference date affords a nearly 300x speedup.

Here I chose to limit the size of the problem, because decoding 15 million distant times with this approach seems like it would take over an hour. In contrast, with a proximal reference date it might take on the order of fifteen seconds (still long, but bearable).
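For reference, here is a rough sketch of how such a comparison could be set up (synthetic data and an assumed construction, not the original benchmark; the ~300x figure comes from the measurement described above). `use_cftime=True` forces the cftime path even for in-range dates:

```python
import numpy as np
import xarray as xr

n = 200_000  # deliberately small; the real file holds ~15 million timestamps

# Distant reference date: day counts from year 1, Matlab-datenum-like magnitudes.
distant = xr.Dataset(
    {"time": ("time", 737_800.0 + np.arange(n) / 86400.0,
              {"units": "days since 0001-01-01"})}
)

# Proximal reference date: comparable instants, counted in seconds from a nearby date.
proximal = xr.Dataset(
    {"time": ("time", np.arange(n, dtype="float64"),
              {"units": "seconds since 2021-01-01"})}
)

# In IPython:
# %timeit xr.decode_cf(distant, use_cftime=True)   # slow: distant reference date
# %timeit xr.decode_cf(proximal, use_cftime=True)  # much faster: proximal reference
```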
-
I looked a little more at the size of […]

Indeed, I think this is what was confusing me. (I'm still confused; I guess integers in a netCDF file would be interpreted by xarray as NumPy ints? Maybe I should go read the xarray docs.)
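For what it's worth, that guess is right: xarray reads a file's fixed-width integer types as the matching NumPy dtypes. A quick round-trip sketch (made-up file name; assumes the netCDF4 backend is installed):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"t": ("t", np.arange(10, dtype="uint32"))})
ds.to_netcdf("ints.nc")  # the default NETCDF4 format supports unsigned integers

reopened = xr.open_dataset("ints.nc")
print(reopened["t"].dtype)  # uint32 -- the on-disk type is preserved
```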
-
Thank you all for your help. I'll share a Matlab function I wrote that converts a Matlab datenum array to 'ms since [user defined epoch]'. Tom, our dataset contains ~9 days of 20 Hz data. We can use either uint32 or uint64 for the output (the int_type argument below):
```matlab
% convert Matlab datenum timestamp array to 'ms since epoch' for use with NetCDF
% refer to https://github.com/pydata/xarray/discussions/6284#discussioncomment-2205538
% epoch    - datenum scalar representing start of time period
% time     - datenum array of timestamps
% int_type - 'uint32' or 'uint64' string defining the returned time datatype
%            (note: uint32 milliseconds saturate after ~49.7 days; uint64 covers any span)
% returns uint32 or uint64 array of timestamps in 'ms since epoch' format
function [time_ms] = convert_time_ms(epoch, time, int_type)
    fprintf(' * converting timestamps from Matlab datenum to NetCDF "ms since %s"\n', ...
        datestr(epoch, 'yyyy-mm-ddTHH:MM:SS.FFFZ'));
    if any(time < epoch)
        fprintf(' * warning: timestamps found prior to epoch; see convert_time_ms()\n');
    end
    % unit conversion using the user-specified integer data type (days -> ms)
    if strcmp(int_type, 'uint32')
        time_ms = uint32((time - epoch) * 24 * 60 * 60 * 1000);
    elseif strcmp(int_type, 'uint64')
        time_ms = uint64((time - epoch) * 24 * 60 * 60 * 1000);
    else
        % handle unknown data type
        error(' * convert_time_ms(): unknown data type ''%s''', int_type);
    end
    % detect overflow: Matlab integer casts saturate at intmax, so any
    % saturated value means the requested type was too small
    if any(time_ms == intmax(int_type))
        fprintf(' * warning: timestamps exceed %s max range: [0,%u]\n', int_type, intmax(int_type));
    end
end
```
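And on the xarray side, a quick synthetic check (assumed epoch, sized like the ~9 days of 20 Hz data) that millisecond counts with a proximal reference date decode straight to datetime64, with no cftime involved. Note I've spelled out 'milliseconds' in the units attribute, which is the form I know xarray's fast datetime64 path handles:

```python
import numpy as np
import xarray as xr

# 9 days at 20 Hz -> one sample every 50 ms, ~15.5 million timestamps
ms = np.arange(0, 9 * 24 * 60 * 60 * 1000, 50, dtype="uint64")
ds = xr.Dataset({"time": ("time", ms, {"units": "milliseconds since 2022-02-18"})})

decoded = xr.decode_cf(ds)
print(decoded["time"].dtype)  # datetime64[ns]
```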
-
Over in #1385 (comment), @jtomfarrar posted an example of a NetCDF file which was really slow to open unless one specifies `decode_times=False`. I spent some time debugging this, and I thought I would post my findings here. The file in question is here: https://drive.google.com/file/d/1-05bG2kF8wbvldYtDpZ3LYLyqXnvZyw1/view?usp=sharing

When we open it with `decode_times=False`, it is very fast. If we don't specify that option, it takes forever and also produces the following warning.
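A hedged reconstruction of the two open calls (the file name is assumed, and the warning text is the message xarray emits when dates fall outside the datetime64 range; treat it as illustrative):

```python
import xarray as xr

# Fast: times stay as raw numbers plus their units attribute
ds = xr.open_dataset("pld2_rt.nc", decode_times=False)  # file name assumed

# Slow: decoding kicks in and warns, roughly:
#   SerializationWarning: Unable to decode time axis into full
#   numpy.datetime64 objects, continuing using cftime.datetime objects
#   instead, reason: dates out of range
ds = xr.open_dataset("pld2_rt.nc")
```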
This was a 🚩 for me. Why would recent data from an instrument need anything but a totally standard datetime? So I looked at the time variable's values and metadata.

This did not make sense. If you believe these units, that would place us in the year 3971. This is out of the range of what pandas time indexes can hold, which was triggering the usage of cftime.
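That year is consistent with the values being Matlab datenums (day counts from year 0) while the units attribute claims a much later reference date; assuming a 1950 reference for illustration, the arithmetic reproduces it:

```python
# A Matlab datenum for early 2022 is roughly 738,500 (days counted from year 0).
# Misread as "days since 1950-01-01", those days land ~2022 years after 1950:
datenum = 738_500
print(1950 + datenum / 365.25)  # -> ~3971.6
```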
So next I tried removing an offset from the time coordinate and then decoding again, along the lines of the sketch below.
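A hedged sketch of that step (the exact offset is my assumption: the Matlab datenum of the units' reference date, so the values become day counts from that date instead of from year 0):

```python
import xarray as xr

ds = xr.open_dataset("pld2_rt.nc", decode_times=False)  # file name assumed

matlab_epoch_1950 = 712224.0  # Matlab datenum of 1950-01-01
raw = ds["time_20Hz"].load()
with xr.set_options(keep_attrs=True):  # keep the units attribute for decoding
    ds["time_20Hz"] = raw - matlab_epoch_1950

ds = xr.decode_cf(ds)
```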
Things are looking a lot better. We now have a datetime index for `time_20Hz`. However, I noticed that `time_1Hz` was not getting decoded at all. Looking deeper at the metadata, I noticed that the wrong attribute name was used for the units (`unit` instead of `units`).
Fixing this allows both variables to be decoded quickly and correctly. I think it is useful to step through this because it shows how a metadata problem can translate into a performance problem.
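For completeness, a hedged sketch of that attribute fix:

```python
# Copy the misnamed attribute to the CF-standard name so the decoder sees it.
attrs = ds["time_1Hz"].attrs
if "unit" in attrs and "units" not in attrs:
    attrs["units"] = attrs.pop("unit")
ds = xr.decode_cf(ds)
```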