Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dealing with netcdf time:units with unexpected characters #296

Open
durack1 opened this issue Oct 8, 2022 · 5 comments
Open

Dealing with netcdf time:units with unexpected characters #296

durack1 opened this issue Oct 8, 2022 · 5 comments

Comments

@durack1
Copy link

durack1 commented Oct 8, 2022

I've found a delightful edge case that is a little hard to believe. It involves a netcdf time:units that includes a character outside of the [0-9,-] range. If it's not obvious from the below, the issue is that the time:units = "days since 20O1-1-1" whereas this should be time:units = "days since 2001-1-1" (so replacing the rogue "O" (oooh), with the numeral "0" zero).

The file is a 297MiB file downloadable from here

Below is the example reproducing the error:

Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> xr.__version__
'2022.6.0'
>>> import cftime
>>> import cftime as cft
>>> cft.__version__
'1.6.1'
>>> ds = xr.open_dataset("/p/css03/esgf_publish/cmip3/ipcc/data3/sresa2/ice/mo/sic/ingv_echam4/run1/sic_O1.nc")
Traceback (most recent call last):
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 270, in decode_cf_datetime
    dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 207, in _decode_datetime_with_pandas
    raise OutOfBoundsDatetime(
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Cannot decode times from a non-standard calendar, '360_day', using pandas.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 180, in _decode_cf_datetime_dtype
    result = decode_cf_datetime(example_value, units, calendar, use_cftime)
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 272, in decode_cf_datetime
    dates = _decode_datetime_with_cftime(
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 201, in _decode_datetime_with_cftime
    cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
  File "src/cftime/_cftime.pyx", line 549, in cftime._cftime.num2date
  File "src/cftime/_cftime.pyx", line 107, in cftime._cftime._dateparse
  File "src/cftime/_cftime.pyx", line 750, in cftime._cftime._parse_date
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/backends/api.py", line 531, in open_dataset
    backend_ds = backend.open_dataset(
  File "~mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 569, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/backends/store.py", line 29, in open_dataset
    vars, attrs, coord_names = conventions.decode_cf_variables(
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/conventions.py", line 521, in decode_cf_variables
    new_vars[k] = decode_cf_variable(
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/conventions.py", line 369, in decode_cf_variable
    var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 682, in decode
    dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 190, in _decode_cf_datetime_dtype
    raise ValueError(msg)
ValueError: unable to decode time units 'days since 20O1-1-1' with "calendar '360_day'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

I wonder if a regex check would be useful to implement? This problem tripped me up for a while, and it was not at all obvious that an incorrect character (which looks almost identical, depending on fonts) was the root cause. Testing for a datestring that matches regex r"(?:[0-9][0-9])?[0-9][0-9]-(?:[0-1])?[0-9]-(?:[0-3])?[0-9]" could be a useful test to catch such a fringe case - and point out the issue obviously in the error message. It seems in the CF Conventions docs that there is little leeway in this format, so using "/" or alternative MM-DD-YYYY formats to the standard [YYY]Y-[M]M-[D]D HH:MM:SS.ss [-]0:00

And just because pydata/xarray#7144

@durack1 durack1 changed the title Dealing with highly unusual netcdf time:units with unexpected characters Dealing with netcdf time:units with unexpected characters Oct 8, 2022
@jswhit
Copy link
Collaborator

jswhit commented Oct 11, 2022

A PR that catches this, and adds more leeway in the formatting (consistent with CF) would be welcome.

@durack1
Copy link
Author

durack1 commented Oct 11, 2022

@jswhit I am a little confused how the regex following the CF standards at https://github.com/Unidata/cftime/blob/master/src/cftime/_cftime.pyx#L45-L47 isn't catching this, it's an out-of-bounds character

I'm unfamiliar with this library, so would need some pointers to get started

@jswhit
Copy link
Collaborator

jswhit commented Oct 11, 2022

Sorry, can't be of much help - as the comment says this regex was lifted from http://delete.me.uk/2005/03/iso8601.html but apparently that link no longer exists. I know almost nothing about regexes.

@davidhassell
Copy link
Contributor

Hi - I've looked at this regex in the (dim and distant) past - happy to have take a look here, if that's OK.

The quick test below shows the regex passing in the "letter O" case, but has too many Nones in its matched groups. I'll have a look at whether the regex could/should be updated, or if a check in cpdef _parse_date(datestring) (https://github.com/Unidata/cftime/blob/v1.6.2rel/src/cftime/_cftime.pyx#L750), which is where the regex is used, might be an alternative.

>>> import re
>>> ISO8601_REGEX = re.compile(r"(?P<year>[+-]?[0-9]+)(-(?P<month>[0-9]{1,2})(-(?P<day>[0-9]{1,2})"
                           r"(((?P<separator1>.)(?P<hour>[0-9]{1,2}):(?P<minute>[0-9]{1,2})(:(?P<second>[0-9]{1,2})(\.(?P<fraction>[0-9]+))?)?)?"
                           r"((?P<separator2>.?)(?P<timezone>Z|(([-+])([0-9]{2})((:([0-9]{2}))|([0-9]{2}))?)))?)?)?)?"
                           )  # From https://github.com/Unidata/cftime/blob/v1.6.2rel/src/cftime/_cftime.pyx#L45-L48
>>> ISO8601_REGEX.match('2001-01-01').groups()  # All numbers
('2001',
 '-01-01',
 '01',
 '-01',
 '01',
 '',
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None)
>>> ISO8601_REGEX.match('20O1-01-01').groups()  # Letter O
('20',
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None)

@durack1
Copy link
Author

durack1 commented Oct 12, 2022

Thanks @davidhassell! I hit some weird duplication issue in more simple regex I was using, it validated correctly, but then seemed to cycle over things again and again - https://regex101.com/ was really helpful in pointing that out to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants