Update xcdat_open() #1212

Merged
merged 4 commits into from
Dec 23, 2024
63 changes: 59 additions & 4 deletions pcmdi_metrics/io/xcdat_openxml.py
@@ -7,9 +7,11 @@
 import xcdat as xc
 import xmltodict

+from pcmdi_metrics.io.xcdat_dataset_io import get_calendar
+

 def xcdat_open(
-    infile: Union[str, list], data_var: str = None, decode_times: bool = True
+    infile: Union[str, list], data_var: str = None, decode_times: bool = True, chunks={}
Contributor

@acordonez Thank you for the PR! It looks good to me in general.

This may not be a big deal, but I wonder if this could be chunks=None, to be consistent with the default of the underlying function, xarray.open_mfdataset.

Collaborator Author

@lee1043 The extremes and drcdm metrics need to be able to specify the chunks to ensure that the time axis is continuous across a single chunk.

Collaborator Author

@lee1043 Would chunks=None mean that no chunking is used by default? That might be fine?

Contributor

> @lee1043 The extremes and drcdm metrics need to be able to specify the chunks to ensure that the time axis is continuous across a single chunk.

That sounds like a good reason to keep the PR as it is. Thanks for the comment!

Contributor

@lee1043 lee1043 Jan 6, 2025

@acordonez I just found that chunks={} as the default causes an error in the modes of variability code, because the dataset is opened with dask chunks. I can make modes of variability a special case by passing chunks=None when using xcdat_open, but I haven't tested the other metrics. If dask chunks are needed for only a few metrics, how about setting the default to chunks=None and calling xcdat_open with chunks={} in those special cases?

Collaborator Author

@lee1043 That suggestion sounds good to me.
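
The arrangement agreed on in this thread can be sketched as follows. This is only an illustration, not code from the PR: `open_for_metric` and `fake_opener` are hypothetical names, and `fake_opener` stands in for `xcdat_open` so the example runs without any data files. The `{"time": -1}` spec assumes the usual dask convention that `-1` keeps a whole dimension in one chunk.

```python
from typing import Any, Callable, Dict, Optional, Union

# Accepted chunk specifications, mirroring the docstring in this PR.
ChunkSpec = Optional[Union[int, str, Dict[str, int]]]


def open_for_metric(
    infile: str,
    opener: Callable[..., Any],
    chunks: ChunkSpec = None,
    **kwargs: Any,
) -> Any:
    """Forward `chunks` to `opener`, defaulting to None (no dask chunking)."""
    return opener(infile, chunks=chunks, **kwargs)


def fake_opener(infile: str, chunks: ChunkSpec = None, **kwargs: Any) -> dict:
    """Hypothetical stand-in for xcdat_open that records its arguments."""
    return {"infile": infile, "chunks": chunks, **kwargs}


# Most metrics (e.g. modes of variability): rely on the default, no dask.
default_call = open_for_metric("mydata.nc", fake_opener)

# Metrics that need dask (e.g. extremes): opt in explicitly, here asking
# for a single chunk along the time axis.
chunked_call = open_for_metric("mydata.nc", fake_opener, chunks={"time": -1})
```

With this layout, only the call sites that need dask arrays have to mention `chunks` at all.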

 ) -> xr.Dataset:
     """
     Open input file (netCDF, or xml generated by cdscan)
@@ -24,6 +26,8 @@ def xcdat_open(
     decode_times : bool, optional
         If True, attempt to decode times encoded in the standard NetCDF datetime format into cftime.datetime objects.
         Otherwise, leave them encoded as numbers. This keyword may not be supported by all the backends, by default True.
+    chunks : int, "auto", dict, or None, optional
+        The chunk size used to load data into dask arrays.

     Returns
     -------
@@ -45,16 +49,67 @@ def xcdat_open(
     >>> ds = xcdat_open('mydata.xml')
     """
     if isinstance(infile, list) or "*" in infile:
-        ds = xc.open_mfdataset(infile, data_var=data_var, decode_times=decode_times)
+        try:
+            ds = xc.open_mfdataset(
+                infile, data_var=data_var, decode_times=decode_times, chunks=chunks
+            )
+        except (
+            ValueError
+        ):  # Could be due to non-cf-compliant calendar or other attribute
+            ds = xc.open_mfdataset(
+                infile, data_var=data_var, decode_times=False, chunks=chunks
+            )
+            ds = fix_noncompliant_attr(ds)
     else:
         if infile.split(".")[-1].lower() == "xml":
-            ds = _xcdat_openxml(infile, data_var=data_var, decode_times=decode_times)
+            try:
+                ds = _xcdat_openxml(
+                    infile, data_var=data_var, decode_times=decode_times, chunks=chunks
+                )
+            except (
+                ValueError
+            ):  # Could be due to non-cf-compliant calendar or other attribute
+                ds = _xcdat_openxml(
+                    infile, data_var=data_var, decode_times=False, chunks=chunks
+                )
+                ds = fix_noncompliant_attr(ds)
         else:
-            ds = xc.open_dataset(infile, data_var=data_var, decode_times=decode_times)
+            try:
+                ds = xc.open_dataset(
+                    infile, data_var=data_var, decode_times=decode_times, chunks=chunks
+                )
+            except (
+                ValueError
+            ):  # Could be due to non-cf-compliant calendar or other attribute
+                ds = xc.open_dataset(
+                    infile, data_var=data_var, decode_times=False, chunks=chunks
+                )
+                ds = fix_noncompliant_attr(ds)

     return ds.bounds.add_missing_bounds()


+def fix_noncompliant_attr(ds: xr.Dataset) -> xr.Dataset:
+    """Fix dataset attributes that do not meet cf standards
+
+    Parameters
+    ----------
+    ds: xr.Dataset
+        xarray dataset to fix
+
+    Returns
+    -------
+    xr.Dataset
+        xarray dataset with updated attributes
+    """
+    # Add any calendar fixes here
+    cal = get_calendar(ds)
+    cal = cal.replace("-", "_")
+    ds.time.attrs["calendar"] = cal
+    ds = xc.decode_time(ds)
+    return ds
+
+
 def _xcdat_openxml(
     xmlfile: str, data_var: str = None, decode_times: bool = True
 ) -> xr.Dataset:
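
The three branches of `xcdat_open` in this diff share one pattern: try to open with the requested `decode_times`, and on `ValueError` reopen with `decode_times=False` and repair the calendar attribute. The sketch below shows that pattern in isolation. It is not part of the PR: `open_with_time_fallback`, `normalize_calendar`, `fake_opener`, and `fake_fixer` are hypothetical names, the datasets are plain dictionaries, and the `"365-day"` calendar string is an assumed example of a non-compliant value.

```python
from typing import Any, Callable


def normalize_calendar(calendar: str) -> str:
    """Replace hyphens with underscores, as fix_noncompliant_attr does."""
    return calendar.replace("-", "_")


def open_with_time_fallback(
    opener: Callable[..., Any],
    fixer: Callable[[Any], Any],
    infile: str,
    decode_times: bool = True,
    **kwargs: Any,
) -> Any:
    """Open `infile`; on ValueError, reopen undecoded and apply `fixer`."""
    try:
        return opener(infile, decode_times=decode_times, **kwargs)
    except ValueError:
        # Could be a non-CF-compliant calendar or other attribute.
        ds = opener(infile, decode_times=False, **kwargs)
        return fixer(ds)


def fake_opener(infile: str, decode_times: bool = True, **kwargs: Any) -> dict:
    """Hypothetical opener that fails to decode a hyphenated calendar."""
    calendar = "365-day"  # assumed non-compliant value, for illustration only
    if decode_times:
        raise ValueError(f"cannot decode times with calendar {calendar!r}")
    return {"calendar": calendar, "decoded": False}


def fake_fixer(ds: dict) -> dict:
    """Hypothetical fixer: repair the calendar, then mark times as decoded."""
    return {"calendar": normalize_calendar(ds["calendar"]), "decoded": True}


result = open_with_time_fallback(fake_opener, fake_fixer, "mydata.nc")
```

Passing the opener as an argument is one way the same fallback could serve `xc.open_mfdataset`, `_xcdat_openxml`, and `xc.open_dataset` without repeating the `try`/`except` block; whether that refactor suits the project is a separate question.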