
99 datasets failed in latest run #1

Open
ellesmith88 opened this issue May 27, 2021 · 8 comments

@ellesmith88 (Contributor)

Description

On the latest run to create the intake catalogue, using the datasets listed at https://github.com/cp4cds/c3s_34g_qc_results/blob/release3/QC_Results/QC_passed_dataset_ids_latest.txt , 99 datasets failed to scan successfully. The error output is attached:
errors.txt

There were 4 different kinds of error:

1. KeyError: "Receive multiple variables for key 'longitude': ['longitude', 'lon']. Expected only one. Please pass a list ['longitude'] instead to get all variables matching 'longitude'."

This is the issue I mentioned in roocs/roocs-utils#61. So it looks like lon shouldn't have both units=degrees_east and standard_name=longitude.
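The clash can be reproduced without any data files: two variables both tagged with standard_name=longitude make a lookup by standard name ambiguous. A minimal sketch of such a check, with illustrative attribute dicts (the attribute values are not taken from the failing files):

```python
# Detect variables whose CF attributes claim the same standard_name.
# The attribute dicts below are illustrative, not from the failing files.
var_attrs = {
    "lon": {"units": "degrees_east", "standard_name": "longitude"},
    "longitude": {"units": "degrees_east", "standard_name": "longitude"},
    "lat": {"units": "degrees_north", "standard_name": "latitude"},
}

def find_standard_name_clashes(attrs_by_var):
    """Return {standard_name: [var names]} for names claimed by more than one variable."""
    claimed = {}
    for name, attrs in attrs_by_var.items():
        std = attrs.get("standard_name")
        if std is not None:
            claimed.setdefault(std, []).append(name)
    return {std: names for std, names in claimed.items() if len(names) > 1}

print(find_standard_name_clashes(var_attrs))
# → {'longitude': ['lon', 'longitude']}
```

A lookup that expects exactly one match per standard name will raise on any entry this helper reports, which is consistent with the KeyError above.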

  2. a)
Exception: Latitude is not within expected bounds. The minimum and maximum are -79.22052001953125, 9.969209968386869e+36

This also happens for longitude in the files with this error. There doesn't seem to be a fill value set for latitude/longitude, but the fill value is set to 1e+20 for the main variable. These all look to be NCAR datasets.

  2. b)
Exception: Latitude is not within expected bounds. The minimum and maximum are -79.22052001953125, 1.0000000150474662e+30

First mentioned in cp4cds/c3s_34g_manifests#10. Also the same for longitude. The fill value for latitude and longitude is set to 1e+20. These are SNU.SAM0-UNICON datasets and one CAS.FGOALS-g3 dataset.

There is no mention of these issues on the errata service. When I tested opening these datasets with netCDF4, for 2(a) the fill values were set to 9e+26 and the maximum of the latitude/longitude was normal, so this is only an issue with xarray.
For 2(b), when opening with netCDF4, the fill values are set to 1e+20 and the max latitude/longitude is 1e+30 as well, so this is an issue with the data itself.
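The bounds failure itself is easy to reproduce: once an unmasked fill value leaks into a coordinate array, its maximum blows past any plausible latitude. A pure-Python sketch of the kind of range check involved (the +/-90 threshold is the valid latitude range; the sample values are illustrative):

```python
# Valid latitudes lie in [-90, 90], so an unmasked fill value
# makes the reported maximum wildly out of bounds.
NETCDF_DEFAULT_FILL = 9.969209968386869e+36   # the 2(a) value
BAKED_IN_FILL = 1.0000000150474662e+30        # the 2(b) value

def latitude_out_of_bounds(values):
    """Return True if any value falls outside the valid latitude range."""
    return min(values) < -90.0 or max(values) > 90.0

good = [-79.22052001953125, 0.0, 45.5]
print(latitude_out_of_bounds(good))                          # → False
print(latitude_out_of_bounds(good + [NETCDF_DEFAULT_FILL]))  # → True
print(latitude_out_of_bounds(good + [BAKED_IN_FILL]))        # → True
```

This also explains the split between 2(a) and 2(b): in 2(a) the value only appears when the masking step is skipped, while in 2(b) it is baked into the file, so the check fails regardless of the reader.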

  3. For the HDF errors: I've checked the checksums and they all match what's on https://esgf-index1.ceda.ac.uk/search/cmip6-ceda/.

Investigating opening these with netCDF4:

  • /badc/cmip6/data/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/va/gn/v20181218/va_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_186601-186612.nc: Opening error with both xarray and netCDF4

  • /badc/cmip6/data/CMIP6/ScenarioMIP/MOHC/HadGEM3-GC31-LL/ssp126/r1i1p1f3/Amon/ts/gn/v20200114/ts_Amon_HadGEM3-GC31-LL_ssp126_r1i1p1f3_gn_205001-210012.nc: Error opening with xarray only - opens fine with netCDF4

  • /badc/cmip6/data/CMIP6/ScenarioMIP/MOHC/HadGEM3-GC31-MM/ssp126/r1i1p1f3/Omon/tos/gn/v20200515/tos_Omon_HadGEM3-GC31-MM_ssp126_r1i1p1f3_gn_203001-204912.nc: Error getting values of lat/lon with both xarray and netCDF4

  • /badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/tas/gn/v20190406/tas_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_185001-194912.nc: Opening error with both xarray and netCDF4

  4. Data with the time range 4029-01-16 12:00:00 to 4114-12-16 12:00:00, mentioned in cp4cds/c3s_34g_manifests#9. Only 3 of the datasets I scanned have this issue.
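The checksum comparison against the ESGF index can be sketched like this; the helper name and the usage lines are hypothetical (ESGF publishes SHA-256 checksums for CMIP6 files):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks
    so large netCDF files don't have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage - the expected value would come from the ESGF search page:
# expected = "..."
# ok = sha256_of("/badc/cmip6/data/CMIP6/.../va_Amon_....nc") == expected
```

A matching checksum rules out a corrupt copy on disk, which is why the HDF errors below pointed at the reading environment rather than the downloads.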
@ellesmith88 (Contributor Author)

I came across this xarray issue, pydata/xarray#2742, which might be related to 2(a).

@agstephens

Suggested actions for each issue:
(1) We can fix.
(2) If these are dimensions, they cannot be fixed; if they are auxiliary coordinates, we can fix.
(3) @ellesmith88 to double-check the results of the HDF errors.
(4) We cannot fix - exclude these datasets.

In all cases:

@ellesmith88 (Contributor Author)

Checking the HDF errors:

/badc/cmip6/data/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/va/gn/v20181218/va_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_186601-186612.nc : Opens fine with xarray and netCDF4

/badc/cmip6/data/CMIP6/ScenarioMIP/MOHC/HadGEM3-GC31-LL/ssp126/r1i1p1f3/Amon/ts/gn/v20200114/ts_Amon_HadGEM3-GC31-LL_ssp126_r1i1p1f3_gn_205001-210012.nc: Opens fine with xarray and netCDF4

/badc/cmip6/data/CMIP6/ScenarioMIP/MOHC/HadGEM3-GC31-MM/ssp126/r1i1p1f3/Omon/tos/gn/v20200515/tos_Omon_HadGEM3-GC31-MM_ssp126_r1i1p1f3_gn_203001-204912.nc: Still an error getting values of lat/lon with both xarray (ds.latitude.values) and netCDF4 (ds['latitude'][:])

/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/tas/gn/v20190406/tas_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_185001-194912.nc: Opening error with both xarray and netCDF4

@agstephens

@ellesmith88: Please re-check both files that had errors. I copied them off quobyte and re-ingested the same files, which I think fixes them! I'm confused.

@ellesmith88 (Contributor Author)

@agstephens they both work now!

@agstephens

agstephens commented Jun 10, 2021

@ellesmith88: actions as agreed:

  • 1. Duplicate lat and/or lon definitions in a file:
    • Contact the data provider to confirm which coord variables should be tagged with the CF attributes standard_name and units.
    • Write a fix to remove these attributes from the relevant coord variables.
  • 2. Missing values in coordinate variables: if these are dimensions, they cannot be fixed (see the CF rule at https://cfconventions.org/cf-conventions/cf-conventions.html#missing-data); if they are auxiliary coordinates, we can fix.
    • Write a fix for this.
  • 3. No action needed on item 3.
  • 4. No action needed on item 4.
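The dimension-versus-auxiliary distinction in action 2 can be sketched as follows; the dims mapping is an illustrative curvilinear ocean grid, not one of the failing datasets:

```python
# Per CF, a coordinate variable (one whose only dimension shares its own
# name) must not contain missing data, so only auxiliary coordinates are
# candidates for a fill-value fix. Illustrative curvilinear-grid dims:
coord_dims = {
    "i": ("i",),              # coordinate variable -> cannot be fixed
    "j": ("j",),              # coordinate variable -> cannot be fixed
    "latitude": ("j", "i"),   # auxiliary coordinate -> fixable
    "longitude": ("j", "i"),  # auxiliary coordinate -> fixable
}

def is_fixable(name, dims):
    """A coordinate is fixable unless its dims are exactly (its own name,)."""
    return dims != (name,)

fixable = [name for name, dims in coord_dims.items() if is_fixable(name, dims)]
print(fixable)  # → ['latitude', 'longitude']
```

The same dims == (name,) test appears in the fix function posted later in this thread.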

@ellesmith88 (Contributor Author)

ellesmith88 commented Jun 14, 2021

    • Data providers have been contacted but have not yet responded.
    • All affected coordinates are auxiliary coordinates, but a check for this can be included in the fix anyway.

Fixes below:

def replace_lat_and_lon_fill_values(ds, **operands):

    # get the value to mask
    value = operands.get('value')
    # value must be provided as a string (to work with elasticsearch), so convert to a number
    if isinstance(value, str):
        value = float(value)

    # get the latitude and longitude coordinate names
    # (xu is the xarray utils module from roocs-utils)
    lat = xu.get_coord_by_type(ds, 'latitude', ignore_aux_coords=False).name
    lon = xu.get_coord_by_type(ds, 'longitude', ignore_aux_coords=False).name

    # if either is a coordinate variable (i.e. its own dimension) - don't fix,
    # since CF forbids missing data in coordinate variables
    for coord_id in (lat, lon):
        if ds.coords[coord_id].dims == (coord_id,):
            return ds

    # option 1: mask to NaN
    # ds[lat] = ds[lat].where(ds[lat] != value)
    # ds[lon] = ds[lon].where(ds[lon] != value)

    # option 2: replace with 1e+20 and set the fill value in the encoding to match
    ds[lat] = ds[lat].where(ds[lat] != value, 1e+20)
    ds[lon] = ds[lon].where(ds[lon] != value, 1e+20)
    ds[lat].encoding['_FillValue'] = 1e+20
    ds[lon].encoding['_FillValue'] = 1e+20

    return ds


def remove_var_attrs(ds, **operands):
    """
    :param ds: Xarray Dataset
    :param operands: keyword arguments; expects "var_id" and "attrs"
    :return: Xarray Dataset

    Remove the named attributes from a variable.
    """
    var_id = operands.get("var_id")

    for attr in operands.get("attrs", []):
        ds[var_id].attrs.pop(attr, None)

    return ds

@ellesmith88 (Contributor Author)

  • The data providers haven't replied concerning the duplicate lat and/or lon definitions in the files

Fixes have been put into branches: https://github.com/roocs/dachar/compare/failed_dataset_fixes and https://github.com/roocs/daops/compare/failed_dataset_fixes

  • These fixes target datasets that fail while being scanned into the catalogue, so they would need to be applied before that scan.
  • The fixes would also need to be turned on.
