[CMIP6] Integrate check-up with ESGF Errata Service #77

Open
Zeitsperre opened this issue Sep 7, 2022 · 4 comments
@Zeitsperre
Collaborator

For CMIP6, there are numerous issues that require tracking and follow-up. Thankfully, the ESGF maintains both an online database of issues (https://errata.es-doc.org/static/index.html) and an API to query this database (https://es-doc.github.io/esdoc-errata-client/api.html).

There isn't really a lightweight Python-based method of verifying that a file does not have issues, but I can imagine a very easy means of cobbling something together:

  • Decode the facets from a given file
  • Construct a URL based on the Data Reference Syntax for the associated file/project
    • e.g. CMIP6 - CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1.Omon.si.gn#20180727
  • Send a request to the ESGF official errata database (https://errata.es-doc.org/1/resolve/simple-pid?datasets=XX.YY.ZZ)
    • Returns a JSON with hasErrata field (boolean)
  • If errata are found, populate a list of files that require re-download
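
The steps above could be sketched roughly as follows, using only the standard library. This is a minimal sketch, not a tested client: the facet names and their order in `CMIP6_DRS_ORDER` are assumed from the example dataset ID, the function names are hypothetical, and the layout of the JSON response is not known here beyond the `hasErrata` field, so the whole payload is searched for it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint quoted in the list above.
ERRATA_SIMPLE_PID = "https://errata.es-doc.org/1/resolve/simple-pid"

# Assumed CMIP6 DRS facet order, inferred from the example dataset ID.
CMIP6_DRS_ORDER = (
    "mip_era", "activity_id", "institution_id", "source_id",
    "experiment_id", "variant_label", "table_id", "variable_id", "grid_label",
)


def dataset_id(facets, version):
    """Join decoded facets into a DRS dataset ID such as 'CMIP6....gn#20180727'."""
    drs = ".".join(facets[name] for name in CMIP6_DRS_ORDER)
    return drs + "#" + version.lstrip("v")


def errata_url(dataset):
    """Build the query URL; '#' is percent-encoded so it is not read as a fragment."""
    query = urlencode({"datasets": dataset})
    return ERRATA_SIMPLE_PID + "?" + query


def has_errata(payload):
    """Return True if any 'hasErrata' field in the decoded JSON is true.

    The exact response layout is an assumption, so nested dicts and lists
    are searched recursively.
    """
    if isinstance(payload, dict):
        if payload.get("hasErrata") is True:
            return True
        return any(has_errata(value) for value in payload.values())
    if isinstance(payload, list):
        return any(has_errata(item) for item in payload)
    return False


def datasets_to_redownload(datasets, timeout=30):
    """Query the errata service for each dataset ID and keep those with errata."""
    flagged = []
    for dataset in datasets:
        with urlopen(errata_url(dataset), timeout=timeout) as response:
            if has_errata(json.load(response)):
                flagged.append(dataset)
    return flagged
```

Only `datasets_to_redownload` touches the network; the other helpers are pure functions, so the facet-to-URL logic can be unit-tested offline.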

I don't think this would even require the esgissue command-line client that is offered for this purpose. (My apprehension about extending support to esgissue is that the client is much more powerful than miranda needs: it allows creating and resolving issues, requires a GitHub access token, etc.)

Given that we currently maintain a database of CMIP6 data and that there are more than a few known errors to date, there is clearly a need for this functionality.

Zeitsperre added the enhancement label (New feature or request) on Sep 7, 2022
Zeitsperre self-assigned this on Sep 7, 2022
@huard
Collaborator

huard commented Sep 8, 2022

I think you could use the persistent identifier directly instead of reconstructing the dataset string from the DRS.

Each CMIP6 file has a tracking_id global attribute that you can feed to the errata service API:

https://errata.es-doc.org/1/resolve/pid?pids=hdl:21.14100/15e49fc9-de86-433a-908d-6ae578491e27

You would probably then need to use the version attribute to check for specific errata, and to see whether a new version is available for download.
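
As a rough sketch of this approach, the request URL can be built straight from the `tracking_id` string. Reading the attribute from the file is left out (it would need a NetCDF library), and the helper names below are hypothetical. The response format is not assumed here, so the sketch stops at building the URL.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above.
ERRATA_PID = "https://errata.es-doc.org/1/resolve/pid"


def normalise_pid(tracking_id):
    """Return the tracking_id with an 'hdl:' prefix, adding it if missing."""
    tracking_id = tracking_id.strip()
    if tracking_id.startswith("hdl:"):
        return tracking_id
    return "hdl:" + tracking_id


def errata_pid_url(tracking_id):
    """Build the errata lookup URL for one file-level persistent identifier."""
    # ':' and '/' are kept unescaped so the URL matches the form shown above.
    query = urlencode({"pids": normalise_pid(tracking_id)}, safe=":/")
    return ERRATA_PID + "?" + query


print(errata_pid_url("hdl:21.14100/15e49fc9-de86-433a-908d-6ae578491e27"))
```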

@huard
Collaborator

huard commented Sep 8, 2022

Does xscen log the tracking_id of input files? I suggest this is a good practice to implement. In the IPCC AR6, tracking the CMIP6 source files accurately has proven more difficult than anticipated, despite the existence of these pids.

@huard
Collaborator

huard commented Sep 8, 2022

This is the API to resolve pid handles: http://hdl.handle.net/
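
A hedged sketch of resolving a handle programmatically: to my understanding the Handle.net proxy also exposes a REST endpoint under `/api/handles/` that returns a JSON record with a `values` list, each entry carrying a `type` and a `data.value`. That layout, and the helper names, are assumptions here; which record types a CMIP6 handle actually carries is not covered.

```python
import json
from urllib.request import urlopen

# Assumed REST endpoint of the Handle.net proxy.
HANDLE_API = "https://hdl.handle.net/api/handles/"


def handle_url(tracking_id):
    """Strip the 'hdl:' prefix from a tracking_id and build the lookup URL."""
    handle = tracking_id.strip()
    if handle.startswith("hdl:"):
        handle = handle[len("hdl:"):]
    return HANDLE_API + handle


def handle_values(record):
    """Flatten the assumed {'values': [{'type', 'data': {'value'}}]} layout
    into a {type: value} dict; later entries of the same type overwrite earlier ones."""
    flat = {}
    for entry in record.get("values", []):
        flat[entry.get("type")] = entry.get("data", {}).get("value")
    return flat


def resolve(tracking_id, timeout=30):
    """Fetch and flatten the handle record (network access required)."""
    with urlopen(handle_url(tracking_id), timeout=timeout) as response:
        return handle_values(json.load(response))
```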

@Zeitsperre
Collaborator Author

> Does xscen log the tracking_id of input files?

No, I don't believe it does. xscen constructs catalogues based on folder-tree structures. But given that xscen and the database management utilities here build on each other, I think there's a need to delineate the "responsibilities" between these tools.

Will look into the PID approach. Thanks for the suggestions, @huard!
