Add schema for checksum files used in model repro CI checks #5

jo-basevi · 2024-02-07T05:50:20Z

This PR adds the checksum schema used in the model reproducibility tests for ACCESS-OM2 model configs, which are introduced in ACCESS-NRI/access-om2-configs#2.

It is also a test implementation of the schema versioning discussed in issue #4.

jo-basevi · 2024-02-07T05:51:03Z

Open questions:
I wanted a schema URL associated with each checksum file so that it's straightforward to find its corresponding schema. However, I am unsure whether the schema should have a const field for it, since the URL can be generated by the code that produces the checksums. Note that the URL format is https://raw.githubusercontent.com/${ORGANISATION}/${REPO}/${TAG}/${PATH/TO/FILE}, and the URL will only work once the tag has been pushed.

"schema": {
    "type": "string",
    "const": "https://raw.githubusercontent.com/ACCESS-NRI/schema/access-om2-checksums-v1.0/access-om2-checksums/access-om2-checksums-v1.0.json"
},
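As a rough illustration of generating the URL in code instead of hard-coding it as a const, here is a minimal Python sketch. The helper name `schema_url` and its parameters are hypothetical; only the URL pattern and the example values come from the comment above.

```python
def schema_url(organisation: str, repo: str, tag: str, path: str) -> str:
    """Build the raw.githubusercontent.com URL for a schema file at a git tag.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Strip any leading slash so the path joins cleanly onto the tag.
    return (
        "https://raw.githubusercontent.com/"
        f"{organisation}/{repo}/{tag}/{path.lstrip('/')}"
    )


url = schema_url(
    organisation="ACCESS-NRI",
    repo="schema",
    tag="access-om2-checksums-v1.0",
    path="access-om2-checksums/access-om2-checksums-v1.0.json",
)
print(url)
```

Generating the value this way keeps the tag in one place, which is the main argument against duplicating it in a const field.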

Similarly, I am unsure about having a const for the schema version. The version would be the git tag associated with that schema version, e.g. access-om2-checksums-v1.0.

At this stage I've left the const fields out, as I think they are error-prone, but I can add them back in.

As there isn't anything access-om2-specific in the schema yet, it could actually be more generic, e.g. model-output-checksums or something like that.

CodeGat · 2024-02-14T23:32:58Z

I reckon leaving the const fields out is fine. And as for the more generic fields - they would be good, but I don't think they need to be in there for the current release. I'd be happy to merge this now, potentially.

aidanheerdegen · 2024-02-15T01:24:10Z

Is the $VERSION format specified anywhere? Are we following schemaver?

https://docs.snowplow.io/docs/pipeline-components-and-applications/iglu/common-architecture/schemaver/

Are we just keeping the most recent major version as a file in the repo and relying on git version control to keep track of the file and using tags to refer to explicit minor versions of the schema?

CodeGat · 2024-02-15T01:30:43Z

As to your second question, I think I recall Jo saying that there would be a file per version in the repo. I think this tracks with what @harshula said about discoverability.

...
access-om2-schema
 |-- CHANGELOG.md
 |-- README.md
 |-- access-om2-checksums-v1.0.json
 |-- access-om2-checksums-v1.1.json
 `-- ...etc...
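With one file per version, tooling could discover the newest schema from the filenames alone. A minimal sketch, assuming the `<name>-v<major>.<minor>.json` naming shown in the listing above; the function name `latest_schema` is hypothetical.

```python
import re

# Assumed filename pattern: <name>-v<major>.<minor>.json
VERSIONED_FILE = re.compile(r"^(?P<name>.+)-v(?P<major>\d+)\.(?P<minor>\d+)\.json$")


def latest_schema(filenames, name):
    """Return the filename with the highest (major, minor) version for `name`, or None."""
    candidates = []
    for filename in filenames:
        match = VERSIONED_FILE.match(filename)
        if match and match.group("name") == name:
            version = (int(match.group("major")), int(match.group("minor")))
            candidates.append((version, filename))
    # Compare versions numerically so that v1.10 sorts after v1.9.
    return max(candidates)[1] if candidates else None


files = [
    "CHANGELOG.md",
    "README.md",
    "access-om2-checksums-v1.0.json",
    "access-om2-checksums-v1.1.json",
]
print(latest_schema(files, "access-om2-checksums"))
```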

aidanheerdegen · 2024-02-15T06:08:29Z

> I think I recall Jo saying that there would be a file per version in the repo

The filename for any particular version is unique. So we don't need to use tags?

aidanheerdegen · 2024-02-15T19:33:24Z

I had a change of heart about implementing a proper directory structure. I think we should start with any new schemas we add, and we can slot the existing schema into directories at a later date.

Here is a proposal, but suggestions and discussion are welcome:

experiment/
└── reproducibility/
    └── model/
        └── access-om2/
            └── checksums/
                ├── access-om2-checksums-v1.0.json
                └── access-om2-checksums-v1.1.json

There is an argument that this might make schemas less discoverable, though the utility of the hierarchy probably increases as the number of schemas grows.

To get the categorical juices flowing, schema.org's hierarchy might be useful

https://schema.org/docs/full.html

CodeGat · 2024-02-15T22:11:43Z

I feel like putting the model at the top of the hierarchy would be best. Because we can have a model without reproducibility, but no reproducibility without a model (am I explaining myself correctly? :D ). Kinda like, a model can have a bunch more subfolders than reproducibility can, and we might have to add multiple access-om2 folders later for different aspects not related to reproducibility, if we go with the above. Something like:

.
└── model/
    └── access-om2/
        └── experiment/
            └── reproducibility/
                ├── checksums/
                │   ├── access-om2-checksums_1-0-0.json
                │   └── access-om2-checksums_1-1-0.json
                └── performance/
                    └── etc...

aidanheerdegen · 2024-02-15T22:30:35Z

> I feel like putting the model at the top of the hierarchy would be best

You have convinced me. I agree. Make it so!

aidanheerdegen · 2024-02-15T23:30:37Z

I think we've decided to adopt schemaver in the absence of anything else and because it seems well thought out and designed.

If we wanted to have a schema server in the future then it might be prudent to organise it the way the snowplow folks have done.

They organise their schema like so

au.org.access-nri/
└── model/
    └── access-om2/
        └── experiment/
            └── reproducibility/
                ├── checksums/
                │   ├── 1-0-0.json
                │   ├── 1-1-1.json
                │   └── CHANGELOG
                └── performance/
                    └── etc...

e.g.

https://github.com/snowplow/iglu-central/blob/master/schemas/com.google.analytics/cookies/jsonschema/1-0-0

Maybe au.org.access-nri is implied?
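To make the layout concrete, here is a small Python sketch that composes and decomposes a schema path in this style. The function names are hypothetical, and treating the first path component as the vendor prefix is an assumption based on the tree above.

```python
from pathlib import PurePosixPath


def schema_path(vendor, categories, version):
    """Compose <vendor>/<category>/.../<version>.json as a POSIX-style path."""
    return PurePosixPath(vendor, *categories, f"{version}.json")


def split_schema_path(path):
    """Split a schema path back into (vendor, categories, version)."""
    parts = PurePosixPath(path).parts
    # The final component is the version file; drop its .json suffix.
    version = PurePosixPath(parts[-1]).stem
    return parts[0], list(parts[1:-1]), version


path = schema_path(
    "au.org.access-nri",
    ["model", "access-om2", "experiment", "reproducibility", "checksums"],
    "1-0-0",
)
print(path)
print(split_schema_path(path))
```

If the vendor prefix were left implied, the same helpers would still work with the first category taking its place, at the cost of the path no longer being self-describing.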

So the version string is 1-0-1, not v1-0-1 or 1.0.1

Maybe we need a schema for the version string? (THAT IS A JOKE THAT ISN'T FUNNY)
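Joking aside, a lightweight check of the version string is easy to sketch. The following assumes the SchemaVer form MODEL-REVISION-ADDITION with versioning starting at 1-0-0; the regular expression and the names `SchemaVer`, `is_schemaver` and `parse_schemaver` are illustrative and not taken from any Snowplow tooling.

```python
import re
from typing import NamedTuple

# Assumed format: MODEL-REVISION-ADDITION, with MODEL >= 1 and no leading zeros.
SCHEMAVER = re.compile(r"([1-9]\d*)-(0|[1-9]\d*)-(0|[1-9]\d*)")


class SchemaVer(NamedTuple):
    model: int
    revision: int
    addition: int


def is_schemaver(text: str) -> bool:
    """Return True if `text` looks like a SchemaVer string such as 1-0-1."""
    return SCHEMAVER.fullmatch(text) is not None


def parse_schemaver(text: str) -> SchemaVer:
    """Parse a SchemaVer string into a tuple that orders numerically."""
    match = SCHEMAVER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a SchemaVer string: {text!r}")
    return SchemaVer(*(int(group) for group in match.groups()))


for candidate in ("1-0-1", "v1-0-1", "1.0.1"):
    print(candidate, is_schemaver(candidate))
```

Because `SchemaVer` is a named tuple, parsed versions compare numerically, so 1-10-0 sorts after 1-9-0.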

CodeGat · 2024-02-15T23:33:00Z

I vibe with it. Will make those changes.

aidanheerdegen

I'm approving so you can merge if you think it's ok, but I do, as always, have questions.

au.org.access-nri/model/access-om2/experiment/reproducibility/checksums/1-0-0.json

aidanheerdegen · 2024-02-16T03:12:11Z

So according to this, the self stuff is an extension to JSON Schema introduced by Snowplow, which is why their schemas point to a Snowplow URL: that is the extended schema specification.

https://snowplow.io/blog/introducing-self-describing-jsons/

I don't think we need to go that far just yet. We can add it in for later versions if necessary.

aidanheerdegen

LGTM

Add schema for checksum files used in model repro CI checks

6965980

Moved access-om2-checksums schema into folder structure, now using schemaver

0ef7abc

CodeGat added 2 commits February 16, 2024 10:39

Added encompassing au.org.access-nri folder, shortened name of schema, removed requirement of tagging schema version

2cdf655

README.md: Updated main README regarding version format

5c903b7

CodeGat requested a review from aidanheerdegen February 16, 2024 01:36

CodeGat assigned CodeGat and jo-basevi Feb 16, 2024

aidanheerdegen previously approved these changes Feb 16, 2024

1-0-0.json: Added $id field

bab4ceb

CodeGat dismissed aidanheerdegen’s stale review via bab4ceb February 16, 2024 03:19

aidanheerdegen approved these changes Feb 16, 2024

CodeGat merged commit 65b8247 into main Feb 16, 2024

CodeGat deleted the add-checksum-schemas branch February 16, 2024 03:49

aidanheerdegen mentioned this pull request Mar 1, 2024

Versioning schemas #4

Closed

jo-basevi mentioned this pull request May 14, 2024

Move access-om2 checksum schema to more general location #24

Merged
