Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema for checksum files used in model repro CI checks #5

Merged
merged 5 commits into from
Feb 16, 2024

Conversation

jo-basevi
Copy link
Contributor

This PR is for adding checksum schema used in model reproducibility tests for ACCESS-OM2 model configs, in this PR: ACCESS-NRI/access-om2-configs#2

This is also a test of an implementation for schema versioning in this issue #4

@jo-basevi
Copy link
Contributor Author

Open questions:
So I wanted to have a schema url associated with each checksum file so it's straight forward to get it's corresponding schema. However I am unsure whether I should have a const field for it in the schema, as it can be generated in the code that generates these checksums. Note the format of URL is https://raw.githubusercontent.com/${ORGANISATION}/${REPO}/${TAG}/${PATH/TO/FILE}. This url will only work once a tag is pushed.

"schema": {
    "type": "string",
    "const": "https://raw.githubusercontent.com/ACCESS-NRI/schema/access-om2-checksums-v1.0/access-om2-checksums/access-om2-checksums-v1.0.json"
},

Similar thing I am unsure about is having a const for schema version. The version will be the git tag value assoicated with that schema version so e.g. access-om2-checksums-v1.0.

At this stage I've left the const fields out, as I think they are error prone but I can add them back in.

As there isn't anything access-om2 specific in the schema, yet, it could actually be more generic? Like model-output-checksums or something like that.

@CodeGat
Copy link
Contributor

CodeGat commented Feb 14, 2024

I reckon leaving the const fields out is fine. And as for the more generic fields - they would be good, but I don't think they need to be in there for the current release. I'd be happy to merge this now, potentially.

@aidanheerdegen
Copy link
Member

Is the $VERSION format specified anywhere? Are we following schemaver?

https://docs.snowplow.io/docs/pipeline-components-and-applications/iglu/common-architecture/schemaver/

Are we just keeping the most recent major version as a file in the repo and relying on git version control to keep track of the file and using tags to refer to explicit minor versions of the schema?

@CodeGat
Copy link
Contributor

CodeGat commented Feb 15, 2024

As to your second question, I think I recall Jo saying that there would be a file per version in the repo. I think this tracks with what @harshula said about discoverability.

...
access-om2-schema
 |-- CHANGELOG.md
 |-- README.md
 |-- access-om2-checksums-v1.0.json
 |-- access-om2-checksums-v1.1.json
 `-- ...etc...

@aidanheerdegen
Copy link
Member

aidanheerdegen commented Feb 15, 2024

I think I recall Jo saying that there would be a file per version in the repo

The filename for any particular version is unique. So we don't need to use tags?

@aidanheerdegen
Copy link
Member

I had a change of heart about implementing a proper directory structure. I think we should start with any new schemas we add, and we can slot the existing schema into directories at a later date.

Here is a proposal, but suggestions/discussion is welcome

experiment/
└── reproducibility/
    └── model/
        └── access-om2/
            └── checksums/
                ├── access-om2-checksums-v1.0.json
                └── access-om2-checksums-v1.0.json

There is an argument that this might make schema less discoverable .. the utility probably increases as the number of schema increases.

To get the categorical juices flowing, schema.org's hierarchy might be useful

https://schema.org/docs/full.html

@CodeGat
Copy link
Contributor

CodeGat commented Feb 15, 2024

I feel like putting the model at the top of the hierarchy would be best. Because we can have a model without reproducibility, but no reproducibility without a model (am I explaining myself correctly? :D ). Kinda like, a model can have a bunch more subfolders than reproducibility can, and we might have to add multiple access-om2 folders later for different aspects not related to reproducibility, if we go with the above. Something like:

.
└── model/
    └── access-om2/
        └── experiment/
            └── reproducibility/
                ├── checksums/
                │   ├── access-om2-checksums_1-0-0.json
                │   └── access-om2-checksums_1-1-0.json
                └── performance/
                    └── etc...

@aidanheerdegen
Copy link
Member

I feel like putting the model at the top of the hierarchy would be best

You have convinced me. I agree. Make it so!
make-it-so-picard

@aidanheerdegen
Copy link
Member

I think we've decided to adopt schemaver in the absence of anything else and because it seems well thought out and designed.

If we wanted to have a schema server in the future then it might be prudent to organise it the way the snowplow folks have done.

They organise their schema like so

au.org.access-nri/
└── model/
    └── access-om2/
        └── experiment/
            └── reproducibility/
                ├── checksums/
                │   ├── 1-0-0.json
                │   ├── 1-1-1.json
                │   └── CHANGELOG
                └── performance/
                    └── etc...

e.g.

https://github.com/snowplow/iglu-central/blob/master/schemas/com.google.analytics/cookies/jsonschema/1-0-0

Maybe au.org.access-nri is implied?

So version string is1-0-1 not v1-0-1 or 1.0.1

Maybe we need a schema for the version string? (THAT IS A JOKE THAT ISN'T FUNNY)

@CodeGat
Copy link
Contributor

CodeGat commented Feb 15, 2024

I vibe with it. Will make those changes.

aidanheerdegen
aidanheerdegen previously approved these changes Feb 16, 2024
Copy link
Member

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving so you can merge if you think it's ok, but I do, as always, have questions.

@aidanheerdegen
Copy link
Member

So according to this, the self stuff is an extension to json schema introduced by snowplow, and this is why their schema point to a snowplow URL, as that is the extended schema specification

https://snowplow.io/blog/introducing-self-describing-jsons/

I don't think we need to go that far just yet. We can add it in for later versions if necessary.

Copy link
Member

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@CodeGat CodeGat merged commit 65b8247 into main Feb 16, 2024
@CodeGat CodeGat deleted the add-checksum-schemas branch February 16, 2024 03:49
@aidanheerdegen aidanheerdegen mentioned this pull request Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants