Purpose, motivation and implementation #1

aidanheerdegen · 2023-04-18T05:49:35Z

First up, props to @dougiesquire for starting the ball rolling.

What is the purpose of this repo?

Central location for all schema information for the ACCESS-NRI organisation.

Why?

Consistent approach across ACCESS-NRI organisation

Improves productivity, as there is only one source of truth for schema. Reduces the barrier for those new to the subject area who lack the background knowledge to create their own schema.

It also leads naturally to interoperability: if everyone uses the same schema, and those schema re-use and connect with existing schema, they all get the same connectivity "for free".

Such interconnected schema enable building knowledge graphs. A knowledge graph, or semantic network, is a graph-based representation of the connections between the objects described by the schema. A knowledge graph can make it possible to traverse data in ways that were not previously apparent.
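To make the traversal idea concrete, here is a minimal sketch of a knowledge graph stored as subject/predicate/object triples. The identifiers (`dataset:A`, `model:X`, `org:Y`) and predicate names are hypothetical, made up for illustration, and do not come from any existing ACCESS-NRI schema.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object), for illustration only.
TRIPLES = [
    ("dataset:A", "producedBy", "model:X"),
    ("dataset:B", "producedBy", "model:X"),
    ("model:X", "maintainedBy", "org:Y"),
]


def build_graph(triples):
    """Index triples by subject so that edges can be followed outward."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Return every node reachable from `start` by following edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _predicate, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


graph = build_graph(TRIPLES)
print(sorted(reachable(graph, "dataset:A")))
```

Starting from a dataset, the traversal reaches the organisation through the model, even though no triple links the two directly. That indirect connection is the kind of thing a shared schema would provide.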

Knowledge graphs are a sort of ad hoc ontology.

Discoverability

Adding schema to webpages in JSON-LD format promotes discovery. JSON-LD is a W3C standard and is widely used for semantic searching and cataloguing on the web.
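As a rough sketch of what "adding schema to webpages" could look like in practice, the snippet below wraps a record in the `<script type="application/ld+json">` element that is conventionally used to embed JSON-LD in HTML. The helper name `jsonld_script_tag` and the placeholder record are assumptions for illustration, not part of any existing tooling.

```python
import json


def jsonld_script_tag(record):
    """Wrap a JSON-LD record in the <script> element used to embed it in HTML."""
    payload = json.dumps(record, indent=2)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape, so the payload still parses to the same data.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder record, for illustration only.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset (placeholder)",
}
print(jsonld_script_tag(record))
```

The resulting element can be placed in the head or body of a page, where crawlers that understand JSON-LD can pick it up.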

This can lead to connections with other data providers, which adds value with little specific effort.

aidanheerdegen · 2023-04-18T07:08:10Z

How?

Format

The standard for schema on the web is RDF. That is what is used by schema.org and Bioschemas. Bioschemas is probably the one we should be following most closely.

An example schema is the Bioschemas Dataset profile, and an example record is:

{
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "http://purl.org/dc/terms/conformsTo": { "@type": "CreativeWork", "@id": "https://bioschemas.org/profiles/Dataset/1.0-RELEASE" },
    "@id": "https://doi.org/10.5281/zenodo.5743204",
    "identifier": "10.5281/zenodo.5743204",
    "name": "RDF version of the data from Choi, JS. et al. Towards a generalized toxicity prediction model for oxide nanomaterials using integrated data from different sources (2018)",
    "description": "This is an RDFied version of the dataset published in Choi, JS., Ha, M.K., Trinh, T.X. et al. Towards a generalized toxicity prediction model for oxide nanomaterials using integrated data from different sources. Sci Rep 8, 6110 (2018). The original dataset publication DOI: https://doi.org/10.1038/s41598-018-24483-z. The Original publication authors: Jang-Sik Choi, My Kieu Ha, Tung Xuan Trinh, Tae Hyun Yoon & Hyung-Gi Byun",
    "license": "https://creativecommons.org/licenses/by/4.0/legalcode",
    "url": "https://zenodo.org/record/5743204",
    "keywords": "oxide, nanomaterial, toxicity, prediction",
    "creator": [
      {
        "@type": "Organization",
        "name": "NanoSolveIT"
      }
    ],
    "datePublished": "2021-11-30",
    "citation": { "@type": "CreativeWork", "@id": "https://doi.org/10.1038/s41598-018-24483-z", "name": "Towards a generalized toxicity prediction model for oxide nanomaterials using integrated data from different sources" }
}

Other use cases

This is all very well, but how does this map to relational databases like the ones typically used for data indexing?

In a general sense, not very well. However, the reverse mapping, from an SQL database to RDF, is more straightforward.

If we're working mostly in the relational DB/SQL space, and want the RDF mapping only for interoperability with the wider world, then that will affect how complex we let the schema become. Alternatively, we could have a strict hierarchy of schema: a tighter definition at the bottom that is interoperable with SQL, and higher-level schema with more freedom that allow for more connectivity.
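To illustrate the SQL-to-RDF direction, here is a minimal sketch that turns rows of a relational table into JSON-LD records. The table layout, the column names and the `COLUMN_TO_PROPERTY` mapping are all hypothetical, and an in-memory SQLite database stands in for whatever indexing database would actually be used.

```python
import json
import sqlite3

# Hypothetical mapping from table columns to schema.org property names.
COLUMN_TO_PROPERTY = {
    "title": "name",
    "summary": "description",
    "published": "datePublished",
}


def row_to_jsonld(row):
    """Convert one database row into a flat schema.org Dataset record."""
    record = {"@context": "https://schema.org/", "@type": "Dataset"}
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row[column]
        if value is not None:  # skip NULL columns rather than emit nulls
            record[prop] = value
    return record


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.execute("CREATE TABLE datasets (title TEXT, summary TEXT, published TEXT)")
conn.execute(
    "INSERT INTO datasets VALUES (?, ?, ?)",
    ("Example run", None, "2023-04-18"),
)
records = [row_to_jsonld(row) for row in conn.execute("SELECT * FROM datasets")]
print(json.dumps(records, indent=2))
```

A flat column-to-property mapping like this only works while the schema stays simple, which is the trade-off described above. Nested or heavily linked schema would need a richer mapping.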

dougiesquire · 2023-04-18T10:04:27Z

Thanks for providing these details and context @aidanheerdegen

(Possibly) relevant climate-data examples

METACLIP is a framework for keeping track of the provenance of climate data products. This uses the W3C PROV model for provenance interchange on the web. See http://www.metaclip.org/about and https://www.sciencedirect.com/science/article/abs/pii/S1364815218305036
ESMValTool has provenance logging that also uses W3C PROV. See https://docs.esmvaltool.org/en/latest/community/diagnostic.html#recording-provenance
The rook package (which allows remote access to climate data) also uses W3C PROV for provenance. See https://rook-wps.readthedocs.io/en/latest/prov.html
Also looks interesting: https://gitlab.dkrz.de/data-infrastructure-services/climate_data_provenance

aidanheerdegen changed the title from "Purpose and motivation" to "Purpose, motivation and implementation" on Apr 18, 2023
