From 5acfed0ab9ab3b5fd03e968c7dd700a68b27511e Mon Sep 17 00:00:00 2001
From: Heinrich Lukas Weil
Date: Fri, 17 Nov 2023 12:03:37 +0100
Subject: [PATCH 01/35] xlsx: small adjustments to inputs and outputs

---
 ISA-XLSX.md | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/ISA-XLSX.md b/ISA-XLSX.md
index 0f3afde..60abfbe 100644
--- a/ISA-XLSX.md
+++ b/ISA-XLSX.md
@@ -611,7 +611,7 @@ The content of the annotation table MUST be placed in an `xlsx table` whose name

## Inputs and Outputs

-Each annotation table sheet MUST contain an `Input` and an `Output` column, which denote the Input and Output node of the `Process` node respectively. They MUST be formatted in the pattern `Input [<NodeType>]` and `Output [<NodeType>]`.
+Each annotation table sheet MUST contain at most one `Input` and at most one `Output` column, which denote the Input and Output node of the `Process` node respectively. They MUST be formatted in the pattern `Input [<NodeType>]` and `Output [<NodeType>]`.

`NodeTypes` MUST be one of the following:

@@ -619,9 +619,7 @@ Each annotation table sheet MUST contain an `Input` and an `Output` column, whic

- A `Sample` MUST be indicated with the node type `Sample Name`.

-- An `Extract Material` MUST be indicated with the node type `Extract Name`.
-
-- A `Labeled Extract Material` MUST be indicated with the node type `Labeled Extract Name`.
+- An `Extract Material` MUST be indicated with the node type `Material Name`.

- An `Image File` MUST be indicated with the node type `Image File`.

- A `Raw Data File` MUST be indicated with the node type `Raw Data File`.

- A `Derived Data File` MUST be indicated with the node type `Derived Data File`.

-`Source Names`, `Sample Names`, `Extract Names` and `Labeled Extract Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity.
+`Source Names`, `Sample Names` and `Material Names` MUST be unique across an ARC.
If two of these entities with the same name exist in the same ARC, they are considered the same entity. `Image File`, `Raw Data File` or `Derived Data File` node types MUST correspond to a relevant file location. From aeae73e32d2e89ddd72c4f6d1a8346bd12d15da2 Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Thu, 21 Dec 2023 12:09:44 +0100 Subject: [PATCH 02/35] loosen up arc.cwl constraint from MUST to SHOULD --- ARC specification.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 8991509..d3a6ee5 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -72,7 +72,7 @@ Metadata that describe the characteristics of material and resources follow the - *Runs* capture data products (i.e., outputs of computational analyses) derived from assays, other runs, or study materials using workflows (located in the aforementioned *workflows* subdirectory). Each run is a collection of files, stored in the top-level `runs` subdirectory. It MUST be accompanied by a per-run CWL workflow description, stored in `.cwl` as further described [below](#run-description). -- *Top-level metadata and workflow description* tie together the elements of an ARC in the contexts of investigation and associated studies (in the ISA definition), captured in the file `isa.investigation.xlsx` in [ISA-XLSX format](#isa-xlsx-format), which MUST be present. Furthermore, top-level reproducibility information MUST be provided in the CWL `arc.cwl`, which also MUST exist. +- *Top-level metadata and workflow description* tie together the elements of an ARC in the contexts of investigation and associated studies (in the ISA definition), captured in the file `isa.investigation.xlsx` in [ISA-XLSX format](#isa-xlsx-format), which MUST be present. Furthermore, top-level reproducibility information SHOULD be provided in the CWL `arc.cwl`. 
All other files contained in an ARC (e.g., a `README.txt`, pre-print PDFs, additional annotation files) are referred to as *additional payload*, and MAY be located anywhere within the ARC structure. However, an ARC MUST be [reproducible](#reproducible-arcs) and [publishable](#shareable-and-publishable-arcs) even if these files are deleted. Further considerations on additional payload are described [below](#additional-payload). @@ -217,7 +217,7 @@ Note: The `investigation` file MUST follow the [ISA-XLSX investigation file specification](ISA-XLSX.md#investigation-file). -Furthermore, top-level reproducibility information MUST be provided in the CWL `arc.cwl`, which also MUST exist. +Furthermore, top-level reproducibility information SHOULD be provided in the CWL `arc.cwl`. #### Investigation and Study Metadata @@ -228,7 +228,7 @@ The study-level SHOULD define [ISA factors](https://isa-specs.readthedocs.io/en/ #### Top-Level Run Description -The file `arc.cwl` MUST exist at the root directory of each ARC. It describes which runs are executed (and specifically, their order) to (re)produce the computational outputs contained within the ARC. +The file `arc.cwl` SHOULD exist at the root directory of each ARC. It describes which runs are executed (and specifically, their order) to (re)produce the computational outputs contained within the ARC. `arc.cwl` MUST be a CWL v1.2 workflow description and adhere to the same requirements as [run descriptions](#run-description). In particular, references to study or assay data files, nested workflows MUST use relative paths. An optional file `arc.yml` MAY be provided to specify input parameters. 
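In validator terms, the MUST/SHOULD distinction introduced by this patch means a missing `isa.investigation.xlsx` is an error, while a missing `arc.cwl` now only merits a warning. A minimal sketch of such a check (the function name and return convention are illustrative, not part of the specification):

```python
import pathlib
import warnings

def check_top_level_metadata(arc_root: str) -> bool:
    """Check the top-level metadata files of an ARC root directory.

    `isa.investigation.xlsx` MUST be present; `arc.cwl` SHOULD be
    present, so its absence is only reported as a warning.
    """
    root = pathlib.Path(arc_root)
    if not (root / "isa.investigation.xlsx").is_file():
        return False  # hard requirement (MUST) violated
    if not (root / "arc.cwl").is_file():
        # soft requirement (SHOULD): warn instead of failing
        warnings.warn("arc.cwl not found: top-level reproducibility "
                      "information SHOULD be provided")
    return True
```

This mirrors only the two top-level rules quoted in the patch; a full ARC validator would cover far more.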
From 7d5df0e6bd6a17c64e516a9d49c0feb2f1451a99 Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Thu, 21 Dec 2023 16:41:53 +0100 Subject: [PATCH 03/35] add comment to annotation table --- ISA-XLSX.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/ISA-XLSX.md b/ISA-XLSX.md index 0f3afde..63db01f 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -23,6 +23,7 @@ This document describes the ISA Abstract Model reference implementation specifie - [Factors](#factors) - [Components](#components) - [Parameters](#parameters) + - [Comments](#comments) - [Examples](#examples) Below we provide the schemas and the content rules for valid ISA-XLSX documents. @@ -709,6 +710,14 @@ A `Parameter` can be used to specify any additional information about the experi |--------------------------------|--------|-------------------|------------------------------------------------------| | 300 | Kelvin | UO | [http://…/obo/UO_0000032](http://purl.obolibrary.org/obo/UO_0000032) | +## Comments + +A `Comment` can be used to provide some additional information. Columns headed with `Comment[]` MAY appear after any named node in the Annotation Table. The value MUST be free text. + +| Comment [Answer to everything] | +|--------------------------------| +| forty-two | + ## Others Columns whose headers do not follow any of the formats described above are considered additional payload and are out of the scope of this specification. 
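The `Comment[...]` headers introduced by the patch above are easy to recognise mechanically. A minimal sketch of a header parser (regex and helper name are illustrative, not part of the specification; it accepts both `Comment[x]` and `Comment [x]`, since the spec text and the example table differ in spacing):

```python
import re

# Matches `Comment[...]` with optional whitespace before the bracket,
# e.g. `Comment [Answer to everything]` from the example table.
COMMENT_HEADER = re.compile(r"^Comment\s*\[(?P<name>[^\]]+)\]$")

def parse_comment_header(header: str):
    """Return the comment name if `header` is a Comment column, else None."""
    match = COMMENT_HEADER.match(header.strip())
    return match.group("name") if match else None

# parse_comment_header("Comment [Answer to everything]")  -> "Answer to everything"
# parse_comment_header("Parameter [temperature]")         -> None
```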
From a9bd993b5e60e2da1a487c9885ed485d31541ef4 Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Fri, 22 Dec 2023 14:21:30 +0100 Subject: [PATCH 04/35] add data path annotation section to ARC specification --- ARC specification.md | 32 ++++++++++++++++++++++++++++++++ ISA-XLSX.md | 2 +- 2 files changed, 33 insertions(+), 1 deletion(-) diff --git a/ARC specification.md b/ARC specification.md index 8991509..a2af33b 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -26,6 +26,7 @@ Licensed under the Creative Commons License CC BY, Version 4.0; you may not use - [Top-level Metadata and Workflow Description](#top-level-metadata-and-workflow-description) - [Investigation and Study Metadata](#investigation-and-study-metadata) - [Top-Level Run Description](#top-level-run-description) + - [Data Path Annotation](#data-path-annotation) - [Shareable and Publishable ARCs](#shareable-and-publishable-arcs) - [Reproducible ARCs](#reproducible-arcs) - [Mechanism for Quality Control of ARCs](#mechanism-for-quality-control-of-arcs) @@ -232,6 +233,37 @@ The file `arc.cwl` MUST exist at the root directory of each ARC. It describes wh `arc.cwl` MUST be a CWL v1.2 workflow description and adhere to the same requirements as [run descriptions](#run-description). In particular, references to study or assay data files, nested workflows MUST use relative paths. An optional file `arc.yml` MAY be provided to specify input parameters. +### Data Path Annotation + +All metadata references to files or directories located inside the ARC MUST follow the following patterns: + +- The `general pattern`, which is always applicable and SHOULD always be used is to specify the path relative to the ARC root + +- The `folder specific pattern`, which MAY be used. 
This pattern dependes on the metadata context: + - Data nodes in `isa.assay.xlsx` files: The path MAY be specified relative to the `dataset` sub-folder of the assay + - Data nodes in `isa.study.xlsx` files: The path MAY be specified relative to the `resources` sub-folder of the study + +#### Examples + +In this example, there are two `assays`, with `Assay1`containing a measurement of a `Source` material, producing an output `Raw Data file`. `Assay2` references this `Data file` for producing a new `Derived Data File` + +Use of `general pattern` relative paths from the arc root folder: + +`assays/Assay1/isa.assay.xlsx`: + +| Input [Source Name] | Parameter[Instrument model] | Output [Raw Data File] | +|-------------|---------------------------------|----------------------------------| +| input | Bruker 500 Avance | assays/Assay1/dataset/measurement.txt | + +`assays/Assay2/isa.assay.xlsx`: + +| Input [Raw Data File] | Parameter[script file] | Output [Derived Data File] | +|----------------------------------|---------------------------------|----------------------------------| +| assays/Assay1/dataset/measurement.txt | assays/Assay2/dataset/script.sh | assays/Assay2/dataset/result.txt | + + + + ## Shareable and Publishable ARCs ARCs can be shared in any state. They are considered *publishable* (e.g. for the purpose of minting a DOI) when fulfilling the following conditions: diff --git a/ISA-XLSX.md b/ISA-XLSX.md index 0f3afde..2de5658 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -631,7 +631,7 @@ Each annotation table sheet MUST contain an `Input` and an `Output` column, whic `Source Names`, `Sample Names`, `Extract Names` and `Labeled Extract Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity. -`Image File`, `Raw Data File` or `Derived Data File` node types MUST correspond to a relevant file location. 
+`Image File`, `Raw Data File` or `Derived Data File` node types MUST correspond to a relevant file location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. ## Protocol Columns From 0d83c77bf0ef558f330cb1b01e85ca33c9a01ba9 Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Sun, 24 Dec 2023 20:17:06 +0100 Subject: [PATCH 05/35] small change to data path annotation specification --- ARC specification.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ARC specification.md b/ARC specification.md index a2af33b..99c95ed 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -237,7 +237,7 @@ The file `arc.cwl` MUST exist at the root directory of each ARC. It describes wh All metadata references to files or directories located inside the ARC MUST follow the following patterns: -- The `general pattern`, which is always applicable and SHOULD always be used is to specify the path relative to the ARC root +- The `general pattern`, which is universally applicable and SHOULD be used is to specify the path relative to the ARC root - The `folder specific pattern`, which MAY be used. This pattern dependes on the metadata context: - Data nodes in `isa.assay.xlsx` files: The path MAY be specified relative to the `dataset` sub-folder of the assay From 8c2bd48b9e8407176f2f583280d40291d0b7a44a Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Tue, 2 Jan 2024 10:30:42 +0100 Subject: [PATCH 06/35] clarify constraint for folder specific pattern --- ARC specification.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 99c95ed..4b3d04a 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -237,9 +237,9 @@ The file `arc.cwl` MUST exist at the root directory of each ARC. 
It describes wh All metadata references to files or directories located inside the ARC MUST follow the following patterns: -- The `general pattern`, which is universally applicable and SHOULD be used is to specify the path relative to the ARC root +- The `general pattern`, which is universally applicable and SHOULD be used is to specify the path relative to the ARC root. -- The `folder specific pattern`, which MAY be used. This pattern dependes on the metadata context: +- The `folder specific pattern`, which MAY be used only in specific metadata contexts: - Data nodes in `isa.assay.xlsx` files: The path MAY be specified relative to the `dataset` sub-folder of the assay - Data nodes in `isa.study.xlsx` files: The path MAY be specified relative to the `resources` sub-folder of the study From 0acd515b0be737001598f6be2eaaec5b5d8ae495 Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Tue, 9 Jan 2024 16:57:19 +0100 Subject: [PATCH 07/35] fix a typo in the STUDY CONTACTS table section --- ISA-XLSX.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/ISA-XLSX.md b/ISA-XLSX.md index 0f3afde..6c260c1 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -477,22 +477,20 @@ For example, the `STUDY PROTOCOLS` section of an ISA-XLSX `isa.investigation.xls | Study Protocol Components Type Term Accession Number | http://purl.obolibrary.org/obo/NCIT_C68796 | | ;;http://purl.obolibrary.org/obo/MS_1002732 | Study Protocol Components Type Term Source REF | NCIT | | ;;MS - ### STUDY CONTACTS This section MUST contain zero or more values. 
This section MUST contain the following labels, with the specified datatypes for values supported: -| Label | Datatype | Description | +| Label | Datatype | Description | |------------------------------------------|---------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Study Person Last Name | String | The last name of a person associated with the study. | | Study Person First Name | String | Study Person Name | -| Study Person Mid Initials | String | The middle initials of a person associated with the study. -| +| Study Person Mid Initials | String | The middle initials of a person associated with the study. | | Study Person Email | String formatted as email | The email address of a person associated with the study. | | Study Person Phone | String | The telephone number of a person associated with the study. | -| IStudy Person Fax | String | The fax number of a person associated with the study. | +| Study Person Fax | String | The fax number of a person associated with the study. | | Study Person Address | String | The address of a person associated with the study. | | Study Person Affiliation | String | The organization affiliation for a person associated with the study. 
| | Study Person Roles | String or Ontology Annotation if accompanied by Term Accession Numbers and Term Source REFs | Term to classify the role(s) performed by this person in the context of the study, which means that the roles reported here need not correspond to roles held withing their affiliated organization. Multiple annotations or values attached to one person can be provided by using a semicolon (“;”) Unicode (U0003+B) as a separator (e.g.: submitter;funder;sponsor) .The term can be free text or from, for example, a controlled vocabulary or an ontology. If the latter source is used the Term Accession Number and Term Source REF fields below are required. | From ae969c2ac6d9ad400a84e66eb5c21ef37934e63a Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Fri, 12 Jan 2024 09:30:36 +0100 Subject: [PATCH 08/35] fix typos in isa-xlsx --- ISA-XLSX.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/ISA-XLSX.md b/ISA-XLSX.md index f40e12b..25bdf3c 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -534,7 +534,7 @@ This section MUST contain the following labels, with the specified datatypes for | Label | Datatype | Description | |----------------------------------------------------|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Assay Measurement Type | String | A term to qualify the endpoint, or what is being measured (e.g. gene expression profiling or protein identification). The term can be free text or from, for example, a controlled vocabulary or an ontology. If the latter source is used the Term Accession Number and Term Source REF fields below are required. 
| -| Study Assay Measurement Type Term Accession Number | String | The accession number from the Term Source associated with the selected term. | +| Assay Measurement Type Term Accession Number | String | The accession number from the Term Source associated with the selected term. | | Assay Measurement Type Term Source REF | String | The Source REF has to match one of the Term Source Name declared in the Ontology Source Reference section. | | Assay Technology Type | String | Term to identify the technology used to perform the measurement, e.g. DNA microarray, mass spectrometry. The term can be free text or from, for example, a controlled vocabulary or an ontology. If the latter source is used the Term Accession Number and Term Source REF fields below are required. | | Assay Technology Type Term Accession Number | String | The accession number from the Term Source associated with the selected term. | @@ -647,9 +647,9 @@ Each annotation table sheet MUST contain at most one `Input` and at most one `Ou Where a value is an `Ontology Annotation` in a table file, `Term Accession Number` and `Term Source REF` fields MUST follow the column cell in which the value is entered. These two columns SHOULD contain further ontological information about the header. In this case, following the static header string, separated by a single space, there MUST be a short ontology term identifier formatted as CURIEs (prefixed identifiers) of the form `:` (specified [here](http://obofoundry.org/id-policy)) inside `()` brackets. 
For example, a characteristic type `organism` with a value of `Homo sapiens` can be qualified with an `Ontology Annotation` of a term from NCBI Taxonomy as follows: -| Characteristics [organism] | Term Source REF (OBI:0100026) | Term Accession Number (OBI:0100026) | +| Characteristic [organism] | Term Source REF (OBI:0100026) | Term Accession Number (OBI:0100026) | |-----------------------------|-------------------|------------------------------------------------------| -| Homo sapiens | NCBITaxon | [http://…/NCBITAXON/9606](http://.../NCBITAXON/9606) | +| Homo sapiens | NCBITaxon | [http://…/NCBITAXON_9606](http://.../NCBITAXON_9606) | An `Ontology Annotation` MAY be applied to any appropriate `Characteristic`, `Parameter`, `Factor`, `Component` or `Protocol Type`. From 591c24c4b8f1113efe8417d18dfd68798ef7b63d Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Tue, 23 Jan 2024 11:07:43 +0100 Subject: [PATCH 09/35] make annotation table comment refer only to table --- ISA-XLSX.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ISA-XLSX.md b/ISA-XLSX.md index 63db01f..c1798b2 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -712,7 +712,7 @@ A `Parameter` can be used to specify any additional information about the experi ## Comments -A `Comment` can be used to provide some additional information. Columns headed with `Comment[]` MAY appear after any named node in the Annotation Table. The value MUST be free text. +A `Comment` can be used to provide some additional information. Columns headed with `Comment[]` MAY appear anywhere in the Annotation Table. The comment always refers to the Annotation Table. The value MUST be free text. 
| Comment [Answer to everything] | |--------------------------------| From b33671035a40a03db8d646caae6702b359cb7e9a Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Tue, 23 Jan 2024 17:48:25 +0100 Subject: [PATCH 10/35] start inclusion of data selectors with some links --- ISA-XLSX.md | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/ISA-XLSX.md b/ISA-XLSX.md index bf4caf6..82edff3 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -620,15 +620,27 @@ Each annotation table sheet MUST contain at most one `Input` and at most one `Ou - An `Extract Material` MUST be indicated with the node type `Material Name`. -- An `Image File` MUST be indicated with the node type `Image File`. +- A `Data` object MUST be indicated with the node type `Data`. -- A `Raw Data File` MUST be indicated with the node type `Raw Data File`. +`Source Names`, `Sample Names`, `Material Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity. -- A `Derived Data File` MUST be indicated with the node type `Derived Data File`. +The `Data` node type MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. If the annotation of the `Data` node refers not to the complete resource, but a part of it, a `Selector` MAY added. This Selector MUST be separated from the location using a `#`— with no whitespace between: `location#selector`. -`Source Names`, `Sample Names`, `Material Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity. +`Data Format` SHOULD be expressed using a [MIME format](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), most commonly consisting of two parts: a type and a subtype, separated by a slash (/) — with no whitespace between: `type/subtype`. 
If appropriate, a format from the list composed by [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml) +SHOULD be picked. Unregistered or niche encoding and file formats MAY be indicated instead via the most appropriate URL. + +## Examples + +In this example, there are two `assays`, with `Assay1`containing a measurement of a `Source` material, producing an output `Raw Data file`. `Assay2` references this `Data file` for producing a new `Derived Data File` + +Use of `general pattern` relative paths from the arc root folder: + +`assays/Assay1/isa.assay.xlsx`: -`Image File`, `Raw Data File` or `Derived Data File` node types MUST correspond to a relevant file location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. +| Input [Sample Name] | Output [Data] | Output Data Format | Output Data Selector | +|-------------|---------------------------------|----------------------------------|--| +| input1 | result.csv#col=1 | text/csv | https://datatracker.ietf.org/doc/html/rfc7111 | +| input2 | result.csv#col=2 | text/csv | https://datatracker.ietf.org/doc/html/rfc7111 | ## Protocol Columns From 91644edfd2d257d828899ff52f24dcf1a713b60f Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Wed, 24 Jan 2024 09:27:38 +0100 Subject: [PATCH 11/35] finish up first draft of new data columns --- ISA-XLSX.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/ISA-XLSX.md b/ISA-XLSX.md index 82edff3..6f208a6 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -624,11 +624,15 @@ Each annotation table sheet MUST contain at most one `Input` and at most one `Ou `Source Names`, `Sample Names`, `Material Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity. -The `Data` node type MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. 
If the annotation of the `Data` node refers not to the complete resource, but a part of it, a `Selector` MAY added. This Selector MUST be separated from the location using a `#`— with no whitespace between: `location#selector`. +The `Data` node type MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. If the annotation of the `Data` node refers not to the complete resource, but a part of it, a `Selector` MAY be added. This Selector MUST be separated from the resource location using a `#`— with no whitespace between: `location#selector`. If appropriate, the Selector SHOULD be formatted according to IRI fragment selectors specified by [W3](https://www.w3.org/TR/annotation-model/#fragment-selector). -`Data Format` SHOULD be expressed using a [MIME format](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), most commonly consisting of two parts: a type and a subtype, separated by a slash (/) — with no whitespace between: `type/subtype`. If appropriate, a format from the list composed by [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml) +The format of the data resource MAY be further qualified using a `Data Format` column. The `Data Format` SHOULD be expressed using a [MIME format](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), most commonly consisting of two parts: a type and a subtype, separated by a slash (/) — with no whitespace between: `type/subtype`. If appropriate, a format from the list composed by [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml) SHOULD be picked. Unregistered or niche encoding and file formats MAY be indicated instead via the most appropriate URL. +The format and usage info about the selector MAY be further qualified using a `Data Selector` column. 
The `Data Selector` SHOULD point to a web resource containing instructions about how the selector is formatted and how it should be interpreted. + + + ## Examples In this example, there are two `assays`, with `Assay1`containing a measurement of a `Source` material, producing an output `Raw Data file`. `Assay2` references this `Data file` for producing a new `Derived Data File` @@ -637,7 +641,7 @@ Use of `general pattern` relative paths from the arc root folder: `assays/Assay1/isa.assay.xlsx`: -| Input [Sample Name] | Output [Data] | Output Data Format | Output Data Selector | +| Input [Sample Name] | Output [Data] | Data Format | Data Selector | |-------------|---------------------------------|----------------------------------|--| | input1 | result.csv#col=1 | text/csv | https://datatracker.ietf.org/doc/html/rfc7111 | | input2 | result.csv#col=2 | text/csv | https://datatracker.ietf.org/doc/html/rfc7111 | From 6a4eefa1dc9afe20c1e87f73a86b1ed5cc824282 Mon Sep 17 00:00:00 2001 From: Heinrich Lukas Weil Date: Wed, 24 Jan 2024 09:52:09 +0100 Subject: [PATCH 12/35] adjust examples according to data specification changes --- ARC specification.md | 8 +++++--- ISA-XLSX.md | 8 +++----- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index b4b61d5..1ac80d7 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -245,19 +245,21 @@ All metadata references to files or directories located inside the ARC MUST foll #### Examples -In this example, there are two `assays`, with `Assay1`containing a measurement of a `Source` material, producing an output `Raw Data file`. `Assay2` references this `Data file` for producing a new `Derived Data File` +##### General Pattern + +In this example, there are two `assays`, with `Assay1`containing a measurement of a `Source` material, producing an output `Data`. `Assay2` references this `Data` for producing a new `Data`. 
Use of `general pattern` relative paths from the arc root folder: `assays/Assay1/isa.assay.xlsx`: -| Input [Source Name] | Parameter[Instrument model] | Output [Raw Data File] | +| Input [Source Name] | Parameter[Instrument model] | Output [Data] | |-------------|---------------------------------|----------------------------------| | input | Bruker 500 Avance | assays/Assay1/dataset/measurement.txt | `assays/Assay2/isa.assay.xlsx`: -| Input [Raw Data File] | Parameter[script file] | Output [Derived Data File] | +| Input [Data] | Parameter[script file] | Output [Data] | |----------------------------------|---------------------------------|----------------------------------| | assays/Assay1/dataset/measurement.txt | assays/Assay2/dataset/script.sh | assays/Assay2/dataset/result.txt | diff --git a/ISA-XLSX.md b/ISA-XLSX.md index 6f208a6..2412e68 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -629,17 +629,15 @@ The `Data` node type MUST correspond to a relevant data resource location, follo The format of the data resource MAY be further qualified using a `Data Format` column. The `Data Format` SHOULD be expressed using a [MIME format](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), most commonly consisting of two parts: a type and a subtype, separated by a slash (/) — with no whitespace between: `type/subtype`. If appropriate, a format from the list composed by [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml) SHOULD be picked. Unregistered or niche encoding and file formats MAY be indicated instead via the most appropriate URL. -The format and usage info about the selector MAY be further qualified using a `Data Selector` column. The `Data Selector` SHOULD point to a web resource containing instructions about how the selector is formatted and how it should be interpreted. - +The format and usage info about the Selector MAY be further qualified using a `Data Selector` column. 
The `Data Selector` SHOULD point to a web resource containing instructions about how the Selector is formatted and how it should be interpreted.

## Examples

### Data Location and Selector

In this example, there is a measurement of two `Samples`, namely `input1` and `input2`. The values measured are both written into the same data resource in the location `result.csv`, whose formatting is tabular, according to the `Data Format` being `text/csv`. To distinguish between the measurement values stemming from the different inputs, selectors were added to the resource location (separated by a `#`), namely `col=1` and `col=2`. The specification about the formatting of these selectors can be found in the provided link, namely `https://datatracker.ietf.org/`.


| Input [Sample Name] | Output [Data] | Data Format | Data Selector |
|-------------|---------------------------------|----------------------------------|--|
| input1 | result.csv#col=1 | text/csv | https://datatracker.ietf.org/doc/html/rfc7111 |
| input2 | result.csv#col=2 | text/csv | https://datatracker.ietf.org/doc/html/rfc7111 |

From 1f6e246df6d7c1b98c5d4a3206340d130cde6a39 Mon Sep 17 00:00:00 2001
From: Heinrich Lukas Weil
Date: Wed, 24 Jan 2024 10:12:08 +0100
Subject: [PATCH 13/35] change Data Selector to Data Selector Format

---
 ISA-XLSX.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ISA-XLSX.md b/ISA-XLSX.md
index 2412e68..04d8e8a 100644
--- a/ISA-XLSX.md
+++ b/ISA-XLSX.md
@@ -629,7 +629,7 @@ The `Data` node type MUST correspond to a relevant data resource location, follo
The format of the data resource MAY be further qualified using a `Data Format` column.
The `Data Format` SHOULD be expressed using a [MIME format](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), most commonly consisting of two parts: a type and a subtype, separated by a slash (/) — with no whitespace between: `type/subtype`. If appropriate, a format from the list composed by [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml) SHOULD be picked. Unregistered or niche encoding and file formats MAY be indicated instead via the most appropriate URL.

-The format and usage info about the Selector MAY be further qualified using a `Data Selector` column. The `Data Selector` SHOULD point to a web resource containing instructions about how the Selector is formatted and how it should be interpreted.
+The format and usage info about the Selector MAY be further qualified using a `Data Selector Format` column. The `Data Selector Format` SHOULD point to a web resource containing instructions about how the Selector is formatted and how it should be interpreted.

## Examples

### Data Location and Selector

In this example, there is a measurement of two `Samples`, namely `input1` and `input2`. The values measured are both written into the same data resource in the location `result.csv`, whose formatting is tabular, according to the `Data Format` being `text/csv`. To distinguish between the measurement values stemming from the different inputs, selectors were added to the resource location (separated by a `#`), namely `col=1` and `col=2`. The specification about the formatting of these selectors can be found in the provided link, namely `https://datatracker.ietf.org/`.
-| Input [Sample Name] | Output [Data] | Data Format | Data Selector | +| Input [Sample Name] | Output [Data] | Data Format | Data Selector Format | |-------------|---------------------------------|----------------------------------|--| | input1 | result.csv#col=1 | text/csv | https://datatracker.ietf.org/doc/html/rfc7111 | | input2 | result.csv#col=2 | text/csv | https://datatracker.ietf.org/doc/html/rfc7111 | From f02c08712e597bb899612b1b13be304a6bf73b4c Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Wed, 24 Jan 2024 13:56:58 +0100 Subject: [PATCH 14/35] wip validation specs --- ARC specification.md | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index b4b61d5..577b008 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -11,6 +11,7 @@ Licensed under the Creative Commons License CC BY, Version 4.0; you may not use ## Table of Contents - [Annotated Research Context Specification, v1.2](#annotated-research-context-specification-v12) + - [Table of Contents](#table-of-contents) - [Introduction](#introduction) - [Extensions](#extensions) - [ARC Structure and Content](#arc-structure-and-content) @@ -27,9 +28,12 @@ Licensed under the Creative Commons License CC BY, Version 4.0; you may not use - [Investigation and Study Metadata](#investigation-and-study-metadata) - [Top-Level Run Description](#top-level-run-description) - [Data Path Annotation](#data-path-annotation) + - [Examples](#examples) - [Shareable and Publishable ARCs](#shareable-and-publishable-arcs) - [Reproducible ARCs](#reproducible-arcs) - [Mechanism for Quality Control of ARCs](#mechanism-for-quality-control-of-arcs) + - [Structure of the validation branch](#structure-of-the-validation-branch) + - [Structure of the validation\_targets.yml file](#structure-of-the-validation_targetsyml-file) - [Best Practices](#best-practices) - [Community Specific Data Formats](#community-specific-data-formats) - [Compression 
and Encryption](#compression-and-encryption) @@ -298,9 +302,19 @@ Reproducibility of ARCs refers mainly to its *runs*. Within an ARC, it MUST be p ## Mechanism for Quality Control of ARCs -ARCs are supposed to be living research objects and are as such never complete. Nevertheless, a mechanism to report the current state and quality of an ARC is indispensable. Therefore, ARCs will be scored according to the amount of metadata information available (specifically with established minimal metadata standards such as MinSeqE, MIAPPE, etc.), the quality of data and metadata (this metric will be established in the next version), manual curation and review, and the reuse of ARCs by other researchers measured by physical includes of the data and referencing. +ARCs are supposed to be living research objects and are as such never complete. Nevertheless, a mechanism to continuously report the current state and quality of an ARC is indispensable. This process is further referred to as _validation_ of the ARC against a _target_, where the _target_ is a arbitrary set of validation cases that the ARC MUST pass to qualify as _valid_ in regard to the _target_. A reference implementation of a framework to create and run targets for ARC validation is provided in the [arc-validate repository](). -To foster FAIRification, badges will be earned by reaching certain scores for a transparent review process. +ARCs MAY be validated against 0 or more targets defined in a [validation_targets.yml file](#structure-of-the-validation_targetsyml-file), where the folowing criteria MUST be met for each target: +- the target MUST have a unique _name_ across all validation targets used the ARC. +- the target MUST create a `validation_report.*` file that summarizes the results of validating the ARC against the cases defined in the target. 
The format of this file is arbitrary, but SHOULD be of an established test result format such as [JUnit XML](https://github.com/windyroad/JUnit-Schema) or [TAP](https://testanything.org/). +- the target MUST create a `badge.svg` file that visually summarizes the results of validating the ARC against the cases defined in the target. The information displayed SHOULD be derivable from the `validation_report.*` file and MUST include the _name_ of the target. + + +### Structure of the validation branch + +To make sure that the result of validating ARCs are + +### Structure of the validation_targets.yml file ## Best Practices From 0831e8d230170354d93d07b8bc91e94b25c5806f Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Thu, 25 Jan 2024 09:33:29 +0100 Subject: [PATCH 15/35] add validation branch and targets file specs --- ARC specification.md | 82 +++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 77 insertions(+), 5 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 577b008..70b29d4 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -304,18 +304,90 @@ Reproducibility of ARCs refers mainly to its *runs*. Within an ARC, it MUST be p ARCs are supposed to be living research objects and are as such never complete. Nevertheless, a mechanism to continuously report the current state and quality of an ARC is indispensable. This process is further referred to as _validation_ of the ARC against a _target_, where the _target_ is a arbitrary set of validation cases that the ARC MUST pass to qualify as _valid_ in regard to the _target_. A reference implementation of a framework to create and run targets for ARC validation is provided in the [arc-validate repository](). 
ARCs MAY be validated against 0 or more targets defined in a [validation_targets.yml file](#structure-of-the-validation_targetsyml-file), where the following criteria MUST be met for each target:
- the target MUST have a unique _name_ across all validation targets used in the ARC. This name MUST be used for specifying the target in the [validation_targets.yml file](#structure-of-the-validation_targetsyml-file) and for the subfolder names in the [validation branch](#structure-of-the-validation-branch).
- the target MUST create a `validation_report.*` file that summarizes the results of validating the ARC against the cases defined in the target. The format of this file SHOULD be of an established test result format such as [JUnit XML](https://github.com/windyroad/JUnit-Schema) or [TAP](https://testanything.org/).
- the target MUST create a `badge.svg` file that visually summarizes the results of validating the ARC against the cases defined in the target. The information displayed SHOULD be derivable from the `validation_report.*` file and MUST include the _name_ of the target.

### Structure of the validation branch

To make sure that validation results are bundled with the ARC but do not pollute the commit history, validation results MUST be stored in a separate branch of the ARC repository.
This branch: +- MUST be named `validation` +- MUST be an orphan branch +- MUST NOT be merged into the `main` branch. +- MUST contain the following folder structure: + + `{$branch}/{$commithash}/{$target}`: + + ``` + validation branch root + └── {$branch} + └── {$commithash} + └── {$target} + ├── badge.svg + └── validation_report.xml + ``` + + where: + - `{$branch}` is the name of the branch the validation was run on + - `{$commithash}` is the full hash of the commit the validation was run on. See + - `{$target}` is the name of the target the validation was run against. This folder then MUST contain the files `validation_report.*` and `badge.svg` as described [above](#mechanism-for-quality-control-of-arcs). + + example: + + > This example shows the validation results of the `main` and `branch-1` branches of the ARC repository against the `target1` and `target2` targets for two commits per branch: + + ``` + validation-branch-root + ├── branch-1 + │ ├── ca82a6dff817ec66f44342007202690a93763949 + │ │ ├── target1 + │ │ │ ├── badge.svg + │ │ │ └── validation_report.xml + │ │ └── target2 + │ │ ├── badge.svg + │ │ └── validation_report.xml + │ └── 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7 + │ ├── target1 + │ │ ├── badge.svg + │ │ └── validation_report.xml + │ └── target2 + │ ├── badge.svg + │ └── validation_report.xml + └── main + ├── 1234567890abcdef1234567890abcdef12345678 + │ ├── target1 + │ │ ├── badge.svg + │ │ └── validation_report.xml + │ └── target2 + │ ├── badge.svg + │ └── validation_report.xml + └── a11bef06a3f659402fe7563abf99ad00de2209e6 + ├── target1 + │ ├── badge.svg + │ └── validation_report.xml + └── target2 + ├── badge.svg + └── validation_report.xml + ``` ### Structure of the validation_targets.yml file +The `validation_targets.yml` specifies the targets that the branch containing the file will be validated against. Each branch of an ARC MAY contain 0 or 1 `validation_targets.yml` file. 
If the file is present, it: + - MUST be located in an `.arc` folder in the root of the ARC + - MUST contain the `targets` key which is a list of target names that the current branch will be validated against. + +example: + +> This example shows a `validation_targets.yml` file that specifies that the current branch will be validated against the `target1` and `target2` targets. + +```yaml +targets: + - target1 + - target2 +``` + + ## Best Practices In the next section we provide you with Best Practices to make the use of an ARC even more efficient and valuable for open science. From 5f3acffacbe64709527f08b65d64d5c6bce099cb Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Thu, 25 Jan 2024 09:39:46 +0100 Subject: [PATCH 16/35] typo fix --- ARC specification.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 70b29d4..6cd403d 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -302,7 +302,7 @@ Reproducibility of ARCs refers mainly to its *runs*. Within an ARC, it MUST be p ## Mechanism for Quality Control of ARCs -ARCs are supposed to be living research objects and are as such never complete. Nevertheless, a mechanism to continuously report the current state and quality of an ARC is indispensable. This process is further referred to as _validation_ of the ARC against a _target_, where the _target_ is a arbitrary set of validation cases that the ARC MUST pass to qualify as _valid_ in regard to the _target_. A reference implementation of a framework to create and run targets for ARC validation is provided in the [arc-validate repository](). +ARCs are supposed to be living research objects and are as such never complete. Nevertheless, a mechanism to continuously report the current state and quality of an ARC is indispensable. 
This process is further referred to as _validation_ of the ARC against a _target_, where the _target_ is an arbitrary set of validation cases that the ARC MUST pass to qualify as _valid_ in regard to the _target_. A reference implementation of a framework to create and run targets for ARC validation is provided in the [arc-validate repository](). ARCs MAY be validated against 0 or more targets defined in a [validation_targets.yml file](#structure-of-the-validation_targetsyml-file), where the following criteria MUST be met for each target: - the target MUST have a unique _name_ across all validation targets used the ARC. This name MUST used for specifying the target in the [validation_targets.yml file](#structure-of-the-validation_targetsyml-file) and the subfolder names in the [validation branch](#structure-of-the-validation-branch) @@ -387,7 +387,6 @@ targets: - target2 ``` - ## Best Practices In the next section we provide you with Best Practices to make the use of an ARC even more efficient and valuable for open science. From 895261f8f1f9e1707f9885bec2487cec6d079667 Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Thu, 25 Jan 2024 09:40:46 +0100 Subject: [PATCH 17/35] add missing arc-validate link --- ARC specification.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ARC specification.md b/ARC specification.md index 6cd403d..6407221 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -302,7 +302,7 @@ Reproducibility of ARCs refers mainly to its *runs*. Within an ARC, it MUST be p ## Mechanism for Quality Control of ARCs -ARCs are supposed to be living research objects and are as such never complete. Nevertheless, a mechanism to continuously report the current state and quality of an ARC is indispensable. This process is further referred to as _validation_ of the ARC against a _target_, where the _target_ is an arbitrary set of validation cases that the ARC MUST pass to qualify as _valid_ in regard to the _target_. 
A reference implementation of a framework to create and run targets for ARC validation is provided in the [arc-validate repository](https://github.com/nfdi4plants/arc-validate).

ARCs MAY be validated against 0 or more targets defined in a [validation_targets.yml file](#structure-of-the-validation_targetsyml-file), where the following criteria MUST be met for each target:
- the target MUST have a unique _name_ across all validation targets used in the ARC. This name MUST be used for specifying the target in the [validation_targets.yml file](#structure-of-the-validation_targetsyml-file) and for the subfolder names in the [validation branch](#structure-of-the-validation-branch)

From d2a7d681092620205c636eb88a5fcf295ee94fe2 Mon Sep 17 00:00:00 2001
From: Kevin Schneider
Date: Thu, 25 Jan 2024 09:43:53 +0100
Subject: [PATCH 18/35] add a reference to git manual regarding orphan branches

---
 ARC specification.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ARC specification.md b/ARC specification.md
index 6407221..d3de18b 100644
--- a/ARC specification.md
+++ b/ARC specification.md
@@ -313,7 +313,7 @@

To make sure that validation results are bundled with the ARC but do not pollute the commit history, validation results MUST be stored in a separate branch of the ARC repository.
This branch: - MUST be named `validation` -- MUST be an orphan branch +- MUST be an [orphan branch](https://git-scm.com/docs/git-checkout#Documentation/git-checkout.txt---orphanltnew-branchgt) - MUST NOT be merged into the `main` branch. - MUST contain the following folder structure: From c385cae3147e2a285170695a42934bd3b58947bd Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Thu, 25 Jan 2024 09:48:09 +0100 Subject: [PATCH 19/35] remove unnecessary space --- ARC specification.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ARC specification.md b/ARC specification.md index d3de18b..225629c 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -335,7 +335,7 @@ To make sure that validation results are bundled with the ARC but do not pollute example: - > This example shows the validation results of the `main` and `branch-1` branches of the ARC repository against the `target1` and `target2` targets for two commits per branch: + > This example shows the validation results of the `main` and `branch-1` branches of the ARC repository against the `target1` and `target2` targets for two commits per branch: ``` validation-branch-root From 350f07c0d36f168c3b3fa04e9a3a06751c8e79ba Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Tue, 30 Jan 2024 16:49:40 +0100 Subject: [PATCH 20/35] Add Validation section, Fix header levels --- ARC specification.md | 238 +++++++++++++++++++++++++++---------------- 1 file changed, 149 insertions(+), 89 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 225629c..2bf2440 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -8,39 +8,45 @@ This specification is Copyright 2022 by [DataPLANT](https://nfdi4plants.de). Licensed under the Creative Commons License CC BY, Version 4.0; you may not use this file except in compliance with the License. You may obtain a copy of the License at https://creativecommons.org/about/cclicenses/. 
This license allows re-users to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. Credit must be given to the creator. -## Table of Contents +# Table of Contents - [Annotated Research Context Specification, v1.2](#annotated-research-context-specification-v12) - - [Table of Contents](#table-of-contents) - - [Introduction](#introduction) - - [Extensions](#extensions) - - [ARC Structure and Content](#arc-structure-and-content) - - [High-Level Schema](#high-level-schema) - - [Example ARC structure](#example-arc-structure) - - [ARC Representation](#arc-representation) - - [ISA-XLSX Format](#isa-xlsx-format) - - [Study and Resources](#study-and-resources) - - [Assay Data and Metadata](#assay-data-and-metadata) - - [Workflow Description](#workflow-description) - - [Run Description](#run-description) - - [Additional Payload](#additional-payload) - - [Top-level Metadata and Workflow Description](#top-level-metadata-and-workflow-description) - - [Investigation and Study Metadata](#investigation-and-study-metadata) - - [Top-Level Run Description](#top-level-run-description) - - [Data Path Annotation](#data-path-annotation) - - [Examples](#examples) - - [Shareable and Publishable ARCs](#shareable-and-publishable-arcs) - - [Reproducible ARCs](#reproducible-arcs) - - [Mechanism for Quality Control of ARCs](#mechanism-for-quality-control-of-arcs) - - [Structure of the validation branch](#structure-of-the-validation-branch) - - [Structure of the validation\_targets.yml file](#structure-of-the-validation_targetsyml-file) - - [Best Practices](#best-practices) - - [Community Specific Data Formats](#community-specific-data-formats) - - [Compression and Encryption](#compression-and-encryption) - - [Directory and File Naming Conventions](#directory-and-file-naming-conventions) - - [Appendix: Conversion of ARCs to RO Crates](#appendix-conversion-of-arcs-to-ro-crates) - -## 
Introduction +- [Table of Contents](#table-of-contents) +- [Introduction](#introduction) + - [Extensions](#extensions) +- [ARC Structure and Content](#arc-structure-and-content) + - [High-Level Schema](#high-level-schema) + - [Example ARC structure](#example-arc-structure) + - [ARC Representation](#arc-representation) + - [ISA-XLSX Format](#isa-xlsx-format) + - [Study and Resources](#study-and-resources) + - [Assay Data and Metadata](#assay-data-and-metadata) + - [Workflow Description](#workflow-description) + - [Run Description](#run-description) + - [Additional Payload](#additional-payload) + - [Top-level Metadata and Workflow Description](#top-level-metadata-and-workflow-description) + - [Investigation and Study Metadata](#investigation-and-study-metadata) + - [Top-Level Run Description](#top-level-run-description) + - [Data Path Annotation](#data-path-annotation) + - [Examples](#examples) +- [Shareable and Publishable ARCs](#shareable-and-publishable-arcs) +- [Reproducible ARCs](#reproducible-arcs) +- [Mechanisms for ARC Quality Control](#mechanisms-for-arc-quality-control) + - [Validation](#validation) + - [Validation cases](#validation-cases) + - [Validation packages](#validation-packages) + - [Reference implementation](#reference-implementation) + - [Continuous quality control](#continuous-quality-control) + - [The cqc branch](#the-cqc-branch) + - [The validation\_packages.yml file](#the-validation_packagesyml-file) + - [Reference implementation](#reference-implementation-1) +- [Best Practices](#best-practices) + - [Community Specific Data Formats](#community-specific-data-formats) + - [Compression and Encryption](#compression-and-encryption) + - [Directory and File Naming Conventions](#directory-and-file-naming-conventions) +- [Appendix: Conversion of ARCs to RO Crates](#appendix-conversion-of-arcs-to-ro-crates) + +# Introduction This document describes a specification for a standardized way of creating a working environment and packaging file-based 
research data and necessary additional contextual information for working, collaboration, preservation, reproduction, re-use, and archiving as well as distribution. This organization unit is named *Annotated Research Context* (ARC) and is designed to be both human and machine actionable. @@ -54,17 +60,17 @@ This specification is intended as a practical guide for software authors to crea The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in [RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119). This specification is based on the [ISA model](https://isa-specs.readthedocs.io/en/latest/isamodel.html) and the [Common Workflow Specification (v1.2)](https://www.commonwl.org/v1.2/). -### Extensions +## Extensions The ARC specification can be extended in a backwards compatible way and will evolve over time. This is accomplished through a community-driven ARC discussion forum and pull request mechanisms. All changes that are not backwards compatible with the current ARC specification will be implemented in ARC specification v2.0. -## ARC Structure and Content +# ARC Structure and Content ARCs are based on a strict separation of data and metadata content into study material (*studies*), measurement and assay outcomes (*assays*), computation results (*runs*) and computational workflows (*workflows*) generating the latter. The scope or granularity of an ARC aligns with the necessities of individual projects or large experimental setups. -### High-Level Schema +## High-Level Schema Each ARC is a directory containing the following elements: @@ -85,7 +91,7 @@ Note: - Subdirectories and other files in the top-level `studies`, `assays`, `workflows`, and `runs` directories are viewed as additional payload unless they are accompanied by the corresponding mandatory description (`isa.study.xlsx`, `isa.assay.xlsx`, `workflow.cwl`, `run.cwl`) specified below. 
This is intended to allow gradual migration from existing data storage schemes to the ARC schema. For example, *data files* for an assay may be stored in a subdirectory of `assays/`, but are only identified as an assay of the ARC if metadata is present and complete, including a reference from top-level metadata. -### Example ARC structure +## Example ARC structure ``` @@ -113,7 +119,7 @@ Note: | run.yml [optional] ``` -### ARC Representation +## ARC Representation ARCs are Git repositories, as defined and supported by the [Git C implementation](https://git-scm.org) (version 2.26 or newer) with [Git-LFS extension](https://git-lfs.github.com) (version 2.12.0), or fully compatible implementations. @@ -131,13 +137,13 @@ Notes: - Removing the `.git` top-level subdirectory (and thereby all provenance information captured within the Git history) from a working copy invalidates an ARC. -### ISA-XLSX Format +## ISA-XLSX Format The ISA-XLSX specification is currently part of the ARC specification. Its version therefore follows the version of the ARC specification. https://github.com/nfdi4plants/ARC-specfication/blob/main/ISA-XLSX.md -### Study and Resources +## Study and Resources The characteristics of all material and resources used within the investigation must be specified in a study. Studies must be placed into a unique subdirectory of the top-level `studies` subdirectory. All ISA metadata specific to a single study MUST be annotated in the file `isa.study.xlsx` at the root of the study's subdirectory. This workbook MUST contain a single resources description that can be organized in one or multiple worksheets. @@ -147,7 +153,7 @@ The `study` file MUST follow the [ISA-XLSX study file specification](ISA-XLSX.md Protocols that are necessary to describe the sample or material creating process can be placed under the protocols directory. 
-### Assay Data and Metadata +## Assay Data and Metadata All measurement data sets are considered as assays and are considered immutable input data. Assay data MUST be placed into a unique subdirectory of the top-level `assays` subdirectory. All ISA metadata specific to a single assay MUST be annotated in the file `isa.assay.xlsx` at the root of the assay's subdirectory. This workbook MUST contain a single assay that can be organized in one or multiple worksheets. @@ -167,7 +173,7 @@ Notes: - While assays MAY in principle contain arbitrary data formats, it is highly RECOMMENDED to use community-supported, open formats (see [Best Practices](#best-practices)). -### Workflow Description +## Workflow Description *Workflows* in ARCs are computational steps that are used in computational analysis of an ARC's assays and other data transformation to generate a [run result](#run-description). Typical examples include data cleaning and preprocessing, computational analysis, or visualization. Workflows are used and combined to generate [run results](#run-description), and allow reuse of processing steps across multiple [run results](#run-description). @@ -189,7 +195,7 @@ Notes: - It is strongly encouraged to include author and contributor metadata in tool descriptions and workflow descriptions as [CWL metadata](https://www.commonwl.org/user_guide/17-metadata/index.html). -### Run Description +## Run Description **Runs** in an ARC represent all artefacts that result from some computation on the data within the ARC, i.e. [assays](#assay-data-and-metadata) and [external data](#external-data). These results (e.g. plots, tables, data files, etc. ) MUST reside inside one or more subdirectory of the top-level `runs` directory. @@ -207,7 +213,7 @@ Notes: - It is strongly encouraged to include author and contributor metadata in run descriptions as [CWL metadata](https://www.commonwl.org/user_guide/17-metadata/index.html). 
-### Additional Payload +## Additional Payload ARCs can include additional payload according to user requirements, e.g. presentations, reading material, or manuscripts. While these files can be placed anywhere in the ARC, it is strongly advised to organize these in additional subdirectories. Especially for the storage of protocols, it is RECOMMENDED to place protocols (assay SOPs) in text form with the corresponding assay in /assays//protocol/. @@ -216,7 +222,7 @@ Note: - All data missing proper annotation (e.g. studies, assays, workflows or runs) is considered additional payload independent of its location within the ARC. -### Top-level Metadata and Workflow Description +## Top-level Metadata and Workflow Description *Top-level metadata and workflow description* tie together the elements of an ARC in the contexts of an investigation captured in the `isa.investigation.xlsx` file, which MUST be present. @@ -224,20 +230,20 @@ The `investigation` file MUST follow the [ISA-XLSX investigation file specificat Furthermore, top-level reproducibility information SHOULD be provided in the CWL `arc.cwl`. -#### Investigation and Study Metadata +### Investigation and Study Metadata The ARC root directory is identifiable by the presence of the `isa.investigation.xlsx` file in XLSX format. It contains top-level information about the investigation and MUST link all assays and studies within an ARC. Study and assay objects are registered and grouped with an investigation to record other metadata within the relevant contexts. -#### Top-Level Run Description +### Top-Level Run Description The file `arc.cwl` SHOULD exist at the root directory of each ARC. It describes which runs are executed (and specifically, their order) to (re)produce the computational outputs contained within the ARC. `arc.cwl` MUST be a CWL v1.2 workflow description and adhere to the same requirements as [run descriptions](#run-description). 
In particular, references to study or assay data files, nested workflows MUST use relative paths. An optional file `arc.yml` MAY be provided to specify input parameters. -### Data Path Annotation +## Data Path Annotation All metadata references to files or directories located inside the ARC MUST follow the following patterns: @@ -247,7 +253,7 @@ All metadata references to files or directories located inside the ARC MUST foll - Data nodes in `isa.assay.xlsx` files: The path MAY be specified relative to the `dataset` sub-folder of the assay - Data nodes in `isa.study.xlsx` files: The path MAY be specified relative to the `resources` sub-folder of the study -#### Examples +### Examples In this example, there are two `assays`, with `Assay1`containing a measurement of a `Source` material, producing an output `Raw Data file`. `Assay2` references this `Data file` for producing a new `Derived Data File` @@ -268,7 +274,7 @@ Use of `general pattern` relative paths from the arc root folder: -## Shareable and Publishable ARCs +# Shareable and Publishable ARCs ARCs can be shared in any state. They are considered *publishable* (e.g. for the purpose of minting a DOI) when fulfilling the following conditions: @@ -296,102 +302,156 @@ Notes: - Minimal administrative metadata ensure compliance with DataCite for DOI creation -### Reproducible ARCs +# Reproducible ARCs Reproducibility of ARCs refers mainly to its *runs*. Within an ARC, it MUST be possible to reproduce the *run* data. Therefore, necessary software MUST be available in *workflows*. In the case of non-deterministic software the run results should represent typical examples. -## Mechanism for Quality Control of ARCs +# Mechanisms for ARC Quality Control -ARCs are supposed to be living research objects and are as such never complete. Nevertheless, a mechanism to continuously report the current state and quality of an ARC is indispensable. 
This process is further referred to as _validation_ of the ARC against a _target_, where the _target_ is an arbitrary set of validation cases that the ARC MUST pass to qualify as _valid_ in regard to the _target_. A reference implementation of a framework to create and run targets for ARC validation is provided in the [arc-validate repository](https://github.com/nfdi4plants/arc-validate). +ARCs are supposed to be living research objects and are as such never complete. +Nevertheless, a mechanism to continuously report the current state and quality of an ARC is indispensable. -ARCs MAY be validated against 0 or more targets defined in a [validation_targets.yml file](#structure-of-the-validation_targetsyml-file), where the following criteria MUST be met for each target: -- the target MUST have a unique _name_ across all validation targets used the ARC. This name MUST used for specifying the target in the [validation_targets.yml file](#structure-of-the-validation_targetsyml-file) and the subfolder names in the [validation branch](#structure-of-the-validation-branch) -- target MUST create a `validation_report.*` file that summarizes the results of validating the ARC against the cases defined in the target. The format of this file SHOULD be of an established test result format such as [JUnit XML](https://github.com/windyroad/JUnit-Schema) or [TAP](https://testanything.org/). -- the target MUST create a `badge.svg` file that visually summarizes the results of validating the ARC against the cases defined in the target. The information displayed SHOULD be derivable from the `validation_report.*` file and MUST include the _name_ of the target. 
+## Validation -### Structure of the validation branch +The process of assessing quality parameters of an ARC is further referred to as _validation_ of the ARC against a [_validation package_](#validation-packages), where the _validation package_ is an arbitrary set of [validation cases](#validation-cases) that the ARC MUST pass to qualify as _valid_ in regard to the _validation package_. -To make sure that validation results are bundled with the ARC but do not pollute the commit history, validation results MUST be stored in a separate branch of the ARC repository. This branch: -- MUST be named `validation` +### Validation cases + +A **validation case** is the atomic unit of a [validation package](#validation-packages) describing a single, deterministic and reproducible requirement that the ARC MUST satisfy in order to qualify as _valid_ in regard to it. + +Format and scope of these cases naturally vary depending on the type of ARC, aim of the containing validation package and tools used for creating and performing the validation. +Therefore, no further requirements are made on the format of validation cases. + + example: + + > The following example shows a validation case simply defined using natural language. + + ``` + All Sample names in this ARC must be prefixed with the string "Sample_" + ``` + + Any ARC where all sample names are prefixed with the string "Sample_" would be considered valid in regard to this validation case. + +### Validation packages + +A **validation package** bundles a collection of [validation cases](#validation-cases) that the ARC MUST pass to qualify as _valid_ in regard to the _validation package_ with instructions on how to perform the validation and summarize the results. + +Validation packages + +- MUST be executable. + This can for example be achieved by implementing them in a programming language, a shell script, or a workflow language. + +- MUST validate an ARC against all contained validation cases upon execution. 
+
+- MUST have a globally unique name.
+  This will eventually be enforced by a central validation package registry.
+
+- SHOULD be versioned using [semantic versioning](https://semver.org/)
+
+- MUST create a `validation_report.*` file upon execution that summarizes the results of validating the ARC against the cases defined in the validation package.
+  The format of this file SHOULD be of an established test result format such as [JUnit XML](https://github.com/windyroad/JUnit-Schema) or [TAP](https://testanything.org/).
+
+- MUST create a `badge.svg` file upon execution that visually summarizes the results of validating the ARC against the validation cases defined in the validation package.
+  The information displayed SHOULD be derivable from the `validation_report.*` file and MUST include the _name_ of the validation package.
+
+### Reference implementation
+
+A reference implementation for creating validation cases, validation packages, and validating ARCs against them is provided in the [arc-validate software suit](https://github.com/nfdi4plants/arc-validate)
+
+## Continuous quality control
+
+In addition to manually validating ARCs against validation packages, ARCs MAY be continuously validated against validation packages using a continuous integration (CI) system.
+This process is further referred to as _Continuous Quality Control_ (CQC) of the ARC. CQC can be triggered by any event that is supported by the CI system, e.g. a push to a branch of the ARC repository or a pull request.
+
+### The cqc branch
+
+To make sure that validation results are bundled with the ARC but do not pollute their commit history, validation results MUST be stored in a separate branch of the ARC repository.
+This branch:
+- MUST be named `cqc`
- MUST be an [orphan branch](https://git-scm.com/docs/git-checkout#Documentation/git-checkout.txt---orphanltnew-branchgt)
-- MUST NOT be merged into the `main` branch.
+- MUST NOT be merged into any other branch.
- MUST contain the following folder structure: - `{$branch}/{$commithash}/{$target}`: + `{$branch}/{$commithash}/{$package}`: ``` - validation branch root + cqc branch root └── {$branch} └── {$commithash} - └── {$target} - ├── badge.svg - └── validation_report.xml + └── {$package} ``` where: - `{$branch}` is the name of the branch the validation was run on - - `{$commithash}` is the full hash of the commit the validation was run on. See - - `{$target}` is the name of the target the validation was run against. This folder then MUST contain the files `validation_report.*` and `badge.svg` as described [above](#mechanism-for-quality-control-of-arcs). + - `{$commithash}` is the full hash of the commit the validation was run on. + - `{$package}` is the name of the validation package the validation was run against. + this folder then MUST contain the files `validation_report.*` and `badge.svg` as described in the [validation package specification](#validation-packages). example: - > This example shows the validation results of the `main` and `branch-1` branches of the ARC repository against the `target1` and `target2` targets for two commits per branch: + > This example shows the validation results of the `main` and `branch-1` branches of the ARC repository against the `package1` and `package2` validation packages for two commits per branch: ``` - validation-branch-root + cqc-branch-root ├── branch-1 │ ├── ca82a6dff817ec66f44342007202690a93763949 - │ │ ├── target1 + │ │ ├── package1 │ │ │ ├── badge.svg │ │ │ └── validation_report.xml - │ │ └── target2 + │ │ └── package2 │ │ ├── badge.svg │ │ └── validation_report.xml │ └── 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7 - │ ├── target1 + │ ├── package1 │ │ ├── badge.svg │ │ └── validation_report.xml - │ └── target2 + │ └── package2 │ ├── badge.svg │ └── validation_report.xml └── main ├── 1234567890abcdef1234567890abcdef12345678 - │ ├── target1 + │ ├── package1 │ │ ├── badge.svg │ │ └── validation_report.xml - │ └── target2 + │ 
└── package2 │ ├── badge.svg │ └── validation_report.xml └── a11bef06a3f659402fe7563abf99ad00de2209e6 - ├── target1 + ├── package1 │ ├── badge.svg │ └── validation_report.xml - └── target2 + └── package2 ├── badge.svg └── validation_report.xml ``` -### Structure of the validation_targets.yml file +### The validation_packages.yml file -The `validation_targets.yml` specifies the targets that the branch containing the file will be validated against. Each branch of an ARC MAY contain 0 or 1 `validation_targets.yml` file. If the file is present, it: - - MUST be located in an `.arc` folder in the root of the ARC - - MUST contain the `targets` key which is a list of target names that the current branch will be validated against. +The `validation_packages.yml` specifies the validation packages that the branch containing the file will be validated against. +Each branch of an ARC MAY contain 0 or 1 `validation_packages.yml` file. +If the file is present, it: + - MUST be located in the `.arc` folder in the root of the ARC + - MUST contain the `validation_packages` key which is a list of validation package names that the current branch will be validated against. example: -> This example shows a `validation_targets.yml` file that specifies that the current branch will be validated against the `target1` and `target2` targets. +> This example shows a `validation_packages.yml` file that specifies that the current branch will be validated against the `package1` and `package2` targets. ```yaml -targets: - - target1 - - target2 +validation_packages: + - package1 + - package2 ``` -## Best Practices +### Reference implementation + +PLANTDataHUB performs Continuous Quality Control of ARCs using the [arc-validate software suit]() as described in our 2023 paper [PLANTdataHUB: a collaborative platform for continuous FAIR data sharing in plant research](https://doi.org/10.1111/tpj.16474). 
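To make the above concrete, the validation case from the earlier example (all sample names prefixed with `Sample_`) could be implemented as a minimal executable check that emits a JUnit-style report. This is only an illustrative sketch, not part of the specification and not the `arc-validate` implementation; the report format and the way sample names are supplied are simplified assumptions:

```python
from xml.sax.saxutils import escape

def validate_sample_names(sample_names):
    """Check the single validation case: every sample name MUST start
    with 'Sample_'. Returns one (case_name, error_or_None) pair per sample."""
    results = []
    for name in sample_names:
        error = None if name.startswith("Sample_") else f"'{name}' lacks the 'Sample_' prefix"
        results.append((f"sample-prefix:{name}", error))
    return results

def to_junit_xml(results):
    """Summarize the results as a minimal JUnit-style report body."""
    failures = sum(1 for _, err in results if err)
    lines = [f'<testsuite name="example-package" tests="{len(results)}" failures="{failures}">']
    for case, err in results:
        lines.append(f'  <testcase name="{escape(case)}">')
        if err:
            lines.append(f'    <failure message="{escape(err)}"/>')
        lines.append('  </testcase>')
    lines.append('</testsuite>')
    return "\n".join(lines)

# In a real package, sample names would be read from the ARC's ISA files,
# the report written to `validation_report.xml`, and a badge derived from it.
```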
+
+# Best Practices

In the next section we provide you with Best Practices to make the use of an ARC even more efficient and valuable for open science.

-### Community Specific Data Formats
+## Community Specific Data Formats

It is recommended to use community-specific data formats covering the most common measurement techniques. Using the following recommended formats will ensure improved accessibility and findability:

Notes:

- In case of storing vendor-specific data within an ARC, it is strongly encouraged to accompany them by the corresponding open formats or provide a workflow for conversion or processing where this is possible and considered standard.

-### Compression and Encryption
+## Compression and Encryption

Compression is preferable to save disk space and speed up data transfers, but it is not required. Without compression, workflows are simpler, as transparent compression and decompression is often not available. Uncompressed files are usually easier to index and better searchable. Encryption is not advised (but could be an option to share sensitive data in an otherwise open ARC).

-### Directory and File Naming Conventions
+## Directory and File Naming Conventions

Required files defined in the ARC structure need to be named accordingly. Files and folders specified in angle brackets (`< >`) can be named freely. As the ARC might be used by different persons and in different workflow contexts, we recommend concise filenames without blanks and special characters. Therefore, filenames SHOULD stick to small and capital letters without umlauts, accented and similar special characters. Numbers, hyphens, and underscores are suitable as well. Modern working environments can handle blanks in filenames, but they might confuse automatically run scripts and thus SHOULD be avoided. Depending on the intended number of people the ARC is shared with, certain information might prove useful to provide a fast overview in human-readable form in the filename, e.g.
by providing abbreviations of the project, sub project, person creating or working on a particular dataset. Date and time information might be encoded as well if it provides a better ordering or information for the particular purpose. -## Appendix: Conversion of ARCs to RO Crates +# Appendix: Conversion of ARCs to RO Crates [Research Object (RO) Crate](https://www.researchobject.org/ro-crate/) is a lightweight approach, based on [schema.org](https://schema.org), to package research data together with their metadata. An ARC can be augmented into an RO Crate by placing a metadata file `ro-crate-metadata.json` into the top-level ARC folder, which must conform to the [RO Crate specification](https://www.researchobject.org/ro-crate/1.1/). From 9fa4574ed7da00cc891c329bd6624cdcf01f3b85 Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Wed, 31 Jan 2024 14:07:06 +0100 Subject: [PATCH 21/35] spelling fixes --- ARC specification.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 2bf2440..905b648 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -356,7 +356,7 @@ Validation packages ### Reference implementation -A reference implementation for creating validation cases, validation packages, and validating ARCs against them is provided in the [arc-validate software suit](https://github.com/nfdi4plants/arc-validate) +A reference implementation for creating validation cases, validation packages, and validating ARCs against them is provided in the [arc-validate software suite](https://github.com/nfdi4plants/arc-validate) ## Continuous quality control @@ -365,7 +365,7 @@ This process is further referred to as _Continuous Quality Control_ (CQC) of the ### The cqc branch -To make sure that validation results are bundled with the ARC but do not pollute their commit history, validation results MUST be stored in a separate branch of the ARC repository. 
+To make sure that validation results are bundled with ARCs but do not pollute their commit history, validation results MUST be stored in a separate branch of the ARC repository. This branch: - MUST be named `cqc` - MUST be an [orphan branch](https://git-scm.com/docs/git-checkout#Documentation/git-checkout.txt---orphanltnew-branchgt) @@ -428,7 +428,7 @@ This branch: ### The validation_packages.yml file The `validation_packages.yml` specifies the validation packages that the branch containing the file will be validated against. -Each branch of an ARC MAY contain 0 or 1 `validation_packages.yml` file. +Each branch of an ARC MAY contain 0 or 1 `validation_packages.yml` files. If the file is present, it: - MUST be located in the `.arc` folder in the root of the ARC - MUST contain the `validation_packages` key which is a list of validation package names that the current branch will be validated against. @@ -445,7 +445,7 @@ validation_packages: ### Reference implementation -PLANTDataHUB performs Continuous Quality Control of ARCs using the [arc-validate software suit]() as described in our 2023 paper [PLANTdataHUB: a collaborative platform for continuous FAIR data sharing in plant research](https://doi.org/10.1111/tpj.16474). +PLANTDataHUB performs Continuous Quality Control of ARCs using the [arc-validate software suite]() as described in our 2023 paper [PLANTdataHUB: a collaborative platform for continuous FAIR data sharing in plant research](https://doi.org/10.1111/tpj.16474). 
# Best Practices From 115e8c7b9310b6d992c455676f093b4087e0db8c Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Wed, 31 Jan 2024 14:07:48 +0100 Subject: [PATCH 22/35] fix missing link --- ARC specification.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ARC specification.md b/ARC specification.md index 905b648..3bfa8fa 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -445,7 +445,7 @@ validation_packages: ### Reference implementation -PLANTDataHUB performs Continuous Quality Control of ARCs using the [arc-validate software suite]() as described in our 2023 paper [PLANTdataHUB: a collaborative platform for continuous FAIR data sharing in plant research](https://doi.org/10.1111/tpj.16474). +PLANTDataHUB performs Continuous Quality Control of ARCs using the [arc-validate software suite](https://github.com/nfdi4plants/arc-validate) as described in our 2023 paper [PLANTdataHUB: a collaborative platform for continuous FAIR data sharing in plant research](https://doi.org/10.1111/tpj.16474). # Best Practices From 280fcf278592787df78e18d11f09a5f4a2633079 Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Tue, 13 Feb 2024 14:28:35 +0100 Subject: [PATCH 23/35] Add versioning of validation packages to validation_packages.yml --- ARC specification.md | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 846e483..945c95a 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -433,16 +433,23 @@ The `validation_packages.yml` specifies the validation packages that the branch Each branch of an ARC MAY contain 0 or 1 `validation_packages.yml` files. If the file is present, it: - MUST be located in the `.arc` folder in the root of the ARC - - MUST contain the `validation_packages` key which is a list of validation package names that the current branch will be validated against. 
+ - MUST contain the `validation_packages` key which is a list of validation packages that the current branch will be validated against. + +values of the `validation_packages` list are objects with the following fields: + - `name`: the name of the validation package. This field is mandatory and MUST be included for each validation package object. + - `version`: the version of the validation package. This field is optional and MAY be included for each validation package object. If included, it MUST be a valid [semantic version](https://semver.org/), restricted to MAJOR.MINOR.PATCH format. If not included, this indicates that the latest available version of the validation package will be used. example: -> This example shows a `validation_packages.yml` file that specifies that the current branch will be validated against the `package1` and `package2` targets. +> This example shows a `validation_packages.yml` file that specifies that the current branch will be validated against version `1.0.0` of `package1`, version `2.0.0` of `package2`, and the latest available version of `package3`. 
```yaml validation_packages: - - package1 - - package2 + - name: package1 + version: 1.0.0 + - name: package2 + version: 2.0.0 + - name: package3 ``` ### Reference implementation From 077475a8ad56114bd0c6f62f5d612f2ca36b3da1 Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Wed, 21 Feb 2024 14:35:10 +0100 Subject: [PATCH 24/35] let git do the cqc versioning, add optional version suffix in cqc folder structure --- ARC specification.md | 54 ++++++++++++++++---------------------------- 1 file changed, 20 insertions(+), 34 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 945c95a..f2e600a 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -369,64 +369,50 @@ This process is further referred to as _Continuous Quality Control_ (CQC) of the To make sure that validation results are bundled with ARCs but do not pollute their commit history, validation results MUST be stored in a separate branch of the ARC repository. This branch: + - MUST be named `cqc` - MUST be an [orphan branch](https://git-scm.com/docs/git-checkout#Documentation/git-checkout.txt---orphanltnew-branchgt) - MUST NOT be merged into any other branch. - MUST contain the following folder structure: - `{$branch}/{$commithash}/{$package}`: + `{$branch}/{$package}`: ``` cqc branch root └── {$branch} - └── {$commithash} - └── {$package} + └── {$package} ``` where: - `{$branch}` is the name of the branch the validation was run on - - `{$commithash}` is the full hash of the commit the validation was run on. - `{$package}` is the name of the validation package the validation was run against. this folder then MUST contain the files `validation_report.*` and `badge.svg` as described in the [validation package specification](#validation-packages). + This folder MAY also be suffixed by the version of the validation package via a `@` character followed by the version number of the validation package: `{$package}@{$version}`, e.g. `package1@1.0.0`. 
example: - > This example shows the validation results of the `main` and `branch-1` branches of the ARC repository against the `package1` and `package2` validation packages for two commits per branch: + > This example shows the validation results of the `main` and `branch-1` branches of the ARC repository against the `package1` and `package2` validation packages. for `package2`, an optional version hint of the package is included in the folder name: ``` cqc-branch-root ├── branch-1 - │ ├── ca82a6dff817ec66f44342007202690a93763949 - │ │ ├── package1 - │ │ │ ├── badge.svg - │ │ │ └── validation_report.xml - │ │ └── package2 - │ │ ├── badge.svg - │ │ └── validation_report.xml - │ └── 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7 - │ ├── package1 - │ │ ├── badge.svg - │ │ └── validation_report.xml - │ └── package2 - │ ├── badge.svg - │ └── validation_report.xml + │ ├── package1 + │ │ ├── badge.svg + │ │ └── validation_report.xml + │ └── package2@2.0.0 + │ ├── badge.svg + │ └── validation_report.xml └── main - ├── 1234567890abcdef1234567890abcdef12345678 - │ ├── package1 - │ │ ├── badge.svg - │ │ └── validation_report.xml - │ └── package2 - │ ├── badge.svg - │ └── validation_report.xml - └── a11bef06a3f659402fe7563abf99ad00de2209e6 - ├── package1 - │ ├── badge.svg - │ └── validation_report.xml - └── package2 - ├── badge.svg - └── validation_report.xml + ├── package1 + │ ├── badge.svg + │ └── validation_report.xml + └── package2@2.0.0 + ├── badge.svg + └── validation_report.xml ``` +Commits to the `cqc` branch MUST contain the commit hash of the commit that was validated in the commit message. + ### The validation_packages.yml file The `validation_packages.yml` specifies the validation packages that the branch containing the file will be validated against. @@ -436,7 +422,7 @@ If the file is present, it: - MUST contain the `validation_packages` key which is a list of validation packages that the current branch will be validated against. 
values of the `validation_packages` list are objects with the following fields:
- - `name`: the name of the validation package. This field is mandatory and MUST be included for each validation package object.
+ - `name`: the name of the validation package. This field is mandatory and MUST be included for each validation package object. This name MUST be unique across all validation package objects, which means that only one version of a package can be contained in the file.
 - `version`: the version of the validation package. This field is optional and MAY be included for each validation package object. If included, it MUST be a valid [semantic version](https://semver.org/), restricted to MAJOR.MINOR.PATCH format. If not included, this indicates that the latest available version of the validation package will be used.

example:

From b66322dff8b8567a9dfd1c8f4b88b4c9412b0e6a Mon Sep 17 00:00:00 2001
From: Florian Wetzels <36967183+floWetzels@users.noreply.github.com>
Date: Tue, 27 Feb 2024 11:29:19 +0100
Subject: [PATCH 25/35] Updated ARC RO-Crate Profile Description

---
 ARC specification.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/ARC specification.md b/ARC specification.md
index 945c95a..bb8ef03 100644
--- a/ARC specification.md
+++ b/ARC specification.md
@@ -493,14 +493,15 @@ As the ARC might be used by different persons and in different workflow contexts

An ARC can be augmented into an RO Crate by placing a metadata file `ro-crate-metadata.json` into the top-level ARC folder, which must conform to the [RO Crate specification](https://www.researchobject.org/ro-crate/1.1/).
The ARC root folder is then simultaneously the RO Crate Root and represents an ISA investigation.
The studies, assays and workflows are part of the investigation and linked to it using the typical RO-Crate methodology, e.g. the `hasPart` property of `http://schema.org/Dataset`.
-All four object types follow their corresponding profiles (WIP for studies, assays and workflows). +All four object types follow their [corresponding profiles](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md). It is RECOMMENDED to adhere to the following conventions when creating this file: -- The root data entity follows the [ISA Investigation profile](https://github.com/nfdi4plants/arc-to-rocrate/blob/main/profiles/investigation.md). +- The root data entity follows the [ISA Investigation profile]([https://github.com/nfdi4plants/arc-to-rocrate/blob/main/profiles/investigation.md](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md)). - The root data entity description are taken from the "Investigation Description" term in `isa.investigation.xlsx`. - The root data entity authors are taken from the "Investigation Contacts" in `isa.investigation.xlsx`: - The root data entity citations are taken from the "Investigation Publications" section in `isa.investigation.xlsx`. - For each assay and study linked from `isa.investigation.xlsx`, one dataset entity is provided in `ro-crate-metadata.json`. The Dataset id corresponds to the relative path of the assay ISA file under `assays/`, e.g. "sample-data/isa.assay.xlsx". Other metadata is taken from the corresponding terms in the corresponding `isa.assay.xlsx` or `isa.study.xlsx`. - The root data entity is connected to each assay and study through the `hasPart` Property. +- The assay and study entities follow the [ISA Assay Profile](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md) or the [ISA Study Profile](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md), respectively. It is expected that future versions of this specification will provide additional guidance on a comprehensive conversion of ARC metadata into RO-Crate metadata. 
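As an illustration of the conversion described above, a minimal `ro-crate-metadata.json` skeleton might look as follows. This is a simplified sketch: the assay path, names, and description are hypothetical, and a complete file would carry the full profile-conformant metadata:

```json
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" },
      "about": { "@id": "./" }
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example investigation (hypothetical)",
      "description": "Taken from the 'Investigation Description' term in isa.investigation.xlsx.",
      "hasPart": [{ "@id": "assays/sample-data/isa.assay.xlsx" }]
    },
    {
      "@id": "assays/sample-data/isa.assay.xlsx",
      "@type": "Dataset",
      "name": "Example assay (hypothetical)"
    }
  ]
}
```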
From dd7aa523ca31f0145278ec4b044f18df3937235b Mon Sep 17 00:00:00 2001
From: Florian Wetzels
Date: Wed, 28 Feb 2024 13:10:20 +0100
Subject: [PATCH 26/35] fixed typos in RO-Crate profile description

---
 ARC specification.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ARC specification.md b/ARC specification.md
index bb8ef03..d404dfc 100644
--- a/ARC specification.md
+++ b/ARC specification.md
@@ -496,9 +496,9 @@ The studies, assays and workflows are part of the investigation and linked to it

All four object types follow their [corresponding profiles](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md).
It is RECOMMENDED to adhere to the following conventions when creating this file:
-- The root data entity follows the [ISA Investigation profile]([https://github.com/nfdi4plants/arc-to-rocrate/blob/main/profiles/investigation.md](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md)).
+- The root data entity follows the [ISA Investigation profile](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md).
  - The root data entity description is taken from the "Investigation Description" term in `isa.investigation.xlsx`.
- - The root data entity authors are taken from the "Investigation Contacts" in `isa.investigation.xlsx`:
+ - The root data entity authors are taken from the "Investigation Contacts" in `isa.investigation.xlsx`.
  - The root data entity citations are taken from the "Investigation Publications" section in `isa.investigation.xlsx`.
- For each assay and study linked from `isa.investigation.xlsx`, one dataset entity is provided in `ro-crate-metadata.json`. The Dataset id corresponds to the relative path of the assay ISA file under `assays/`, e.g. "sample-data/isa.assay.xlsx". Other metadata is taken from the corresponding terms in the respective `isa.assay.xlsx` or `isa.study.xlsx`.
- The root data entity is connected to each assay and study through the `hasPart` Property. From 41269403df4a9048bc96412c2bf59a08db2846b4 Mon Sep 17 00:00:00 2001 From: HLWeil Date: Wed, 6 Mar 2024 11:06:07 +0100 Subject: [PATCH 27/35] make some sections in investigation and study metadata sheets optional --- ISA-XLSX.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/ISA-XLSX.md b/ISA-XLSX.md index 04d8e8a..edfbb8b 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -73,6 +73,9 @@ The `Investigation File` MUST contain one [`Top-Level Metadata sheet`](#top-leve - [`INVESTIGATION`](#investigation) - [`INVESTIGATION PUBLICATIONS`](#investigation-publications) - [`INVESTIGATION CONTACTS`](#investigation-contacts) + +Additionally, it MAY contain the following sections: + - [`STUDY`](#study-section) - [`STUDY DESIGN DESCRIPTORS`](#study-design-descriptors) - [`STUDY PUBLICATIONS`](#study-publications) @@ -92,10 +95,13 @@ The `Study File` MUST contain one [`Top-Level Metadata sheet`](#top-level-metada - [`STUDY`](#study-section) - [`STUDY DESIGN DESCRIPTORS`](#study-design-descriptors) - [`STUDY PUBLICATIONS`](#study-publications) +- [`STUDY CONTACTS`](#study-contacts) + +Additionally, it MAY contain the following sections: + - [`STUDY FACTORS`](#study-factors) - [`STUDY ASSAYS`](#study-assays) - [`STUDY PROTOCOLS`](#study-protocols) -- [`STUDY CONTACTS`](#study-contacts) Additionally, the `Study File` SHOULD contain one or more [`Annotation Table sheet(s)`](#annotation-table-sheets), which MAY record provenance of biological samples, from source material through a collection process to sample material. 
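For orientation, the required sections of a `Study File`'s Top-Level Metadata sheet could begin as sketched below. Field labels follow the ISA conventions referenced above; all identifiers and values here are hypothetical:

```
STUDY
Study Identifier                    my-study
Study Title                         Plant growth under drought stress
Study Description                   A hypothetical example study.
STUDY DESIGN DESCRIPTORS
Study Design Type                   time series design
STUDY PUBLICATIONS
Study Publication DOI               10.1000/example-doi
STUDY CONTACTS
Study Person Last Name              Doe
Study Person First Name             Jane
```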
From afc66e9f1d48313fc4aa74bd506e376cbeabef75 Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Tue, 23 Apr 2024 17:28:42 +0200 Subject: [PATCH 28/35] add `arc_specification` key to `validation_packages.yml` --- ARC specification.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 2894393..fee8648 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -29,6 +29,7 @@ Licensed under the Creative Commons License CC BY, Version 4.0; you may not use - [Top-Level Run Description](#top-level-run-description) - [Data Path Annotation](#data-path-annotation) - [Examples](#examples) + - [General Pattern](#general-pattern) - [Shareable and Publishable ARCs](#shareable-and-publishable-arcs) - [Reproducible ARCs](#reproducible-arcs) - [Mechanisms for ARC Quality Control](#mechanisms-for-arc-quality-control) @@ -255,7 +256,7 @@ All metadata references to files or directories located inside the ARC MUST foll ### Examples -##### General Pattern +#### General Pattern In this example, there are two `assays`, with `Assay1`containing a measurement of a `Source` material, producing an output `Data`. `Assay2` references this `Data` for producing a new `Data`. @@ -273,9 +274,6 @@ Use of `general pattern` relative paths from the arc root folder: |----------------------------------|---------------------------------|----------------------------------| | assays/Assay1/dataset/measurement.txt | assays/Assay2/dataset/script.sh | assays/Assay2/dataset/result.txt | - - - # Shareable and Publishable ARCs ARCs can be shared in any state. They are considered *publishable* (e.g. for the purpose of minting a DOI) when fulfilling the following conditions: @@ -418,18 +416,22 @@ Commits to the `cqc` branch MUST contain the commit hash of the commit that was The `validation_packages.yml` specifies the validation packages that the branch containing the file will be validated against. 
Each branch of an ARC MAY contain 0 or 1 `validation_packages.yml` files.
If the file is present, it:
- - MUST be located in the `.arc` folder in the root of the ARC
- - MUST contain the `validation_packages` key which is a list of validation packages that the current branch will be validated against.
-values of the `validation_packages` list are objects with the following fields:
+
+- MAY contain an `arc_specification` key which, when present, MUST contain the version of the ARC specification that the ARC should be validated against. Specification schemas SHOULD be tied to specification releases and SHOULD be directly integrated into tools that can perform validation against validation packages.
+- MUST be located in the `.arc` folder in the root of the ARC
+- MUST contain the `validation_packages` key which is a list of validation packages that the current branch will be validated against.
+
+  values of the `validation_packages` list are objects with the following fields:
+
 - `name`: the name of the validation package. This field is mandatory and MUST be included for each validation package object. This name MUST be unique across all validation package objects, which means that only one version of a package can be contained in the file.
 - `version`: the version of the validation package. This field is optional and MAY be included for each validation package object. If included, it MUST be a valid [semantic version](https://semver.org/), restricted to MAJOR.MINOR.PATCH format. If not included, this indicates that the latest available version of the validation package will be used.

example:

-> This example shows a `validation_packages.yml` file that specifies that the current branch will be validated against version `1.0.0` of `package1`, version `2.0.0` of `package2`, and the latest available version of `package3`.
+> This example shows a `validation_packages.yml` file that specifies that the current branch will be validated against: version `2.0.0-draft` of the ARC specification, version `1.0.0` of `package1`, version `2.0.0` of `package2`, and the latest available version of `package3`.
 
 ```yaml
+arc_specification: 2.0.0-draft
 validation_packages:
   - name: package1
     version: 1.0.0

From a5a029cdb79bb335c20ef654ff06f1aeeb9d39ab Mon Sep 17 00:00:00 2001
From: HLWeil
Date: Thu, 2 May 2024 12:25:52 +0200
Subject: [PATCH 29/35] add Datamap Table sheets section

---
 ISA-XLSX.md | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)

diff --git a/ISA-XLSX.md b/ISA-XLSX.md
index edfbb8b..9b57ceb 100644
--- a/ISA-XLSX.md
+++ b/ISA-XLSX.md
@@ -608,6 +608,8 @@ For example, the `ASSAY PERFORMERS` section of an ISA-XLSX `isa.assay.xlsx` file
 
 # Annotation Table sheets
 
+`Annotation Table sheets` are used to describe the experimental flow in a detailed, machine-readable way. In each sheet, there is a mapping from input entities to output entities, placed in the `Input` and `Output` columns, respectively. The other columns are then used to either describe those entities or the processes that led to this mapping.
+
 In the `Annotation Table sheets`, column headers MUST have the first letter of each word in upper case, with the exception of the referencing label (REF).
 
 The content of the annotation table MUST be placed in an `xlsx table` whose name starts with `annotationTable`. Each sheet MUST contain at most one such annotation table. Only cells inside this table are considered as part of the formatted metadata.
@@ -760,4 +762,69 @@ If we pool two sources into a single sample, we might represent this as:
 | Input [Source Name] | Protocol REF | Output [Sample Name] |
 |---------------|-------------------|---------------|
 | source1 | sample collection | sample1 |
-| source2 | sample collection | sample1 |
\ No newline at end of file
+| source2 | sample collection | sample1 |
+
+# Datamap Table sheets
+
+`Datamap Table sheets` are used to describe the contents of data files.
+
+In the `Datamap Table sheets`, column headers MUST have the first letter of each word in upper case, with the exception of the referencing label (REF).
+
+The content of the datamap table MUST be placed in an `xlsx table` whose name starts with `datamapTable`. Each sheet MUST contain at most one such datamap table. Only cells inside this table are considered as part of the formatted metadata.
+
+`Datamap Table sheets` are structured with fields organized on a per-row basis. The first row MUST be used for column headers. Each body row is an implementation of a `data` node.
+
+## Data column
+
+Every `Datamap Table sheet` MUST contain a `Data` column. Every object in this column MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. If the annotation of the `Data` node refers not to the complete resource, but a part of it, a `Selector` MAY be added. This Selector MUST be separated from the resource location using a `#` — with no whitespace between: `location#selector`. If appropriate, the Selector SHOULD be formatted according to IRI fragment selectors specified by [W3](https://www.w3.org/TR/annotation-model/#fragment-selector).
+
+The format of the data resource MAY be further qualified using a `Data Format` column.
The `Data Format` SHOULD be expressed using a [MIME format](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), most commonly consisting of two parts: a type and a subtype, separated by a slash (/) — with no whitespace between: `type/subtype`. If appropriate, a format from the list composed by [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml)
+SHOULD be picked. Unregistered or niche encoding and file formats MAY be indicated instead via the most appropriate URL.
+
+The format and usage info about the Selector MAY be further qualified using a `Data Selector Format` column. The `Data Selector Format` SHOULD point to a web resource containing instructions about how the Selector is formatted and how it should be interpreted.
+
+## Explication column
+
+Every `Datamap Table sheet` SHOULD contain an `Explication` column. The `Explication` adds explicit meaning to the data node. The value MUST be free text, or an [`Ontology Annotation`](#ontology-annotations).
+
+| Explication | Term Source REF | Term Accession Number |
+|------------------------|-------------------|-------------------------|
+| average value | OBI | [http://…/obo/OBI_0000679](http://purl.obolibrary.org/obo/OBI_0000679) |
+
+## Unit column
+
+Every `Datamap Table sheet` SHOULD contain a `Unit` column. The `Unit` adds a unit of measurement to the data node. The value MUST be free text, or an [`Ontology Annotation`](#ontology-annotations).
+
+| Unit | Term Source REF | Term Accession Number |
+|------------------------|-------------------|-------------------------|
+| gram per milliliter | UO | [http://…/obo/UO_0000173](http://purl.obolibrary.org/obo/UO_0000173) |
+
+## Object Type column
+
+Every `Datamap Table sheet` SHOULD contain an `Object Type` column. The `Object Type` defines the shape or format in which the data node is represented. The value MUST be free text, or an [`Ontology Annotation`](#ontology-annotations).
+ +| Object Type | Term Source REF | Term Accession Number | +|------------------------|-------------------|-------------------------| +| Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) | + +## Generated by column + +Every `Datamap Table sheet` SHOULD contain an `Generated By` column. The `Generated By` names the tool which led to the creation of the data node. The value MUST be free text. + +If possible, the value in this column MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. + +| Generated By | +|------------------------| +| GeneStatisticsTool.exe | + +## Examples + +For example, a simple `datamap` table representing a tabular datafile might look as follows: + +| Data | Explication | Term Source REF | Term Accession Number | Unit | Term Source REF | Term Accession Number | Object Type | Term Source REF | Term Accession Number | GeneratedBy | +|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------| +| MyData.csv#col=1 | Gene Identifier | NCIT | [http://…/obo/NCIT_C48664](http://purl.obolibrary.org/obo/NCIT_C48664) | | | | String | NCIT | [http://…/obo/NCIT_C45253](http://purl.obolibrary.org/obo/NCIT_C45253) |GeneStatisticsTool.exe | +| MyData.csv#col=2 | average value | OBI | [http://…/obo/OBI_0000679](http://purl.obolibrary.org/obo/OBI_0000679) | gram per milliliter | UO | [http://…/obo/UO_0000173](http://purl.obolibrary.org/obo/UO_0000173) | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) |GeneStatisticsTool.exe | +| MyData.csv#col=3 | p-value | OBI | [http://…/obo/OBI_0000175](http://purl.obolibrary.org/obo/OBI_0000175) | | | | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) |GeneStatisticsTool.exe | + +In this example, 
the `datamap` table describes a single data file named `MyData.csv`. This file contains three columns. The first column contains gene identifiers, the other two results of a statistical analysis performed by the tool GeneStatisticsTool.exe. \ No newline at end of file From e8dd204cd5fd154d2ca1665244d6cffc6bface79 Mon Sep 17 00:00:00 2001 From: HLWeil Date: Thu, 2 May 2024 14:05:24 +0200 Subject: [PATCH 30/35] add datamap file section to isa-xlsx --- ISA-XLSX.md | 55 ++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 46 insertions(+), 9 deletions(-) diff --git a/ISA-XLSX.md b/ISA-XLSX.md index 9b57ceb..693525a 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -9,6 +9,7 @@ This document describes the ISA Abstract Model reference implementation specifie - [Investigation File](#investigation-file) - [Study File](#study-file) - [Assay File](#assay-file) +- [Datamap File](#datamap-file) - [Top-level metadata sheets](#top-level-metadata-sheets) - [Ontology Source Reference section](#ontology-source-reference-section) - [INVESTIGATION section](#investigation-section) @@ -24,7 +25,17 @@ This document describes the ISA Abstract Model reference implementation specifie - [Components](#components) - [Parameters](#parameters) - [Comments](#comments) - - [Examples](#examples) + - [Examples](#examples-1) +- [Datamap Table sheets](#datamap-table-sheets) + - [Data](#data-column) + - [Explication](#explication-column) + - [Unit](#unit-column) + - [Object Type](#object-type-column) + - [Description](#description-column) + - [Generated By](#generated-by-column) + - [Comment](#comments-1) + - [Examples](#examples-2) + Below we provide the schemas and the content rules for valid ISA-XLSX documents. @@ -124,6 +135,16 @@ Therefore, the main entities of the `Assay File` should be `Samples` and `Data`. The `Assay File` implements the [`Assay`](https://isa-specs.readthedocs.io/en/latest/isamodel.html#assay) graph from the ISA Abstract Model. 
 
+# Datamap File
+
+The `Datamap` represents a set of explanations about the `data` entities defined in `assays` and `studies`.
+
+The `Datamap File` MUST contain one [`Datamap table sheet`](#datamap-table-sheets). This sheet MUST be named `isa_datamap`.
+
+Therefore, the main entities of the `Datamap File` should be `Data`.
+
+The `Datamap File` acts as an extension of the `data` nodes defined in the [`Study and Assay graphs section`](https://isa-specs.readthedocs.io/en/latest/isamodel.html#study-and-assay-graphs) from the ISA Abstract Model.
+
 # Top-level metadata sheets
 
 The purpose of top-level metadata sheets is aggregating and listing top-level metadata. Each sheet consists of sections consisting of a section header and key-value fields. Section headers MUST be completely written in upper case (e.g. STUDY), field headers MUST have the first letter of each word in upper case (e.g. Study Identifier); with the exception of the referencing label (REF).
@@ -797,7 +818,7 @@ Every `Datamap Table sheet` SHOULD contain an `Unit` column. The `Unit` adds a u
 
 | Unit | Term Source REF | Term Accession Number |
 |------------------------|-------------------|-------------------------|
-| gram per milliliter | UO | [http://…/obo/UO_0000173](http://purl.obolibrary.org/obo/UO_0000173) |
+| milligram per milliliter | UO | [http://…/obo/UO_0000176](http://purl.obolibrary.org/obo/UO_0000176) |
 
 ## Object Type column
 
@@ -807,9 +828,17 @@ Every `Datamap Table sheet` SHOULD contain an `Object Type` column. The `Object
 |------------------------|-------------------|-------------------------|
 | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) |
 
-## Generated by column
+## Description column
+
+Every `Datamap Table sheet` SHOULD contain a `Description` column. The `Description` gives additional, human-readable context about the data node. The value MUST be free text.
+
+| Description |
+|------------------------|
+| The average protein concentration for the given gene |
+
+## Generated By column
 
-Every `Datamap Table sheet` SHOULD contain an `Generated By` column. The `Generated By` names the tool which led to the creation of the data node. The value MUST be free text.
+Every `Datamap Table sheet` SHOULD contain a `Generated By` column. The `Generated By` names the tool which led to the creation of the data node. The value MUST be free text.
 
 If possible, the value in this column MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns.
 
@@ -817,14 +846,22 @@ If possible, the value in this column MUST correspond to a relevant data resourc
 |------------------------|
 | GeneStatisticsTool.exe |
 
+## Comments
+
+A `Comment` can be used to provide some additional information. Columns headed with `Comment[]` MAY appear anywhere in the Datamap Table. The comment always refers to the Datamap Table. The value MUST be free text.
+ +| Comment [Answer to everything] | +|--------------------------------| +| forty-two | + ## Examples For example, a simple `datamap` table representing a tabular datafile might look as follows: -| Data | Explication | Term Source REF | Term Accession Number | Unit | Term Source REF | Term Accession Number | Object Type | Term Source REF | Term Accession Number | GeneratedBy | -|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------| -| MyData.csv#col=1 | Gene Identifier | NCIT | [http://…/obo/NCIT_C48664](http://purl.obolibrary.org/obo/NCIT_C48664) | | | | String | NCIT | [http://…/obo/NCIT_C45253](http://purl.obolibrary.org/obo/NCIT_C45253) |GeneStatisticsTool.exe | -| MyData.csv#col=2 | average value | OBI | [http://…/obo/OBI_0000679](http://purl.obolibrary.org/obo/OBI_0000679) | gram per milliliter | UO | [http://…/obo/UO_0000173](http://purl.obolibrary.org/obo/UO_0000173) | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) |GeneStatisticsTool.exe | -| MyData.csv#col=3 | p-value | OBI | [http://…/obo/OBI_0000175](http://purl.obolibrary.org/obo/OBI_0000175) | | | | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) |GeneStatisticsTool.exe | +| Data | Explication | Term Source REF | Term Accession Number | Unit | Term Source REF | Term Accession Number | Object Type | Term Source REF | Term Accession Number | Description |GeneratedBy | +|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------|---------------| +| MyData.csv#col=1 | Gene Identifier | NCIT | [http://…/obo/NCIT_C48664](http://purl.obolibrary.org/obo/NCIT_C48664) | | | | String | NCIT | [http://…/obo/NCIT_C45253](http://purl.obolibrary.org/obo/NCIT_C45253) | Short hand 
identifier of the gene coding for the protein. | GeneStatisticsTool.exe | +| MyData.csv#col=2 | average value | OBI | [http://…/obo/OBI_0000679](http://purl.obolibrary.org/obo/OBI_0000679) | milligram per milliliter | UO | [http://…/obo/UO_0000176](http://purl.obolibrary.org/obo/UO_0000176) | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) | The average protein concentration for the given gene |GeneStatisticsTool.exe | +| MyData.csv#col=3 | p-value | OBI | [http://…/obo/OBI_0000175](http://purl.obolibrary.org/obo/OBI_0000175) | | | | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) | p-value of t-test against control. | GeneStatisticsTool.exe | In this example, the `datamap` table describes a single data file named `MyData.csv`. This file contains three columns. The first column contains gene identifiers, the other two results of a statistical analysis performed by the tool GeneStatisticsTool.exe. \ No newline at end of file From 41d513f25daa5400e959d4e5457357d4f08ccac5 Mon Sep 17 00:00:00 2001 From: HLWeil Date: Thu, 2 May 2024 14:21:19 +0200 Subject: [PATCH 31/35] reference datamap file from arc assay and study specifications --- ARC specification.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index 2894393..07d5981 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -75,9 +75,9 @@ ARCs are based on a strict separation of data and metadata content into study ma Each ARC is a directory containing the following elements: - *Studies* are collections of material and resources used within the investigation. -Metadata that describe the characteristics of material and resources follow the ISA study model. Study-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a file `isa.study.xlsx`, which MUST exist to specify the input material or data resources. 
Resources MAY include biological materials (e.g. plant samples, analytical standards) created during the current investigation. Resources MAY further include external data (e.g., knowledge files, results files) that need to be included and cannot be referenced due to external limitations. Resources described in a study file can be the input for one or multiple assays. Further details on `isa.study.xlsx` are specified [below](#study-and-resources). Resource (descriptor) files MUST be placed in a `resources` subdirectory. +Metadata that describe the characteristics of material and resources follow the ISA study model. Study-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.study.xlsx` file, which MUST exist to specify the input material or data resources. Resources MAY include biological materials (e.g. plant samples, analytical standards) created during the current investigation. Resources MAY further include external data (e.g., knowledge files, results files) that need to be included and cannot be referenced due to external limitations. Resources described in a study file can be the input for one or multiple assays. Further details on `isa.study.xlsx` are specified [below](#study-and-resources). Resource (descriptor) files MUST be placed in a `resources` subdirectory. Further explications about data entities defined in the study MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.datamap.xlsx` file, which SHOULD exist for studies containing data. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). -- *Assays* correspond to outcomes of experimental assays or analytical measurements (in the interpretation of the ISA model) and are treated as immutable data. Each assay is a collection of files, together with a corresponding metadata file, stored in a subdirectory of the top-level subdirectory `assays`. 
Assay-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a file `isa.assay.xlsx`, which MUST exist for each assay. Further details on `isa.assay.xlsx` are specified [below](#assay-data-and-metadata). Assay data files MUST be placed in a `dataset` subdirectory. +- *Assays* correspond to outcomes of experimental assays or analytical measurements (in the interpretation of the ISA model) and are treated as immutable data. Each assay is a collection of files, together with a corresponding metadata file, stored in a subdirectory of the top-level subdirectory `assays`. Assay-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.assay.xlsx` file, which MUST exist for each assay. Further details on `isa.assay.xlsx` are specified [below](#assay-data-and-metadata). Assay data files MUST be placed in a `dataset` subdirectory. Further explications about data entities defined in the assay MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.datamap.xlsx` file, which SHOULD exist for each assay. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). - *Workflows* represent data analysis routines (in the sense of CWL tools and workflows) and are a collection of files, together with a corresponding CWL description, stored in a single directory under the top-level `workflows` subdirectory. A per-workflow executable CWL description is stored in `workflow.cwl`, which MUST exist for all ARC workflows. Further details on workflow descriptions are given [below](#workflow-description). @@ -101,11 +101,13 @@ Note: \--- studies \--- | isa.study.xlsx + | isa.datamap.xlsx \--- resources \--- protocol [optional / add. payload] \--- assays \--- | isa.assay.xlsx + | isa.datamap.xlsx \--- dataset \--- protocol [optional / add. 
payload]
 \--- workflows
@@ -153,12 +155,16 @@ The `study` file MUST follow the [ISA-XLSX study file specification](ISA-XLSX.md
 
 Protocols that are necessary to describe the sample or material creating process can be placed under the protocols directory.
 
+Further explications about data entities defined in the study MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.datamap.xlsx` file, which SHOULD exist for each study. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file).
+
 ## Assay Data and Metadata
 
 All measurement data sets are considered as assays and are considered immutable input data. Assay data MUST be placed into a unique subdirectory of the top-level `assays` subdirectory. All ISA metadata specific to a single assay MUST be annotated in the file `isa.assay.xlsx` at the root of the assay's subdirectory. This workbook MUST contain a single assay that can be organized in one or multiple worksheets.
 
 The `assay` file MUST follow the [ISA-XLSX assay file specification](ISA-XLSX.md#assay-file).
 
+Further explications about data entities defined in the assay MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.datamap.xlsx` file, which SHOULD exist for each assay. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file).
+
 Notes:
 
 - There are no requirements on specific assay-level metadata per formal ARC definition. Conversion of ARCs into other repository or archival formats (e.g. PRIDE, GEO, ENA) may however mandate the presence of specific terms required in the destination format.
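The directory layout and datamap placement rules above can be sketched as a small check. This is a non-normative editorial illustration, not part of the specification; the helper name `arc_layout_issues` and the `error:`/`hint:` labels are assumptions chosen here to mirror the MUST/SHOULD requirement levels.

```python
from pathlib import Path


def arc_layout_issues(arc_root: str) -> list[str]:
    """Report deviations from the ARC layout described above.

    Non-normative sketch: missing MUST-level files are reported as
    errors, missing SHOULD-level files (isa.datamap.xlsx) as hints.
    """
    root = Path(arc_root)
    issues = []

    # Top-level investigation metadata MUST be present.
    if not (root / "isa.investigation.xlsx").is_file():
        issues.append("error: isa.investigation.xlsx missing at ARC root")

    def subdirs(p: Path):
        # Tolerate ARCs that contain no studies/ or assays/ directory yet.
        return [d for d in p.iterdir() if d.is_dir()] if p.is_dir() else []

    for study in subdirs(root / "studies"):
        if not (study / "isa.study.xlsx").is_file():
            issues.append(f"error: studies/{study.name}: isa.study.xlsx missing")
        if not (study / "isa.datamap.xlsx").is_file():
            issues.append(f"hint: studies/{study.name}: no isa.datamap.xlsx "
                          "(SHOULD exist for studies containing data)")

    for assay in subdirs(root / "assays"):
        if not (assay / "isa.assay.xlsx").is_file():
            issues.append(f"error: assays/{assay.name}: isa.assay.xlsx missing")
        if not (assay / "dataset").is_dir():
            issues.append(f"error: assays/{assay.name}: dataset/ subdirectory missing")
        if not (assay / "isa.datamap.xlsx").is_file():
            issues.append(f"hint: assays/{assay.name}: no isa.datamap.xlsx "
                          "(SHOULD exist for each assay)")

    return issues
```

A check of this shape could run early in a CQC pipeline, surfacing MUST violations before any validation packages execute; splitting the output into errors and hints keeps the distinction between hard requirements and recommendations visible.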
From 332935294b95c8e0a679894369dff10e963a1303 Mon Sep 17 00:00:00 2001 From: HLWeil Date: Thu, 6 Jun 2024 10:10:42 +0200 Subject: [PATCH 32/35] add manage-issues github workflow --- .github/workflows/manage-issues.yml | 31 +++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) create mode 100644 .github/workflows/manage-issues.yml diff --git a/.github/workflows/manage-issues.yml b/.github/workflows/manage-issues.yml new file mode 100644 index 0000000..34b85b6 --- /dev/null +++ b/.github/workflows/manage-issues.yml @@ -0,0 +1,31 @@ +name: Manage issues + +on: + issues: + types: + - opened + - reopened + +jobs: + label_issues: + runs-on: ubuntu-latest + permissions: + issues: write + steps: + - run: gh issue edit "$NUMBER" --add-label "$LABELS" + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + GH_REPO: ${{ github.repository }} + NUMBER: ${{ github.event.issue.number }} + LABELS: "Status: Needs Triage" + + add-to-project: + name: Add issue to project + runs-on: ubuntu-latest + steps: + - uses: actions/add-to-project@v1.0.1 + with: + # You can target a project in a different organization + # to the issue + project-url: https://github.com/orgs/nfdi4plants/projects/10 + github-token: ${{ secrets.ADD_TO_PROJECT_PAT }} From 4b4c29ce5dcca3fc16959f7c91e4d744735b5a83 Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Thu, 6 Jun 2024 11:45:03 +0200 Subject: [PATCH 33/35] Add validation_summary.json specs, package metadata, and ARC apps --- ARC specification.md | 158 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 157 insertions(+), 1 deletion(-) diff --git a/ARC specification.md b/ARC specification.md index 49431bf..fbccc44 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -40,6 +40,7 @@ Licensed under the Creative Commons License CC BY, Version 4.0; you may not use - [Continuous quality control](#continuous-quality-control) - [The cqc branch](#the-cqc-branch) - [The validation\_packages.yml file](#the-validation_packagesyml-file) + - [ARC 
Apps](#arc-apps)
 - [Reference implementation](#reference-implementation-1)
 - [Best Practices](#best-practices)
 - [Community Specific Data Formats](#community-specific-data-formats)
@@ -354,11 +355,128 @@ Validation packages
 
 - SHOULD be versioned using [semantic versioning](https://semver.org/)
 
+- MUST be enriched with the following mandatory metadata in an appropriate way (e.g. via yaml frontmatter, tables in a database, etc.):
+  | Field | Type | Description |
+  | --- | --- | --- |
+  | Name | string | the name of the package |
+  | Version | string | the version of the package |
+  | Summary | string | a single sentence description (<=50 words) of the package |
+  | Description | string | an unconstrained free text description of the package |
+
+- MAY be enriched with the following optional metadata in an appropriate way (e.g. via yaml frontmatter, tables in a database, etc.):
+  | Field | Type | Description |
+  | --- | --- | --- |
+  | HookEndpoint | string | A URL to trigger subsequent events based on the result of executing the validation package in a CQC context, see [Continuous quality control](#continuous-quality-control) and [ARC Apps](#arc-apps) |
+
+- MAY be enriched with any additional metadata in an appropriate way (e.g. via yaml frontmatter, tables in a database, etc.).
+
 - MUST create a `validation_report.*` file upon execution that summarizes the results of validating the ARC against the cases defined in the validation package. The format of this file SHOULD be of an established test result format such as [JUnit XML](https://github.com/windyroad/JUnit-Schema) or [TAP](https://testanything.org/).
 
 - MUST create a `badge.svg` file upon execution that visually summarizes the results of validating the ARC against the validation cases defined in the validation package.
-  The information displayed SHOULD be derivable from the `validation_report.*` file and MUST include the _name_ of the validation package.
+ The information displayed SHOULD be derivable from the `validation_report.*` file and MUST include the _Name_ of the validation package. + +- MUST create a `validation_summary.json` file upon execution, which contains the mandatory and optional metadata specified above, and a high-level summary of the execution of the validation package following this schema: +
+ validation_summary.json schema + + ```json + { + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "Critical": { + "type": "object", + "properties": { + "HasFailures": { + "type": "boolean" + }, + "Total": { + "type": "integer" + }, + "Passed": { + "type": "integer" + }, + "Failed": { + "type": "integer" + }, + "Errored": { + "type": "integer" + } + }, + "required": [ + "HasFailures", + "Total", + "Passed", + "Failed", + "Errored" + ] + }, + "NonCritical": { + "type": "object", + "properties": { + "HasFailures": { + "type": "boolean" + }, + "Total": { + "type": "integer" + }, + "Passed": { + "type": "integer" + }, + "Failed": { + "type": "integer" + }, + "Errored": { + "type": "integer" + } + }, + "required": [ + "HasFailures", + "Total", + "Passed", + "Failed", + "Errored" + ] + }, + "ValidationPackage": { + "type": "object", + "properties": { + "Name": { + "type": "string" + }, + "Version": { + "type": "string" + }, + "Summary": { + "type": "string" + }, + "Description": { + "type": "string" + }, + "HookEndpoint": { + "type": "string" + } + }, + "required": [ + "Name", + "Version", + "Summary", + "Description" + ] + } + }, + "required": [ + "Critical", + "NonCritical", + "ValidationPackage" + ] + } + ``` + +
+
+- SHOULD aggregate the result files in an appropriately named subdirectory.
 
 ### Reference implementation
 
@@ -446,10 +564,48 @@ validation_packages:
   - name: package3
 ```
 
+### ARC Apps
+
+Continuous Quality Control makes it possible to check at any time in the ARC life cycle whether an ARC passes certain criteria or not.
+
+However, **whether** an ARC is valid for a given _target_ is only half of the equation - the other half is taking some kind of action based on this information. One large field of actions here is the publication of the ARC or (some) of its contents to an **endpoint repository (ER)** (e.g. [PRIDE](https://www.ebi.ac.uk/pride/), [ENA](https://www.ebi.ac.uk/ena/browser/home)).
+
+In this example, a validation package SHOULD only determine if the content _COULD_ be published to the ER, and a subsequent service SHOULD then take the respective action based on the reported result of that package (e.g. fixing errors based on the report, or publishing the content to the ER).
+
+**ARC apps** are services that provide URLs called _(CQC) Hook Endpoints_ that can be triggered manually or by the result of a validation package. They are intended to automate the process of taking action based on the result of a validation package.
+
 ### Reference implementation
 
 PLANTDataHUB performs Continuous Quality Control of ARCs using the [arc-validate software suite](https://github.com/nfdi4plants/arc-validate) as described in our 2023 paper [PLANTdataHUB: a collaborative platform for continuous FAIR data sharing in plant research](https://doi.org/10.1111/tpj.16474).
+The following sequence diagram shows the conceptual implementation of CQC pipelines in conjunction with ARC Apps connected via CQC Hooks on the reference DataHUB instance with the following participants: + +- **User**: The user who works on an ARC published on the DataHUB +- **ARC**: The ARC repository on the DataHUB +- **DataHUB**: The DataHUB instance +- **ARC App**: A service that provides a CQC Hook Endpoint to perform actions based on validation results and/or user input + +```mermaid +sequenceDiagram + + participant User + participant ARC + participant DataHUB + participant ARC App + + Note over User, DataHUB: Validation (CQC pipeline) + User ->> ARC : commit + DataHUB ->> DataHUB : trigger validation for commit + DataHUB ->> ARC : commit validation results
to cqc branch + DataHUB ->> ARC : create badge + Note over User, ARC App: QCQ Hooks + User ->> ARC App : click on badge link + DataHUB ->> ARC App : trigger some action based on validation results + ARC App ->> DataHUB : Request relevant information + DataHUB ->> ARC App : send relevant information (when granted access) + ARC App ->> ARC App : Perform action with retrieved data +``` + # Best Practices In the next section we provide you with Best Practices to make the use of an ARC even more efficient and valuable for open science. From 53439efdc90aecb0091db6d241baa73ab587d377 Mon Sep 17 00:00:00 2001 From: Kevin Schneider Date: Fri, 7 Jun 2024 16:15:01 +0200 Subject: [PATCH 34/35] typo fix --- ARC specification.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ARC specification.md b/ARC specification.md index fbccc44..aaff5e8 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -598,7 +598,7 @@ sequenceDiagram DataHUB ->> DataHUB : trigger validation for commit DataHUB ->> ARC : commit validation results
to cqc branch DataHUB ->> ARC : create badge - Note over User, ARC App: QCQ Hooks + Note over User, ARC App: CQC Hooks User ->> ARC App : click on badge link DataHUB ->> ARC App : trigger some action based on validation results ARC App ->> DataHUB : Request relevant information From 811870ea1ac84fdb7e5c08dad55f6bfb99cdc5dc Mon Sep 17 00:00:00 2001 From: HLWeil Date: Mon, 10 Jun 2024 15:27:00 +0200 Subject: [PATCH 35/35] small changes and additions according to PR review for v2.0.0 --- ARC specification.md | 37 ++++++++++++++++++++++++++++--------- ISA-XLSX.md | 2 +- 2 files changed, 29 insertions(+), 10 deletions(-) diff --git a/ARC specification.md b/ARC specification.md index aaff5e8..2c822b6 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -76,10 +76,9 @@ ARCs are based on a strict separation of data and metadata content into study ma Each ARC is a directory containing the following elements: -- *Studies* are collections of material and resources used within the investigation. -Metadata that describe the characteristics of material and resources follow the ISA study model. Study-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.study.xlsx` file, which MUST exist to specify the input material or data resources. Resources MAY include biological materials (e.g. plant samples, analytical standards) created during the current investigation. Resources MAY further include external data (e.g., knowledge files, results files) that need to be included and cannot be referenced due to external limitations. Resources described in a study file can be the input for one or multiple assays. Further details on `isa.study.xlsx` are specified [below](#study-and-resources). Resource (descriptor) files MUST be placed in a `resources` subdirectory. Further explications about data entities defined in the study MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.datamap.xlsx` file, which SHOULD exist for studies containing data. 
Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). +- *Studies* are collections of material and resources used within the investigation. Study-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.study.xlsx` file, which MUST exist to specify the input material or data resources. Resources MAY include biological materials (e.g. plant samples, analytical standards) created during the current investigation. Resources MAY further include external data (e.g., knowledge files, results files) that need to be included and cannot be referenced due to external limitations. Resources described in a study file can be the input for one or multiple assays. Further details on `isa.study.xlsx` are specified [below](#study-and-resources). Resource (descriptor) files MUST be placed in a `resources` subdirectory. Further explications about data entities defined in the study are stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.datamap.xlsx` file, which SHOULD exist for studies containing data. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). -- *Assays* correspond to outcomes of experimental assays or analytical measurements (in the interpretation of the ISA model) and are treated as immutable data. Each assay is a collection of files, together with a corresponding metadata file, stored in a subdirectory of the top-level subdirectory `assays`. Assay-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.assay.xlsx` file, which MUST exist for each assay. Further details on `isa.assay.xlsx` are specified [below](#assay-data-and-metadata). Assay data files MUST be placed in a `dataset` subdirectory. Further explications about data entities defined in the assay MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.datamap.xlsx` file, which SHOULD exist for each assay. 
Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). +- *Assays* correspond to outcomes of experimental assays or analytical measurements (in the interpretation of the ISA model) and are treated as immutable data. Each assay is a collection of files, together with a corresponding metadata file, stored in a subdirectory of the top-level subdirectory `assays`. Assay-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.assay.xlsx` file, which MUST exist for each assay. Further details on `isa.assay.xlsx` are specified [below](#assay-data-and-metadata). Assay data files MUST be placed in a `dataset` subdirectory. Further explications about data entities defined in the assay are stored in [ISA-XLSX](#isa-xlsx-format) format in a `isa.datamap.xlsx` file, which SHOULD exist for each assay. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). - *Workflows* represent data analysis routines (in the sense of CWL tools and workflows) and are a collection of files, together with a corresponding CWL description, stored in a single directory under the top-level `workflows` subdirectory. A per-workflow executable CWL description is stored in `workflow.cwl`, which MUST exist for all ARC workflows. Further details on workflow descriptions are given [below](#workflow-description). @@ -103,13 +102,13 @@ Note: \--- studies \--- | isa.study.xlsx - | isa.datamap.xlsx + | isa.datamap.xlsx [optional] \--- resources \--- protocol [optional / add. payload] \--- assays \--- | isa.assay.xlsx - | isa.datamap.xlsx + | isa.datamap.xlsx [optional] \--- dataset \--- protocol [optional / add. payload] \--- workflows @@ -255,7 +254,7 @@ The file `arc.cwl` SHOULD exist at the root directory of each ARC. 
It describes All metadata references to files or directories located inside the ARC MUST follow the following patterns: -- The `general pattern`, which is universally applicable and SHOULD be used is to specify the path relative to the ARC root. +- The `general pattern`, which is universally applicable and SHOULD be used to specify the path relative to the ARC root. - The `folder specific pattern`, which MAY be used only in specific metadata contexts: - Data nodes in `isa.assay.xlsx` files: The path MAY be specified relative to the `dataset` sub-folder of the assay @@ -265,22 +264,42 @@ All metadata references to files or directories located inside the ARC MUST foll #### General Pattern -In this example, there are two `assays`, with `Assay1`containing a measurement of a `Source` material, producing an output `Data`. `Assay2` references this `Data` for producing a new `Data`. +In this example, there are two `assays`, with `Assay1` containing a measurement of a `Source` material, producing an output `Data`. `Assay2` references this `Data` for producing a new `Data`. 
Use of `general pattern` relative paths from the ARC root folder:

`assays/Assay1/isa.assay.xlsx`:

-| Input [Source Name] | Parameter[Instrument model] | Output [Data] |
+| Input [Source Name] | Component [Instrument model] | Output [Data] |
|-------------|---------------------------------|----------------------------------|
| input | Bruker 500 Avance | assays/Assay1/dataset/measurement.txt |

`assays/Assay2/isa.assay.xlsx`:

-| Input [Data] | Parameter[script file] | Output [Data] |
+| Input [Data] | Component [script file] | Output [Data] |
|----------------------------------|---------------------------------|----------------------------------|
| assays/Assay1/dataset/measurement.txt | assays/Assay2/dataset/script.sh | assays/Assay2/dataset/result.txt |

+#### Folder Specific Pattern
+
+In this example, there are two `assays`, with `Assay1` containing a measurement of a `Source` material, producing an output `Data`. `Assay2` references this `Data` for producing a new `Data`.
+
+Use of `folder specific pattern` relative paths from the `Assay1` and `Assay2` `dataset` folders, respectively:
+
+`assays/Assay1/isa.assay.xlsx`:
+
+| Input [Source Name] | Component [Instrument model] | Output [Data] |
+|-------------|---------------------------------|----------------------------------|
+| input | Bruker 500 Avance | measurement.txt |
+
+`assays/Assay2/isa.assay.xlsx`:
+
+| Input [Data] | Component [script file] | Output [Data] |
+|----------------------------------|---------------------------------|----------------------------------|
+| assays/Assay1/dataset/measurement.txt | script.sh | result.txt |
+
+Note that to reference `Data` which is part of `Assay1` in `Assay2`, the `general pattern` is necessary either way. It is therefore considered the more broadly applicable and recommended pattern.
+
 # Shareable and Publishable ARCs
 
 ARCs can be shared in any state. They are considered *publishable* (e.g.
for the purpose of minting a DOI) when fulfilling the following conditions: diff --git a/ISA-XLSX.md b/ISA-XLSX.md index 693525a..69de5d7 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -791,7 +791,7 @@ If we pool two sources into a single sample, we might represent this as: In the `Datamap Table sheets`, column headers MUST have the first letter of each word in upper case, with the exception of the referencing label (REF). -The content of the datamap table MUST be placed in an `xlsx table` whose name starts with `datamapTable`. Each sheet MUST contain at most one such annotation table. Only cells inside this table are considered as part of the formatted metadata. +The content of the datamap table MUST be placed in an `xlsx table` whose name equals `datamapTable`. Each sheet MUST contain at most one such datamap table. Only cells inside this table are considered as part of the formatted metadata. `Datamap Table sheets` are structured with fields organized on a per-row basis. The first row MUST be used for column headers. Each body row is an implementation of a `data` node.
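
---

As an illustrative aside to the relative-path rules changed in `ARC specification.md` above: any reference written in the `folder specific pattern` can be rewritten into the `general pattern` by prefixing the assay's `dataset` sub-folder. The sketch below shows this under stated assumptions — the helper name `to_general_pattern` and the top-level-prefix heuristic are hypothetical and not part of the specification:

```python
from posixpath import join, normpath

# Top-level ARC directories whose presence at the start of a reference marks
# it as already using the `general pattern` (relative to the ARC root).
# This prefix heuristic is an assumption for illustration only.
_ARC_ROOTS = ("assays/", "studies/", "workflows/", "runs/")


def to_general_pattern(path, assay=None):
    """Rewrite a data reference from an `isa.assay.xlsx` file into the
    `general pattern`, i.e. a path relative to the ARC root.

    A reference that already starts with a top-level ARC directory is
    returned unchanged; any other reference is interpreted via the
    `folder specific pattern`, i.e. relative to the assay's `dataset`
    sub-folder.
    """
    if path.startswith(_ARC_ROOTS):
        return normpath(path)
    if assay is None:
        raise ValueError("a folder-specific reference needs its assay context")
    return normpath(join("assays", assay, "dataset", path))


# The folder-specific references from the Assay2 table above resolve to:
print(to_general_pattern("script.sh", assay="Assay2"))
# assays/Assay2/dataset/script.sh
print(to_general_pattern("assays/Assay1/dataset/measurement.txt"))
# assays/Assay1/dataset/measurement.txt (already general pattern, unchanged)
```

This also makes the note above concrete: a cross-assay reference such as `assays/Assay1/dataset/measurement.txt` already carries its full ARC-root-relative location, so the `general pattern` covers every case while the `folder specific pattern` only covers files within the same assay.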