diff --git a/.github/workflows/manage-issues.yml b/.github/workflows/manage-issues.yml new file mode 100644 index 0000000..34b85b6 --- /dev/null +++ b/.github/workflows/manage-issues.yml @@ -0,0 +1,31 @@ +name: Manage issues + +on: + issues: + types: + - opened + - reopened + +jobs: + label_issues: + runs-on: ubuntu-latest + permissions: + issues: write + steps: + - run: gh issue edit "$NUMBER" --add-label "$LABELS" + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + GH_REPO: ${{ github.repository }} + NUMBER: ${{ github.event.issue.number }} + LABELS: "Status: Needs Triage" + + add-to-project: + name: Add issue to project + runs-on: ubuntu-latest + steps: + - uses: actions/add-to-project@v1.0.1 + with: + # You can target a project in a different organization + # to the issue + project-url: https://github.com/orgs/nfdi4plants/projects/10 + github-token: ${{ secrets.ADD_TO_PROJECT_PAT }} diff --git a/ARC specification.md b/ARC specification.md index 2894393..07d5981 100644 --- a/ARC specification.md +++ b/ARC specification.md @@ -75,9 +75,9 @@ ARCs are based on a strict separation of data and metadata content into study ma Each ARC is a directory containing the following elements: - *Studies* are collections of material and resources used within the investigation. -Metadata that describe the characteristics of material and resources follow the ISA study model. Study-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a file `isa.study.xlsx`, which MUST exist to specify the input material or data resources. Resources MAY include biological materials (e.g. plant samples, analytical standards) created during the current investigation. Resources MAY further include external data (e.g., knowledge files, results files) that need to be included and cannot be referenced due to external limitations. Resources described in a study file can be the input for one or multiple assays. Further details on `isa.study.xlsx` are specified [below](#study-and-resources). 
Resource (descriptor) files MUST be placed in a `resources` subdirectory. +Metadata that describe the characteristics of material and resources follow the ISA study model. Study-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.study.xlsx` file, which MUST exist to specify the input material or data resources. Resources MAY include biological materials (e.g. plant samples, analytical standards) created during the current investigation. Resources MAY further include external data (e.g., knowledge files, results files) that need to be included and cannot be referenced due to external limitations. Resources described in a study file can be the input for one or multiple assays. Further details on `isa.study.xlsx` are specified [below](#study-and-resources). Resource (descriptor) files MUST be placed in a `resources` subdirectory. Further explications about data entities defined in the study MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.datamap.xlsx` file, which SHOULD exist for studies containing data. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). -- *Assays* correspond to outcomes of experimental assays or analytical measurements (in the interpretation of the ISA model) and are treated as immutable data. Each assay is a collection of files, together with a corresponding metadata file, stored in a subdirectory of the top-level subdirectory `assays`. Assay-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a file `isa.assay.xlsx`, which MUST exist for each assay. Further details on `isa.assay.xlsx` are specified [below](#assay-data-and-metadata). Assay data files MUST be placed in a `dataset` subdirectory. +- *Assays* correspond to outcomes of experimental assays or analytical measurements (in the interpretation of the ISA model) and are treated as immutable data. 
Each assay is a collection of files, together with a corresponding metadata file, stored in a subdirectory of the top-level subdirectory `assays`. Assay-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.assay.xlsx` file, which MUST exist for each assay. Further details on `isa.assay.xlsx` are specified [below](#assay-data-and-metadata). Assay data files MUST be placed in a `dataset` subdirectory. Further explications about data entities defined in the assay MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.datamap.xlsx` file, which SHOULD exist for each assay. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). - *Workflows* represent data analysis routines (in the sense of CWL tools and workflows) and are a collection of files, together with a corresponding CWL description, stored in a single directory under the top-level `workflows` subdirectory. A per-workflow executable CWL description is stored in `workflow.cwl`, which MUST exist for all ARC workflows. Further details on workflow descriptions are given [below](#workflow-description). @@ -101,11 +101,13 @@ Note: \--- studies \--- | isa.study.xlsx + | isa.datamap.xlsx \--- resources \--- protocol [optional / add. payload] \--- assays \--- | isa.assay.xlsx + | isa.datamap.xlsx \--- dataset \--- protocol [optional / add. payload] \--- workflows @@ -153,12 +155,16 @@ The `study` file MUST follow the [ISA-XLSX study file specification](ISA-XLSX.md Protocols that are necessary to describe the sample or material creating process can be placed under the protocols directory. +Further explications about data entities defined in the study MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.datamap.xlsx` file, which SHOULD exist for studies containing data. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). 
+ ## Assay Data and Metadata All measurement data sets are considered as assays and are considered immutable input data. Assay data MUST be placed into a unique subdirectory of the top-level `assays` subdirectory. All ISA metadata specific to a single assay MUST be annotated in the file `isa.assay.xlsx` at the root of the assay's subdirectory. This workbook MUST contain a single assay that can be organized in one or multiple worksheets. The `assay` file MUST follow the [ISA-XLSX assay file specification](ISA-XLSX.md#assay-file). +Further explications about data entities defined in the assay MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.datamap.xlsx` file, which SHOULD exist for each assay. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file). + Notes: - There are no requirements on specific assay-level metadata per formal ARC definition. Conversion of ARCs into other repository or archival formats (e.g. PRIDE, GEO, ENA) may however mandate the presence of specific terms required in the destination format. 
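The file requirements described in the ARC specification changes above can be checked mechanically. The following is a minimal Python sketch, not part of the specification: the function name `check_arc_layout` and the message format are hypothetical, and it assumes that every study and assay lives in its own subdirectory of `studies` or `assays`. A missing `isa.study.xlsx` or `isa.assay.xlsx` is reported as an error (MUST), while a missing `isa.datamap.xlsx` is only a warning (SHOULD).

```python
from pathlib import Path

# File names come from the specification text; the function name and the
# message format are assumptions made for this illustration.
REQUIRED = {"studies": "isa.study.xlsx", "assays": "isa.assay.xlsx"}
DATAMAP = "isa.datamap.xlsx"


def check_arc_layout(arc_root):
    """Return (errors, warnings) for the study and assay folders of an ARC."""
    errors, warnings = [], []
    root = Path(arc_root)
    for top_level, required_file in REQUIRED.items():
        top_dir = root / top_level
        if not top_dir.is_dir():
            continue  # nothing to check if this top-level folder is absent
        for entry in sorted(top_dir.iterdir()):
            if not entry.is_dir():
                continue
            if not (entry / required_file).is_file():
                # MUST-level requirement
                errors.append(f"{top_level}/{entry.name}: missing {required_file}")
            if not (entry / DATAMAP).is_file():
                # SHOULD-level requirement
                warnings.append(f"{top_level}/{entry.name}: missing {DATAMAP}")
    return errors, warnings
```

Keeping errors and warnings in separate lists lets a caller decide whether SHOULD-level findings fail a validation run or are merely reported.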
diff --git a/ISA-XLSX.md b/ISA-XLSX.md index edfbb8b..693525a 100644 --- a/ISA-XLSX.md +++ b/ISA-XLSX.md @@ -9,6 +9,7 @@ This document describes the ISA Abstract Model reference implementation specifie - [Investigation File](#investigation-file) - [Study File](#study-file) - [Assay File](#assay-file) +- [Datamap File](#datamap-file) - [Top-level metadata sheets](#top-level-metadata-sheets) - [Ontology Source Reference section](#ontology-source-reference-section) - [INVESTIGATION section](#investigation-section) @@ -24,7 +25,17 @@ This document describes the ISA Abstract Model reference implementation specifie - [Components](#components) - [Parameters](#parameters) - [Comments](#comments) - - [Examples](#examples) + - [Examples](#examples-1) +- [Datamap Table sheets](#datamap-table-sheets) + - [Data](#data-column) + - [Explication](#explication-column) + - [Unit](#unit-column) + - [Object Type](#object-type-column) + - [Description](#description-column) + - [Generated By](#generated-by-column) + - [Comment](#comments-1) + - [Examples](#examples-2) + Below we provide the schemas and the content rules for valid ISA-XLSX documents. @@ -124,6 +135,16 @@ Therefore, the main entities of the `Assay File` should be `Samples` and `Data`. The `Assay File` implements the [`Assay`](https://isa-specs.readthedocs.io/en/latest/isamodel.html#assay) graph from the ISA Abstract Model. +# Datamap File + +The `Datamap` represents a set of explanations about the `data` entities defined in `assays` and `studies`. + +The `Datamap File` MUST contain one [`Datamap table sheet`](#datamap-table-sheets). This sheet MUST be named `isa_datamap`. + +Therefore, the main entities of the `Datamap File` should be `Data`. + +The `Datamap File` acts as an extension of the `data` nodes defined in the [`Study and Assay graphs section`](https://isa-specs.readthedocs.io/en/latest/isamodel.html#study-and-assay-graphs) from the ISA Abstract Model. 
+ # Top-level metadata sheets The purpose of top-level metadata sheets is aggregating and listing top-level metadata. Each sheet consists of sections consisting of a section header and key-value fields. Section headers MUST be completely written in upper case (e.g. STUDY), field headers MUST have the first letter of each word in upper case (e.g. Study Identifier); with the exception of the referencing label (REF). @@ -608,6 +629,8 @@ For example, the `ASSAY PERFORMERS` section of an ISA-XLSX `isa.assay.xlsx` file # Annotation Table sheets +`Annotation Table sheets` are used to describe the experimental flow in a detailed, machine-readable way. In each sheet, there is a mapping from input entities to output entities, placed in the `Input` and `Output` columns, respectively. The other columns are then used to describe either those entities or the processes that led to this mapping. + In the `Annotation Table sheets`, column headers MUST have the first letter of each word in upper case, with the exception of the referencing label (REF). The content of the annotation table MUST be placed in an `xlsx table` whose name starts with `annotationTable`. Each sheet MUST contain at most one such annotation table. Only cells inside this table are considered as part of the formatted metadata. @@ -760,4 +783,85 @@ If we pool two sources into a single sample, we might represent this as: | Input [Source Name] | Protocol REF | Output [Sample Name] | |---------------|-------------------|---------------| | source1 | sample collection | sample1 | -| source2 | sample collection | sample1 | \ No newline at end of file +| source2 | sample collection | sample1 | + +# Datamap Table sheets + +`Datamap Table sheets` are used to describe the contents of data files. + +In the `Datamap Table sheets`, column headers MUST have the first letter of each word in upper case, with the exception of the referencing label (REF). 
+ +The content of the datamap table MUST be placed in an `xlsx table` whose name starts with `datamapTable`. Each sheet MUST contain at most one such datamap table. Only cells inside this table are considered as part of the formatted metadata. + +`Datamap Table sheets` are structured with fields organized on a per-row basis. The first row MUST be used for column headers. Each body row is an implementation of a `data` node. + +## Data column + +Every `Datamap Table sheet` MUST contain a `Data` column. Every object in this column MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. If the annotation of the `Data` node refers not to the complete resource but to a part of it, a `Selector` MAY be added. This Selector MUST be separated from the resource location using a `#`, with no whitespace between: `location#selector`. If appropriate, the Selector SHOULD be formatted according to IRI fragment selectors specified by [W3C](https://www.w3.org/TR/annotation-model/#fragment-selector). + +The format of the data resource MAY be further qualified using a `Data Format` column. The `Data Format` SHOULD be expressed using a [MIME format](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), most commonly consisting of two parts: a type and a subtype, separated by a slash (/), with no whitespace between: `type/subtype`. If appropriate, a format from the list compiled by [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml) +SHOULD be picked. Unregistered or niche encoding and file formats MAY be indicated instead via the most appropriate URL. + +The format and usage of the Selector MAY be further qualified using a `Data Selector Format` column. The `Data Selector Format` SHOULD point to a web resource containing instructions about how the Selector is formatted and how it should be interpreted. 
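The `location#selector` and `type/subtype` conventions described for the `Data` and `Data Format` columns can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function names `split_data_reference` and `looks_like_mime_type` are hypothetical, the split is assumed to happen at the first `#`, and the regular expression is a loose shape check rather than a full media-type grammar.

```python
import re

# Loose "type/subtype" shape check (assumption: not a full media-type grammar).
MIME_PATTERN = re.compile(r"^[A-Za-z0-9][\w.+-]*/[A-Za-z0-9][\w.+-]*$")


def split_data_reference(value):
    """Split a `Data` cell into (location, selector); selector is None if absent."""
    location, separator, selector = value.partition("#")
    if not separator:
        return value, None
    if location != location.rstrip() or selector != selector.lstrip():
        # The text requires the separator to be used with no whitespace between.
        raise ValueError("no whitespace is allowed around '#'")
    return location, selector


def looks_like_mime_type(value):
    """Return True if the value has the two-part type/subtype shape."""
    return MIME_PATTERN.match(value) is not None
```

For instance, `split_data_reference("MyData.csv#col=2")` separates the file location from the column selector, and a `Data Format` value such as `text/csv` passes the shape check while `text / csv` does not.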
+ +## Explication column + +Every `Datamap Table sheet` SHOULD contain an `Explication` column. The `Explication` adds explicit meaning to the data node. The value MUST be free text, or an [`Ontology Annotation`](#ontology-annotations). + +| Explication | Term Source REF | Term Accession Number | +|------------------------|-------------------|-------------------------| +| average value | OBI | [http://…/obo/OBI_0000679](http://purl.obolibrary.org/obo/OBI_0000679) | + +## Unit column + +Every `Datamap Table sheet` SHOULD contain a `Unit` column. The `Unit` adds a unit of measurement to the data node. The value MUST be free text, or an [`Ontology Annotation`](#ontology-annotations). + +| Unit | Term Source REF | Term Accession Number | +|------------------------|-------------------|-------------------------| +| milligram per milliliter | UO | [http://…/obo/UO_0000176](http://purl.obolibrary.org/obo/UO_0000176) | + +## Object Type column + +Every `Datamap Table sheet` SHOULD contain an `Object Type` column. The `Object Type` defines the shape or format in which the data node is represented. The value MUST be free text, or an [`Ontology Annotation`](#ontology-annotations). + +| Object Type | Term Source REF | Term Accession Number | +|------------------------|-------------------|-------------------------| +| Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) | + +## Description column + +Every `Datamap Table sheet` SHOULD contain a `Description` column. The `Description` gives additional, human-readable context about the data node. The value MUST be free text. + +| Description | +|------------------------| +| The average protein concentration for the given gene | + +## Generated By column + +Every `Datamap Table sheet` SHOULD contain a `Generated By` column. The `Generated By` names the tool which led to the creation of the data node. The value MUST be free text. 
+ +If possible, the value in this column MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. + +| Generated By | +|------------------------| +| GeneStatisticsTool.exe | + +## Comments + +A `Comment` can be used to provide some additional information. Columns headed with `Comment[]` MAY appear anywhere in the Datamap Table. The comment always refers to the Datamap Table. The value MUST be free text. + +| Comment [Answer to everything] | +|--------------------------------| +| forty-two | + +## Examples + +For example, a simple `datamap` table representing a tabular datafile might look as follows: + +| Data | Explication | Term Source REF | Term Accession Number | Unit | Term Source REF | Term Accession Number | Object Type | Term Source REF | Term Accession Number | Description | Generated By | +|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------|---------------| +| MyData.csv#col=1 | Gene Identifier | NCIT | [http://…/obo/NCIT_C48664](http://purl.obolibrary.org/obo/NCIT_C48664) | | | | String | NCIT | [http://…/obo/NCIT_C45253](http://purl.obolibrary.org/obo/NCIT_C45253) | Shorthand identifier of the gene coding for the protein. 
| GeneStatisticsTool.exe | +| MyData.csv#col=2 | average value | OBI | [http://…/obo/OBI_0000679](http://purl.obolibrary.org/obo/OBI_0000679) | milligram per milliliter | UO | [http://…/obo/UO_0000176](http://purl.obolibrary.org/obo/UO_0000176) | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) | The average protein concentration for the given gene | GeneStatisticsTool.exe | +| MyData.csv#col=3 | p-value | OBI | [http://…/obo/OBI_0000175](http://purl.obolibrary.org/obo/OBI_0000175) | | | | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) | p-value of t-test against control. | GeneStatisticsTool.exe | + +In this example, the `datamap` table describes a single data file named `MyData.csv`. This file contains three columns. The first column contains gene identifiers; the other two contain results of a statistical analysis performed by the tool `GeneStatisticsTool.exe`. \ No newline at end of file
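To show how a consumer might use the example `datamap` table, the following Python sketch groups its rows by file location so that each column selector maps to its annotation. It is a sketch under assumptions, not a reference implementation: the rows are hard-coded dictionaries rather than read from an `isa_datamap` sheet, the ontology columns are omitted for brevity, and the name `group_by_file` is hypothetical.

```python
from collections import defaultdict

# Rows mirror the example table above; the dict layout is an assumption
# for illustration, and the Term Source / Accession columns are left out.
EXAMPLE_ROWS = [
    {"Data": "MyData.csv#col=1", "Explication": "Gene Identifier",
     "Unit": "", "Object Type": "String"},
    {"Data": "MyData.csv#col=2", "Explication": "average value",
     "Unit": "milligram per milliliter", "Object Type": "Float"},
    {"Data": "MyData.csv#col=3", "Explication": "p-value",
     "Unit": "", "Object Type": "Float"},
]


def group_by_file(rows):
    """Map each file location to {selector: annotation} for its data nodes."""
    grouped = defaultdict(dict)
    for row in rows:
        location, _, selector = row["Data"].partition("#")
        # Keep only the non-empty annotation cells of this row.
        annotation = {key: value for key, value in row.items()
                      if key != "Data" and value}
        grouped[location][selector] = annotation
    return dict(grouped)


datamap = group_by_file(EXAMPLE_ROWS)
for selector, annotation in datamap["MyData.csv"].items():
    print(selector, annotation)
```

Grouping by location reflects that several `data` nodes can describe parts of the same file, as the three `MyData.csv#col=…` rows do here.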