diff --git a/.github/workflows/manage-issues.yml b/.github/workflows/manage-issues.yml
new file mode 100644
index 0000000..34b85b6
--- /dev/null
+++ b/.github/workflows/manage-issues.yml
@@ -0,0 +1,31 @@
+name: Manage issues
+
+on:
+  issues:
+    types:
+      - opened
+      - reopened
+
+jobs:
+  label_issues:
+    runs-on: ubuntu-latest
+    permissions:
+      issues: write
+    steps:
+      - run: gh issue edit "$NUMBER" --add-label "$LABELS"
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          GH_REPO: ${{ github.repository }}
+          NUMBER: ${{ github.event.issue.number }}
+          LABELS: "Status: Needs Triage"
+
+  add-to-project:
+    name: Add issue to project
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/add-to-project@v1.0.1
+        with:
+          # You can target a project in a different organization
+          # to the issue
+          project-url: https://github.com/orgs/nfdi4plants/projects/10
+          github-token: ${{ secrets.ADD_TO_PROJECT_PAT }}
diff --git a/ARC specification.md b/ARC specification.md
index b5517a7..2c822b6 100644
--- a/ARC specification.md
+++ b/ARC specification.md
@@ -8,34 +8,47 @@ This specification is Copyright 2022 by [DataPLANT](https://nfdi4plants.de).

 Licensed under the Creative Commons License CC BY, Version 4.0; you may not use this file except in compliance with the License. You may obtain a copy of the License at https://creativecommons.org/about/cclicenses/. This license allows re-users to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. Credit must be given to the creator.

-## Table of Contents
+# Table of Contents

 - [Annotated Research Context Specification, v1.2](#annotated-research-context-specification-v12)
-  - [Introduction](#introduction)
-    - [Extensions](#extensions)
-  - [ARC Structure and Content](#arc-structure-and-content)
-    - [High-Level Schema](#high-level-schema)
-    - [Example ARC structure](#example-arc-structure)
-    - [ARC Representation](#arc-representation)
-    - [ISA-XLSX Format](#isa-xlsx-format)
-    - [Study and Resources](#study-and-resources)
-    - [Assay Data and Metadata](#assay-data-and-metadata)
-    - [Workflow Description](#workflow-description)
-    - [Run Description](#run-description)
-    - [Additional Payload](#additional-payload)
-    - [Top-level Metadata and Workflow Description](#top-level-metadata-and-workflow-description)
-      - [Investigation and Study Metadata](#investigation-and-study-metadata)
-      - [Top-Level Run Description](#top-level-run-description)
-  - [Shareable and Publishable ARCs](#shareable-and-publishable-arcs)
-    - [Reproducible ARCs](#reproducible-arcs)
-  - [Mechanism for Quality Control of ARCs](#mechanism-for-quality-control-of-arcs)
-  - [Best Practices](#best-practices)
-    - [Community Specific Data Formats](#community-specific-data-formats)
-    - [Compression and Encryption](#compression-and-encryption)
-    - [Directory and File Naming Conventions](#directory-and-file-naming-conventions)
-  - [Appendix: Conversion of ARCs to RO Crates](#appendix-conversion-of-arcs-to-ro-crates)
-
-## Introduction
+- [Table of Contents](#table-of-contents)
+- [Introduction](#introduction)
+  - [Extensions](#extensions)
+- [ARC Structure and Content](#arc-structure-and-content)
+  - [High-Level Schema](#high-level-schema)
+  - [Example ARC structure](#example-arc-structure)
+  - [ARC Representation](#arc-representation)
+  - [ISA-XLSX Format](#isa-xlsx-format)
+  - [Study and Resources](#study-and-resources)
+  - [Assay Data and Metadata](#assay-data-and-metadata)
+  - [Workflow Description](#workflow-description)
+  - [Run Description](#run-description)
+  - [Additional Payload](#additional-payload)
+  - [Top-level Metadata and Workflow Description](#top-level-metadata-and-workflow-description)
+    - [Investigation and Study Metadata](#investigation-and-study-metadata)
+    - [Top-Level Run Description](#top-level-run-description)
+  - [Data Path Annotation](#data-path-annotation)
+    - [Examples](#examples)
+      - [General Pattern](#general-pattern)
+      - [Folder Specific Pattern](#folder-specific-pattern)
+- [Shareable and Publishable ARCs](#shareable-and-publishable-arcs)
+- [Reproducible ARCs](#reproducible-arcs)
+- [Mechanisms for ARC Quality Control](#mechanisms-for-arc-quality-control)
+  - [Validation](#validation)
+    - [Validation cases](#validation-cases)
+    - [Validation packages](#validation-packages)
+    - [Reference implementation](#reference-implementation)
+  - [Continuous quality control](#continuous-quality-control)
+    - [The cqc branch](#the-cqc-branch)
+    - [The validation\_packages.yml file](#the-validation_packagesyml-file)
+    - [ARC Apps](#arc-apps)
+    - [Reference implementation](#reference-implementation-1)
+- [Best Practices](#best-practices)
+  - [Community Specific Data Formats](#community-specific-data-formats)
+  - [Compression and Encryption](#compression-and-encryption)
+  - [Directory and File Naming Conventions](#directory-and-file-naming-conventions)
+- [Appendix: Conversion of ARCs to RO Crates](#appendix-conversion-of-arcs-to-ro-crates)
+
+# Introduction

 This document describes a specification for a standardized way of creating a working environment and packaging file-based research data and necessary additional contextual information for working, collaboration, preservation, reproduction, re-use, and archiving as well as distribution. This organization unit is named *Annotated Research Context* (ARC) and is designed to be both human and machine actionable.

@@ -49,30 +62,29 @@ This specification is intended as a practical guide for software authors to crea

 The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in [RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119).

 This specification is based on the [ISA model](https://isa-specs.readthedocs.io/en/latest/isamodel.html) and the [Common Workflow Specification (v1.2)](https://www.commonwl.org/v1.2/).

-### Extensions
+## Extensions

 The ARC specification can be extended in a backwards-compatible way and will evolve over time. This is accomplished through a community-driven ARC discussion forum and pull request mechanisms. All changes that are not backwards compatible with the current ARC specification will be implemented in ARC specification v2.0.

-## ARC Structure and Content
+# ARC Structure and Content

 ARCs are based on a strict separation of data and metadata content into study material (*studies*), measurement and assay outcomes (*assays*), computation results (*runs*), and computational workflows (*workflows*) generating the latter. The scope or granularity of an ARC aligns with the necessities of individual projects or large experimental setups.

-### High-Level Schema
+## High-Level Schema

 Each ARC is a directory containing the following elements:

-- *Studies* are collections of material and resources used within the investigation.
-Metadata that describe the characteristics of material and resources follow the ISA study model. Study-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a file `isa.study.xlsx`, which MUST exist to specify the input material or data resources. Resources MAY include biological materials (e.g. plant samples, analytical standards) created during the current investigation. Resources MAY further include external data (e.g., knowledge files, results files) that need to be included and cannot be referenced due to external limitations. Resources described in a study file can be the input for one or multiple assays. Further details on `isa.study.xlsx` are specified [below](#study-and-resources). Resource (descriptor) files MUST be placed in a `resources` subdirectory.
+- *Studies* are collections of material and resources used within the investigation. Study-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.study.xlsx` file, which MUST exist to specify the input material or data resources. Resources MAY include biological materials (e.g. plant samples, analytical standards) created during the current investigation. Resources MAY further include external data (e.g., knowledge files, results files) that need to be included and cannot be referenced due to external limitations. Resources described in a study file can be the input for one or multiple assays. Further details on `isa.study.xlsx` are specified [below](#study-and-resources). Resource (descriptor) files MUST be placed in a `resources` subdirectory. Further explications about data entities defined in the study are stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.datamap.xlsx` file, which SHOULD exist for studies containing data. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file).

-- *Assays* correspond to outcomes of experimental assays or analytical measurements (in the interpretation of the ISA model) and are treated as immutable data. Each assay is a collection of files, together with a corresponding metadata file, stored in a subdirectory of the top-level subdirectory `assays`. Assay-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in a file `isa.assay.xlsx`, which MUST exist for each assay. Further details on `isa.assay.xlsx` are specified [below](#assay-data-and-metadata). Assay data files MUST be placed in a `dataset` subdirectory.
+- *Assays* correspond to outcomes of experimental assays or analytical measurements (in the interpretation of the ISA model) and are treated as immutable data. Each assay is a collection of files, together with a corresponding metadata file, stored in a subdirectory of the top-level subdirectory `assays`. Assay-level metadata is stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.assay.xlsx` file, which MUST exist for each assay. Further details on `isa.assay.xlsx` are specified [below](#assay-data-and-metadata). Assay data files MUST be placed in a `dataset` subdirectory. Further explications about data entities defined in the assay are stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.datamap.xlsx` file, which SHOULD exist for each assay. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file).

 - *Workflows* represent data analysis routines (in the sense of CWL tools and workflows) and are a collection of files, together with a corresponding CWL description, stored in a single directory under the top-level `workflows` subdirectory.
 A per-workflow executable CWL description is stored in `workflow.cwl`, which MUST exist for all ARC workflows. Further details on workflow descriptions are given [below](#workflow-description).

 - *Runs* capture data products (i.e., outputs of computational analyses) derived from assays, other runs, or study materials using workflows (located in the aforementioned *workflows* subdirectory). Each run is a collection of files, stored in the top-level `runs` subdirectory. It MUST be accompanied by a per-run CWL workflow description, stored in `run.cwl` as further described [below](#run-description).

-- *Top-level metadata and workflow description* tie together the elements of an ARC in the contexts of investigation and associated studies (in the ISA definition), captured in the file `isa.investigation.xlsx` in [ISA-XLSX format](#isa-xlsx-format), which MUST be present. Furthermore, top-level reproducibility information MUST be provided in the CWL `arc.cwl`, which also MUST exist.
+- *Top-level metadata and workflow description* tie together the elements of an ARC in the contexts of investigation and associated studies (in the ISA definition), captured in the file `isa.investigation.xlsx` in [ISA-XLSX format](#isa-xlsx-format), which MUST be present. Furthermore, top-level reproducibility information SHOULD be provided in the CWL `arc.cwl`.

 All other files contained in an ARC (e.g., a `README.txt`, pre-print PDFs, additional annotation files) are referred to as *additional payload*, and MAY be located anywhere within the ARC structure. However, an ARC MUST be [reproducible](#reproducible-arcs) and [publishable](#shareable-and-publishable-arcs) even if these files are deleted. Further considerations on additional payload are described [below](#additional-payload).

@@ -80,7 +92,7 @@ Note:

 - Subdirectories and other files in the top-level `studies`, `assays`, `workflows`, and `runs` directories are viewed as additional payload unless they are accompanied by the corresponding mandatory description (`isa.study.xlsx`, `isa.assay.xlsx`, `workflow.cwl`, `run.cwl`) specified below. This is intended to allow gradual migration from existing data storage schemes to the ARC schema. For example, *data files* for an assay may be stored in a subdirectory of `assays/`, but are only identified as an assay of the ARC if metadata is present and complete, including a reference from top-level metadata.

-### Example ARC structure
+## Example ARC structure

 ```
@@ -90,11 +102,13 @@
 \--- studies
     \---
             | isa.study.xlsx
+            | isa.datamap.xlsx [optional]
             \--- resources
             \--- protocol [optional / add. payload]
 \--- assays
     \---
             | isa.assay.xlsx
+            | isa.datamap.xlsx [optional]
             \--- dataset
             \--- protocol [optional / add. payload]
 \--- workflows
@@ -108,7 +122,7 @@
             | run.yml [optional]
 ```

-### ARC Representation
+## ARC Representation

 ARCs are Git repositories, as defined and supported by the [Git C implementation](https://git-scm.org) (version 2.26 or newer) with the [Git-LFS extension](https://git-lfs.github.com) (version 2.12.0), or fully compatible implementations.

@@ -126,13 +140,13 @@ Notes:

 - Removing the `.git` top-level subdirectory (and thereby all provenance information captured within the Git history) from a working copy invalidates an ARC.

-### ISA-XLSX Format
+## ISA-XLSX Format

 The ISA-XLSX specification is currently part of the ARC specification. Its version therefore follows the version of the ARC specification.
 https://github.com/nfdi4plants/ARC-specfication/blob/main/ISA-XLSX.md

-### Study and Resources
+## Study and Resources

 The characteristics of all material and resources used within the investigation must be specified in a study. Studies must be placed into a unique subdirectory of the top-level `studies` subdirectory. All ISA metadata specific to a single study MUST be annotated in the file `isa.study.xlsx` at the root of the study's subdirectory. This workbook MUST contain a single resources description that can be organized in one or multiple worksheets.

@@ -142,12 +156,16 @@ The `study` file MUST follow the [ISA-XLSX study file specification](ISA-XLSX.md

 Protocols that are necessary to describe the sample or material creation process can be placed under the protocols directory.

-### Assay Data and Metadata
+Further explications about data entities defined in the study MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.datamap.xlsx` file, which SHOULD exist for studies containing data. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file).
+
+## Assay Data and Metadata

 All measurement data sets are considered as assays and are considered immutable input data. Assay data MUST be placed into a unique subdirectory of the top-level `assays` subdirectory. All ISA metadata specific to a single assay MUST be annotated in the file `isa.assay.xlsx` at the root of the assay's subdirectory. This workbook MUST contain a single assay that can be organized in one or multiple worksheets.

 The `assay` file MUST follow the [ISA-XLSX assay file specification](ISA-XLSX.md#assay-file).

+Further explications about data entities defined in the assay MAY be stored in [ISA-XLSX](#isa-xlsx-format) format in an `isa.datamap.xlsx` file, which SHOULD exist for each assay. Further details on `isa.datamap.xlsx` are specified [in the isa-xlsx specification](ISA-XLSX.md#datamap-file).
+
 Notes:

 - There are no requirements on specific assay-level metadata per formal ARC definition. Conversion of ARCs into other repository or archival formats (e.g. PRIDE, GEO, ENA) may however mandate the presence of specific terms required in the destination format.

@@ -162,7 +180,7 @@ Notes:

 - While assays MAY in principle contain arbitrary data formats, it is highly RECOMMENDED to use community-supported, open formats (see [Best Practices](#best-practices)).

-### Workflow Description
+## Workflow Description

 *Workflows* in ARCs are computational steps that are used in the computational analysis of an ARC's assays and other data transformations to generate a [run result](#run-description). Typical examples include data cleaning and preprocessing, computational analysis, or visualization. Workflows are used and combined to generate [run results](#run-description), and allow reuse of processing steps across multiple [run results](#run-description).

@@ -184,7 +202,7 @@ Notes:

 - It is strongly encouraged to include author and contributor metadata in tool descriptions and workflow descriptions as [CWL metadata](https://www.commonwl.org/user_guide/17-metadata/index.html).

-### Run Description
+## Run Description

 **Runs** in an ARC represent all artefacts that result from some computation on the data within the ARC, i.e. [assays](#assay-data-and-metadata) and [external data](#external-data). These results (e.g. plots, tables, data files, etc.) MUST reside inside one or more subdirectories of the top-level `runs` directory.
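+For illustration, a minimal per-run description could look like the following sketch. It reuses the `measurement.txt` example from the [Data Path Annotation](#data-path-annotation) section and assumes a hypothetical workflow `workflows/MyWorkflow/workflow.cwl` exposing one `input` port and one `out` port; all names are illustrative, not normative:
+
+```yaml
+# runs/MyRun/run.cwl -- illustrative sketch only
+cwlVersion: v1.2
+class: Workflow
+
+inputs:
+  measurement:
+    type: File
+    default:
+      class: File
+      # relative path into an assay dataset, as required by this specification
+      location: ../../assays/Assay1/dataset/measurement.txt
+
+steps:
+  analysis:
+    # reuse a workflow description stored under the top-level workflows directory
+    run: ../../workflows/MyWorkflow/workflow.cwl
+    in:
+      input: measurement
+    out: [out]
+
+outputs:
+  result:
+    type: File
+    outputSource: analysis/out
+```
+
+A CWL v1.2 runner such as `cwltool` can then re-execute the run from this description; an accompanying `run.yml` MAY override the default input parameters.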
@@ -202,7 +220,7 @@ Notes:

 - It is strongly encouraged to include author and contributor metadata in run descriptions as [CWL metadata](https://www.commonwl.org/user_guide/17-metadata/index.html).

-### Additional Payload
+## Additional Payload

 ARCs can include additional payload according to user requirements, e.g. presentations, reading material, or manuscripts. While these files can be placed anywhere in the ARC, it is strongly advised to organize them in additional subdirectories. For protocols especially, it is RECOMMENDED to store them (assay SOPs) in text form with the corresponding assay in /assays//protocol/.

@@ -211,28 +229,78 @@ Note:

 - All data missing proper annotation (e.g. studies, assays, workflows or runs) is considered additional payload independent of its location within the ARC.

-### Top-level Metadata and Workflow Description
+## Top-level Metadata and Workflow Description

 *Top-level metadata and workflow description* tie together the elements of an ARC in the contexts of an investigation captured in the `isa.investigation.xlsx` file, which MUST be present. The `investigation` file MUST follow the [ISA-XLSX investigation file specification](ISA-XLSX.md#investigation-file).

-Furthermore, top-level reproducibility information MUST be provided in the CWL `arc.cwl`, which also MUST exist.
+Furthermore, top-level reproducibility information SHOULD be provided in the CWL `arc.cwl`.

-#### Investigation and Study Metadata
+### Investigation and Study Metadata

 The ARC root directory is identifiable by the presence of the `isa.investigation.xlsx` file in XLSX format. It contains top-level information about the investigation and MUST link all assays and studies within an ARC. Study and assay objects are registered and grouped with an investigation to record other metadata within the relevant contexts.

-#### Top-Level Run Description
+### Top-Level Run Description

-The file `arc.cwl` MUST exist at the root directory of each ARC. It describes which runs are executed (and specifically, their order) to (re)produce the computational outputs contained within the ARC.
+The file `arc.cwl` SHOULD exist at the root directory of each ARC. It describes which runs are executed (and specifically, their order) to (re)produce the computational outputs contained within the ARC.

 `arc.cwl` MUST be a CWL v1.2 workflow description and adhere to the same requirements as [run descriptions](#run-description). In particular, references to study or assay data files and nested workflows MUST use relative paths. An optional file `arc.yml` MAY be provided to specify input parameters.

-## Shareable and Publishable ARCs
+## Data Path Annotation
+
+All metadata references to files or directories located inside the ARC MUST follow one of the following patterns:
+
+- The `general pattern`, which is universally applicable and SHOULD be used to specify the path relative to the ARC root.
+
+- The `folder specific pattern`, which MAY be used only in specific metadata contexts:
+  - Data nodes in `isa.assay.xlsx` files: The path MAY be specified relative to the `dataset` sub-folder of the assay.
+  - Data nodes in `isa.study.xlsx` files: The path MAY be specified relative to the `resources` sub-folder of the study.
+
+### Examples
+
+#### General Pattern
+
+In this example, there are two `assays`, with `Assay1` containing a measurement of a `Source` material, producing an output `Data`. `Assay2` references this `Data` to produce a new `Data`.
+
+Use of `general pattern` relative paths from the ARC root folder:
+
+`assays/Assay1/isa.assay.xlsx`:
+
+| Input [Source Name] | Component [Instrument model] | Output [Data]                         |
+|---------------------|------------------------------|---------------------------------------|
+| input               | Bruker 500 Avance            | assays/Assay1/dataset/measurement.txt |
+
+`assays/Assay2/isa.assay.xlsx`:
+
+| Input [Data]                          | Component [script file]         | Output [Data]                    |
+|---------------------------------------|---------------------------------|----------------------------------|
+| assays/Assay1/dataset/measurement.txt | assays/Assay2/dataset/script.sh | assays/Assay2/dataset/result.txt |
+
+#### Folder Specific Pattern
+
+In this example, there are two `assays`, with `Assay1` containing a measurement of a `Source` material, producing an output `Data`. `Assay2` references this `Data` to produce a new `Data`.
+
+Use of `folder specific pattern` relative paths from the `Assay1` and `Assay2` `dataset` folders, respectively:
+
+`assays/Assay1/isa.assay.xlsx`:
+
+| Input [Source Name] | Component [Instrument model] | Output [Data]   |
+|---------------------|------------------------------|-----------------|
+| input               | Bruker 500 Avance            | measurement.txt |
+
+`assays/Assay2/isa.assay.xlsx`:
+
+| Input [Data]                          | Component [script file] | Output [Data] |
+|---------------------------------------|-------------------------|---------------|
+| assays/Assay1/dataset/measurement.txt | script.sh               | result.txt    |
+
+Note that to reference `Data` which is part of `Assay1` in `Assay2`, the `general pattern` is necessary either way. Therefore, it is considered the more broadly applicable and recommended pattern.
+
+# Shareable and Publishable ARCs

 ARCs can be shared in any state. They are considered *publishable* (e.g. for the purpose of minting a DOI) when fulfilling the following conditions:

@@ -260,21 +328,308 @@ Notes:

 - Minimal administrative metadata ensure compliance with DataCite for DOI creation

-### Reproducible ARCs
+# Reproducible ARCs

 Reproducibility of ARCs refers mainly to their *runs*. Within an ARC, it MUST be possible to reproduce the *run* data. Therefore, necessary software MUST be available in *workflows*. In the case of non-deterministic software, the run results should represent typical examples.

-## Mechanism for Quality Control of ARCs
+# Mechanisms for ARC Quality Control
+
+ARCs are supposed to be living research objects and are as such never complete.
+Nevertheless, a mechanism to continuously report the current state and quality of an ARC is indispensable.
+
+## Validation
+
+The process of assessing quality parameters of an ARC is further referred to as _validation_ of the ARC against a [_validation package_](#validation-packages), where the _validation package_ is an arbitrary set of [validation cases](#validation-cases) that the ARC MUST pass to qualify as _valid_ in regard to the _validation package_.
+
+### Validation cases
+
+A **validation case** is the atomic unit of a [validation package](#validation-packages), describing a single, deterministic and reproducible requirement that the ARC MUST satisfy in order to qualify as _valid_ in regard to it.
+
+Format and scope of these cases naturally vary depending on the type of ARC, the aim of the containing validation package, and the tools used for creating and performing the validation.
+Therefore, no further requirements are made on the format of validation cases.
+
+example:
+
+> The following example shows a validation case simply defined using natural language.
+
+```
+All Sample names in this ARC must be prefixed with the string "Sample_"
+```
+
+Any ARC where all sample names are prefixed with the string "Sample_" would be considered valid in regard to this validation case.
+
+### Validation packages
+
+A **validation package** bundles a collection of [validation cases](#validation-cases) that the ARC MUST pass to qualify as _valid_ in regard to the _validation package_, together with instructions on how to perform the validation and summarize the results.
+
+Validation packages
+
+- MUST be executable.
+  This can for example be achieved by implementing them in a programming language, a shell script, or a workflow language.
+
+- MUST validate an ARC against all contained validation cases upon execution.
+
+- MUST have a globally unique name.
+  This will eventually be enforced by a central validation package registry.
+
+- SHOULD be versioned using [semantic versioning](https://semver.org/).
+
+- MUST be enriched with the following mandatory metadata in an appropriate way (e.g. via yaml frontmatter, tables in a database, etc.):
+
+  | Field | Type | Description |
+  | --- | --- | --- |
+  | Name | string | the name of the package |
+  | Version | string | the version of the package |
+  | Summary | string | a single-sentence description (<=50 words) of the package |
+  | Description | string | an unconstrained free text description of the package |
+
+- MAY be enriched with the following optional metadata in an appropriate way (e.g. via yaml frontmatter, tables in a database, etc.):
+
+  | Field | Type | Description |
+  | --- | --- | --- |
+  | HookEndpoint | string | A URL to trigger subsequent events based on the result of executing the validation package in a CQC context, see [Continuous quality control](#continuous-quality-control) and [ARC Apps](#arc-apps) |
+
+- MAY be enriched with any additional metadata in an appropriate way (e.g. via yaml frontmatter, tables in a database, etc.).
+
+- MUST create a `validation_report.*` file upon execution that summarizes the results of validating the ARC against the cases defined in the validation package.
+  The format of this file SHOULD be an established test result format such as [JUnit XML](https://github.com/windyroad/JUnit-Schema) or [TAP](https://testanything.org/).
+
+- MUST create a `badge.svg` file upon execution that visually summarizes the results of validating the ARC against the validation cases defined in the validation package.
+  The information displayed SHOULD be derivable from the `validation_report.*` file and MUST include the _Name_ of the validation package.
+
+- MUST create a `validation_summary.json` file upon execution, which contains the mandatory and optional metadata specified above, and a high-level summary of the execution of the validation package following this schema:
+
+  <details>
+  <summary>validation_summary.json schema</summary>
+
+  ```json
+  {
+    "$schema": "http://json-schema.org/draft-04/schema#",
+    "type": "object",
+    "properties": {
+      "Critical": {
+        "type": "object",
+        "properties": {
+          "HasFailures": {
+            "type": "boolean"
+          },
+          "Total": {
+            "type": "integer"
+          },
+          "Passed": {
+            "type": "integer"
+          },
+          "Failed": {
+            "type": "integer"
+          },
+          "Errored": {
+            "type": "integer"
+          }
+        },
+        "required": [
+          "HasFailures",
+          "Total",
+          "Passed",
+          "Failed",
+          "Errored"
+        ]
+      },
+      "NonCritical": {
+        "type": "object",
+        "properties": {
+          "HasFailures": {
+            "type": "boolean"
+          },
+          "Total": {
+            "type": "integer"
+          },
+          "Passed": {
+            "type": "integer"
+          },
+          "Failed": {
+            "type": "integer"
+          },
+          "Errored": {
+            "type": "integer"
+          }
+        },
+        "required": [
+          "HasFailures",
+          "Total",
+          "Passed",
+          "Failed",
+          "Errored"
+        ]
+      },
+      "ValidationPackage": {
+        "type": "object",
+        "properties": {
+          "Name": {
+            "type": "string"
+          },
+          "Version": {
+            "type": "string"
+          },
+          "Summary": {
+            "type": "string"
+          },
+          "Description": {
+            "type": "string"
+          },
+          "HookEndpoint": {
+            "type": "string"
+          }
+        },
+        "required": [
+          "Name",
+          "Version",
+          "Summary",
+          "Description"
+        ]
+      }
+    },
+    "required": [
+      "Critical",
+      "NonCritical",
+      "ValidationPackage"
+    ]
+  }
+  ```
+
+  </details>
+
+- SHOULD aggregate the result files in an appropriately named subdirectory.
+
+### Reference implementation
+
+A reference implementation for creating validation cases, validation packages, and validating ARCs against them is provided in the [arc-validate software suite](https://github.com/nfdi4plants/arc-validate).
+
+## Continuous quality control
+
+In addition to manually validating ARCs against validation packages, ARCs MAY be continuously validated against validation packages using a continuous integration (CI) system.
+This process is further referred to as _Continuous Quality Control_ (CQC) of the ARC. CQC can be triggered by any event that is supported by the CI system, e.g. a push to a branch of the ARC repository or a pull request.
+
+### The cqc branch
+
+To make sure that validation results are bundled with ARCs but do not pollute their commit history, validation results MUST be stored in a separate branch of the ARC repository.
+This branch:
+
+- MUST be named `cqc`.
+- MUST be an [orphan branch](https://git-scm.com/docs/git-checkout#Documentation/git-checkout.txt---orphanltnew-branchgt).
+- MUST NOT be merged into any other branch.
+- MUST contain the following folder structure:
+
+  `{$branch}/{$package}`:
+
+  ```
+  cqc branch root
+  └── {$branch}
+      └── {$package}
+  ```
+
+  where:
+  - `{$branch}` is the name of the branch the validation was run on
+  - `{$package}` is the name of the validation package the validation was run against.
+    This folder then MUST contain the files `validation_report.*` and `badge.svg` as described in the [validation package specification](#validation-packages).
+    This folder MAY also be suffixed by the version of the validation package via a `@` character followed by the version number of the validation package: `{$package}@{$version}`, e.g. `package1@1.0.0`.
+
+  example:
+
+  > This example shows the validation results of the `main` and `branch-1` branches of the ARC repository against the `package1` and `package2` validation packages. For `package2`, an optional version hint of the package is included in the folder name:
+
+  ```
+  cqc-branch-root
+  ├── branch-1
+  │   ├── package1
+  │   │   ├── badge.svg
+  │   │   └── validation_report.xml
+  │   └── package2@2.0.0
+  │       ├── badge.svg
+  │       └── validation_report.xml
+  └── main
+      ├── package1
+      │   ├── badge.svg
+      │   └── validation_report.xml
+      └── package2@2.0.0
+          ├── badge.svg
+          └── validation_report.xml
+  ```
+
+Commits to the `cqc` branch MUST contain the commit hash of the validated commit in the commit message.
+
+### The validation_packages.yml file
+
+The `validation_packages.yml` file specifies the validation packages that the branch containing the file will be validated against.
+Each branch of an ARC MAY contain zero or one `validation_packages.yml` file.
+If the file is present, it:
+
+- MAY contain an `arc_specification` key which, when present, MUST contain the version of the ARC specification that the ARC should be validated against. Schema specification should be tied to specification releases, and be directly integrated into tools that can perform validation against validation packages.
+- MUST be located in the `.arc` folder in the root of the ARC.
+- MUST contain the `validation_packages` key, which is a list of validation packages that the current branch will be validated against.
+
+  Values of the `validation_packages` list are objects with the following fields:
+
+  - `name`: the name of the validation package. This field is mandatory and MUST be included for each validation package object.
+    This name MUST be unique across all validation package objects, which means that only one version of a package can be contained in the file.
+  - `version`: the version of the validation package. This field is optional and MAY be included for each validation package object. If included, it MUST be a valid [semantic version](https://semver.org/), restricted to MAJOR.MINOR.PATCH format. If not included, this indicates that the latest available version of the validation package will be used.
+
+example:
+
+> This example shows a `validation_packages.yml` file that specifies that the current branch will be validated against: version `2.0.0-draft` of the ARC specification, version `1.0.0` of `package1`, version `2.0.0` of `package2`, and the latest available version of `package3`.
+
+```yaml
+arc_specification: 2.0.0-draft
+validation_packages:
+  - name: package1
+    version: 1.0.0
+  - name: package2
+    version: 2.0.0
+  - name: package3
+```
+
+### ARC Apps
+
+Continuous Quality Control makes it possible to check at any time in the ARC life cycle whether the ARC passes certain criteria or not.
+
+However, **whether** an ARC is valid for a given _target_ is only half of the equation - the other half is taking some kind of action based on this information. One large field of actions here is the publication of the ARC or (some of) its contents to an **endpoint repository (ER)** (e.g. [PRIDE](https://www.ebi.ac.uk/pride/), [ENA](https://www.ebi.ac.uk/ena/browser/home)).

-ARCs are supposed to be living research objects and are as such never complete. Nevertheless, a mechanism to report the current state and quality of an ARC is indispensable. Therefore, ARCs will be scored according to the amount of metadata information available (specifically with established minimal metadata standards such as MinSeqE, MIAPPE, etc.), the quality of data and metadata (this metric will be established in the next version), manual curation and review, and the reuse of ARCs by other researchers measured by physical includes of the data and referencing.
+In this scenario, a validation package SHOULD only determine if the content _COULD_ be published to the ER, and a subsequent service SHOULD then take the respective action based on the reported result of that package (e.g. fixing errors based on the report, or publishing the content to the ER).

-To foster FAIRification, badges will be earned by reaching certain scores for a transparent review process.
+**ARC apps** are services that provide URLs called _(CQC) Hook Endpoints_ that can be triggered manually or by the result of a validation package. They are intended to automate the process of taking action based on the result of a validation package.
+
+### Reference implementation
+
+PLANTDataHUB performs Continuous Quality Control of ARCs using the [arc-validate software suite](https://github.com/nfdi4plants/arc-validate), as described in our 2023 paper [PLANTdataHUB: a collaborative platform for continuous FAIR data sharing in plant research](https://doi.org/10.1111/tpj.16474).
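+
+For orientation, a `validation_summary.json` left by such a CQC run could look as follows. The instance conforms to the schema given under [Validation packages](#validation-packages); all names and numbers are illustrative:
+
+```json
+{
+  "Critical": {
+    "HasFailures": false,
+    "Total": 12,
+    "Passed": 12,
+    "Failed": 0,
+    "Errored": 0
+  },
+  "NonCritical": {
+    "HasFailures": true,
+    "Total": 5,
+    "Passed": 3,
+    "Failed": 2,
+    "Errored": 0
+  },
+  "ValidationPackage": {
+    "Name": "package1",
+    "Version": "1.0.0",
+    "Summary": "An illustrative validation package.",
+    "Description": "Hypothetical package used here only to demonstrate the summary format.",
+    "HookEndpoint": "https://example.org/cqc-hook"
+  }
+}
+```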
+
+The following sequence diagram shows the conceptual implementation of CQC pipelines in conjunction with ARC Apps connected via CQC Hooks on the reference DataHUB instance, with the following participants:
+
+- **User**: The user who works on an ARC published on the DataHUB
+- **ARC**: The ARC repository on the DataHUB
+- **DataHUB**: The DataHUB instance
+- **ARC App**: A service that provides a CQC Hook Endpoint to perform actions based on validation results and/or user input
+
+```mermaid
+sequenceDiagram
+
+    participant User
+    participant ARC
+    participant DataHUB
+    participant ARC App
+
+    Note over User, DataHUB: Validation (CQC pipeline)
+    User ->> ARC : commit
+    DataHUB ->> DataHUB : trigger validation for commit
+    DataHUB ->> ARC : commit validation results<br>to cqc branch
+    DataHUB ->> ARC : create badge
+    Note over User, ARC App: CQC Hooks
+    User ->> ARC App : click on badge link
+    DataHUB ->> ARC App : trigger some action based on validation results
+    ARC App ->> DataHUB : Request relevant information
+    DataHUB ->> ARC App : send relevant information (when granted access)
+    ARC App ->> ARC App : Perform action with retrieved data
+```

-## Best Practices
+# Best Practices

 In this section, we provide Best Practices to make the use of an ARC even more efficient and valuable for open science.

-### Community Specific Data Formats
+## Community Specific Data Formats

 It is recommended to use community-specific data formats covering most common measurement techniques. Using the following recommended formats will ensure improved accessibility and findability:

@@ -290,31 +645,32 @@ Notes:

 - In case of storing vendor-specific data within an ARC, it is strongly encouraged to accompany them by the corresponding open formats or provide a workflow for conversion or processing where this is possible and considered standard.

-### Compression and Encryption
+## Compression and Encryption

 Compression is preferable to save disk space and speed up data transfers, but it is not required. Without compression, workflows are simpler, as transparent compression and decompression is often not available. Uncompressed files are usually easier to index and search. Encryption is not advised (but could be an option to share sensitive data in an otherwise open ARC).

-### Directory and File Naming Conventions
+## Directory and File Naming Conventions

 Required files defined in the ARC structure need to be named accordingly. Files and folders specified < > can be named freely. As the ARC might be used by different persons and in different workflow contexts, we recommend concise filenames without blanks and special characters. Therefore, filenames SHOULD stick to small and capital letters without umlauts, accented and similar special characters. Numbers, hyphens, and underscores are suitable as well. Modern working environments can handle blanks in filenames, but they might confuse automatically run scripts and thus SHOULD be avoided. Depending on the intended number of people the ARC is shared with, certain information might prove useful to provide a fast overview in human-readable form in the filename, e.g. by providing abbreviations of the project, subproject, or person creating or working on a particular dataset. Date and time information might be encoded as well if it provides a better ordering or information for the particular purpose.

-## Appendix: Conversion of ARCs to RO Crates
+# Appendix: Conversion of ARCs to RO Crates

 [Research Object (RO) Crate](https://www.researchobject.org/ro-crate/) is a lightweight approach, based on [schema.org](https://schema.org), to package research data together with their metadata. An ARC can be augmented into an RO Crate by placing a metadata file `ro-crate-metadata.json` into the top-level ARC folder, which must conform to the [RO Crate specification](https://www.researchobject.org/ro-crate/1.1/). The ARC root folder is then simultaneously the RO Crate Root and represents an ISA investigation. The studies, assays and workflows are part of the investigation and linked to it using the typical RO-Crate methodology, e.g. the `hasPart` property of `http://schema.org/Dataset`.

-All four object types follow their corresponding profiles (WIP for studies, assays and workflows).
+All four object types follow their [corresponding profiles](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md).

 It is RECOMMENDED to adhere to the following conventions when creating this file:

-- The root data entity follows the [ISA Investigation profile](https://github.com/nfdi4plants/arc-to-rocrate/blob/main/profiles/investigation.md).
+- The root data entity follows the [ISA Investigation profile](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md).
   - The root data entity description is taken from the "Investigation Description" term in `isa.investigation.xlsx`.
-  - The root data entity authors are taken from the "Investigation Contacts" in `isa.investigation.xlsx`:
+  - The root data entity authors are taken from the "Investigation Contacts" in `isa.investigation.xlsx`.
   - The root data entity citations are taken from the "Investigation Publications" section in `isa.investigation.xlsx`.
 - For each assay and study linked from `isa.investigation.xlsx`, one dataset entity is provided in `ro-crate-metadata.json`. The Dataset id corresponds to the relative path of the assay ISA file under `assays/`, e.g. "sample-data/isa.assay.xlsx". Other metadata is taken from the corresponding terms in the corresponding `isa.assay.xlsx` or `isa.study.xlsx`.
 - The root data entity is connected to each assay and study through the `hasPart` Property.
+- The assay and study entities follow the [ISA Assay Profile](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md) or the [ISA Study Profile](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md), respectively.

 It is expected that future versions of this specification will provide additional guidance on a comprehensive conversion of ARC metadata into RO-Crate metadata.
diff --git a/ISA-XLSX.md b/ISA-XLSX.md
index 0f3afde..69de5d7 100644
--- a/ISA-XLSX.md
+++ b/ISA-XLSX.md
@@ -9,6 +9,7 @@ This document describes the ISA Abstract Model reference implementation specifie
 - [Investigation File](#investigation-file)
 - [Study File](#study-file)
 - [Assay File](#assay-file)
+- [Datamap File](#datamap-file)
 - [Top-level metadata sheets](#top-level-metadata-sheets)
   - [Ontology Source Reference section](#ontology-source-reference-section)
   - [INVESTIGATION section](#investigation-section)
@@ -23,7 +24,18 @@
   - [Factors](#factors)
   - [Components](#components)
   - [Parameters](#parameters)
-  - [Examples](#examples)
+  - [Comments](#comments)
+  - [Examples](#examples-1)
+- [Datamap Table sheets](#datamap-table-sheets)
+  - [Data](#data-column)
+  - [Explication](#explication-column)
+  - [Unit](#unit-column)
+  - [Object Type](#object-type-column)
+  - [Description](#description-column)
+  - [Generated By](#generated-by-column)
+  - [Comments](#comments-1)
+  - [Examples](#examples-2)
+
 Below we provide the schemas and the content rules for valid ISA-XLSX documents.
@@ -72,6 +84,9 @@ The `Investigation File` MUST contain one [`Top-Level Metadata sheet`](#top-leve
 - [`INVESTIGATION`](#investigation)
 - [`INVESTIGATION PUBLICATIONS`](#investigation-publications)
 - [`INVESTIGATION CONTACTS`](#investigation-contacts)
+
+Additionally, it MAY contain the following sections:
+
 - [`STUDY`](#study-section)
 - [`STUDY DESIGN DESCRIPTORS`](#study-design-descriptors)
 - [`STUDY PUBLICATIONS`](#study-publications)
@@ -91,10 +106,13 @@ The `Study File` MUST contain one [`Top-Level Metadata sheet`](#top-level-metada
 - [`STUDY`](#study-section)
 - [`STUDY DESIGN DESCRIPTORS`](#study-design-descriptors)
 - [`STUDY PUBLICATIONS`](#study-publications)
+- [`STUDY CONTACTS`](#study-contacts)
+
+Additionally, it MAY contain the following sections:
+
 - [`STUDY FACTORS`](#study-factors)
 - [`STUDY ASSAYS`](#study-assays)
 - [`STUDY PROTOCOLS`](#study-protocols)
-- [`STUDY CONTACTS`](#study-contacts)

 Additionally, the `Study File` SHOULD contain one or more [`Annotation Table sheet(s)`](#annotation-table-sheets), which MAY record provenance of biological samples, from source material through a collection process to sample material.

@@ -117,6 +135,16 @@ Therefore, the main entities of the `Assay File` should be `Samples` and `Data`.

 The `Assay File` implements the [`Assay`](https://isa-specs.readthedocs.io/en/latest/isamodel.html#assay) graph from the ISA Abstract Model.

+# Datamap File
+
+The `Datamap` represents a set of explanations about the `data` entities defined in `assays` and `studies`.
+
+The `Datamap File` MUST contain one [`Datamap table sheet`](#datamap-table-sheets). This sheet MUST be named `isa_datamap`.
+
+Therefore, the main entities of the `Datamap File` should be `Data`.
+
+The `Datamap File` acts as an extension of the `data` nodes defined in the [`Study and Assay graphs section`](https://isa-specs.readthedocs.io/en/latest/isamodel.html#study-and-assay-graphs) from the ISA Abstract Model.
+
 # Top-level metadata sheets

 The purpose of top-level metadata sheets is aggregating and listing top-level metadata. Each sheet consists of sections, each made up of a section header and key-value fields. Section headers MUST be completely written in upper case (e.g. STUDY), field headers MUST have the first letter of each word in upper case (e.g. Study Identifier); with the exception of the referencing label (REF).

@@ -477,22 +505,20 @@ For example, the `STUDY PROTOCOLS` section of an ISA-XLSX `isa.investigation.xls
 | Study Protocol Components Type Term Accession Number | http://purl.obolibrary.org/obo/NCIT_C68796 |  | ;;http://purl.obolibrary.org/obo/MS_1002732 |
 | Study Protocol Components Type Term Source REF | NCIT |  | ;;MS |
-
 ### STUDY CONTACTS

 This section MUST contain zero or more values.
 This section MUST contain the following labels, with the specified datatypes for values supported:

-| Label | Datatype | Description  |
+| Label | Datatype | Description |
 |------------------------------------------|---------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
 | Study Person Last Name | String | The last name of a person associated with the study. |
 | Study Person First Name | String | The first name of a person associated with the study. |
-| Study Person Mid Initials | String | The middle initials of a person associated with the study.
-|
+| Study Person Mid Initials | String | The middle initials of a person associated with the study. |
 | Study Person Email | String formatted as email | The email address of a person associated with the study. |
 | Study Person Phone | String | The telephone number of a person associated with the study. |
-| IStudy Person Fax | String | The fax number of a person associated with the study. |
+| Study Person Fax | String | The fax number of a person associated with the study. |
 | Study Person Address | String | The address of a person associated with the study. |
 | Study Person Affiliation | String | The organization affiliation for a person associated with the study. |
 | Study Person Roles | String or Ontology Annotation if accompanied by Term Accession Numbers and Term Source REFs | Term to classify the role(s) performed by this person in the context of the study, which means that the roles reported here need not correspond to roles held within their affiliated organization. Multiple annotations or values attached to one person can be provided by using a semicolon (“;”) Unicode (U+003B) as a separator (e.g.: submitter;funder;sponsor). The term can be free text or from, for example, a controlled vocabulary or an ontology. If the latter source is used the Term Accession Number and Term Source REF fields below are required. |
@@ -536,7 +562,7 @@ This section MUST contain the following labels, with the specified datatypes for
 | Label | Datatype | Description |
 |----------------------------------------------------|------------|----------------------------------------------------------------------------------------------------------------------|
 | Assay Measurement Type | String | A term to qualify the endpoint, or what is being measured (e.g. gene expression profiling or protein identification). The term can be free text or from, for example, a controlled vocabulary or an ontology. If the latter source is used the Term Accession Number and Term Source REF fields below are required. |
-| Study Assay Measurement Type Term Accession Number | String | The accession number from the Term Source associated with the selected term. |
+| Assay Measurement Type Term Accession Number | String | The accession number from the Term Source associated with the selected term. |
 | Assay Measurement Type Term Source REF | String | The Source REF has to match one of the Term Source Name declared in the Ontology Source Reference section. |
 | Assay Technology Type | String | Term to identify the technology used to perform the measurement, e.g. DNA microarray, mass spectrometry. The term can be free text or from, for example, a controlled vocabulary or an ontology. If the latter source is used the Term Accession Number and Term Source REF fields below are required. |
 | Assay Technology Type Term Accession Number | String | The accession number from the Term Source associated with the selected term. |
@@ -603,6 +629,8 @@ For example, the `ASSAY PERFORMERS` section of an ISA-XLSX `isa.assay.xlsx` file

 # Annotation Table sheets

+`Annotation Table sheets` are used to describe the experimental flow in a detailed, machine-readable way. In each sheet, there is a mapping from input entities to output entities, placed in the `Input` and `Output` columns, respectively. The other columns are then used to either describe those entities or the processes that led to this mapping.
+
 In the `Annotation Table sheets`, column headers MUST have the first letter of each word in upper case, with the exception of the referencing label (REF).

 The content of the annotation table MUST be placed in an `xlsx table` whose name starts with `annotationTable`. Each sheet MUST contain at most one such annotation table. Only cells inside this table are considered as part of the formatted metadata.

@@ -611,7 +639,7 @@

 ## Inputs and Outputs

-Each annotation table sheet MUST contain an `Input` and an `Output` column, which denote the Input and Output node of the `Process` node respectively. They MUST be formatted in the pattern `Input []` and `Output []`.
+Each annotation table sheet MUST contain at most one `Input` and at most one `Output` column, which denote the Input and Output node of the `Process` node respectively. They MUST be formatted in the pattern `Input []` and `Output []`.

 `NodeTypes` MUST be one of the following:

@@ -619,19 +647,31 @@ Each annotation table sheet MUST contain an `Input` and an `Output` column, whic

 - A `Sample` MUST be indicated with the node type `Sample Name`.

-- An `Extract Material` MUST be indicated with the node type `Extract Name`.
+- An `Extract Material` MUST be indicated with the node type `Material Name`.

-- A `Labeled Extract Material` MUST be indicated with the node type `Labeled Extract Name`.
+- A `Data` object MUST be indicated with the node type `Data`.

-- An `Image File` MUST be indicated with the node type `Image File`.
+`Source Names`, `Sample Names`, `Material Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity.

-- A `Raw Data File` MUST be indicated with the node type `Raw Data File`.
+The `Data` node type MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. If the annotation of the `Data` node refers not to the complete resource, but a part of it, a `Selector` MAY be added. This Selector MUST be separated from the resource location using a `#`, with no whitespace in between: `location#selector`.
+If appropriate, the Selector SHOULD be formatted according to IRI fragment selectors specified by [W3](https://www.w3.org/TR/annotation-model/#fragment-selector).

-- A `Derived Data File` MUST be indicated with the node type `Derived Data File`.
+The format of the data resource MAY be further qualified using a `Data Format` column. The `Data Format` SHOULD be expressed using a [MIME format](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), most commonly consisting of two parts: a type and a subtype, separated by a slash (/) — with no whitespace between: `type/subtype`. If appropriate, a format from the list composed by [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml)
+SHOULD be picked. Unregistered or niche encoding and file formats MAY be indicated instead via the most appropriate URL.

-`Source Names`, `Sample Names`, `Extract Names` and `Labeled Extract Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity.
+The format and usage information about the Selector MAY be further qualified using a `Data Selector Format` column. The `Data Selector Format` SHOULD point to a web resource containing instructions about how the Selector is formatted and how it should be interpreted.

-`Image File`, `Raw Data File` or `Derived Data File` node types MUST correspond to a relevant file location.
+
+## Examples
+
+### Data Location and Selector
+
+In this example, there is a measurement of two `Samples`, namely `input1` and `input2`. The values measured are both written into the same data resource in the location `result.csv`, whose formatting is tabular, according to the `Data Format` being `text/csv`. To distinguish between the measurement values stemming from the different inputs, selectors were added to the resource location (separated by a `#`), namely `col=1` and `col=2`. The specification of the formatting of these selectors can be found in the provided link, namely `https://datatracker.ietf.org/`.
+
+| Input [Sample Name] | Output [Data]    | Data Format | Data Selector Format                          |
+|---------------------|------------------|-------------|-----------------------------------------------|
+| input1              | result.csv#col=1 | text/csv    | https://datatracker.ietf.org/doc/html/rfc7111 |
+| input2              | result.csv#col=2 | text/csv    | https://datatracker.ietf.org/doc/html/rfc7111 |

 ## Protocol Columns

@@ -651,9 +691,9 @@

 Where a value is an `Ontology Annotation` in a table file, `Term Accession Number` and `Term Source REF` fields MUST follow the column cell in which the value is entered. These two columns SHOULD contain further ontological information about the header. In this case, following the static header string, separated by a single space, there MUST be a short ontology term identifier formatted as CURIEs (prefixed identifiers) of the form `:` (specified [here](http://obofoundry.org/id-policy)) inside `()` brackets.
 For example, a characteristic type `organism` with a value of `Homo sapiens` can be qualified with an `Ontology Annotation` of a term from NCBI Taxonomy as follows:

-| Characteristics [organism] | Term Source REF (OBI:0100026) | Term Accession Number (OBI:0100026) |
+| Characteristic [organism] | Term Source REF (OBI:0100026) | Term Accession Number (OBI:0100026) |
 |-----------------------------|-------------------|------------------------------------------------------|
-| Homo sapiens | NCBITaxon | [http://…/NCBITAXON/9606](http://.../NCBITAXON/9606) |
+| Homo sapiens | NCBITaxon | [http://…/NCBITAXON_9606](http://.../NCBITAXON_9606) |

 An `Ontology Annotation` MAY be applied to any appropriate `Characteristic`, `Parameter`, `Factor`, `Component` or `Protocol Type`.

@@ -709,6 +749,14 @@ A `Parameter` can be used to specify any additional information about the experi
 |--------------------------------|--------|-------------------|------------------------------------------------------|
 | 300 | Kelvin | UO | [http://…/obo/UO_0000032](http://purl.obolibrary.org/obo/UO_0000032) |
+
+## Comments
+
+A `Comment` can be used to provide some additional information. Columns headed with `Comment[]` MAY appear anywhere in the Annotation Table. The comment always refers to the Annotation Table. The value MUST be free text.
+
+| Comment [Answer to everything] |
+|--------------------------------|
+| forty-two |
+
 ## Others

 Columns whose headers do not follow any of the formats described above are considered additional payload and are out of the scope of this specification.

@@ -735,4 +783,85 @@ If we pool two sources into a single sample, we might represent this as:

 | Input [Source Name] | Protocol REF | Output [Sample Name] |
 |---------------|-------------------|---------------|
 | source1 | sample collection | sample1 |
-| source2 | sample collection | sample1 |
\ No newline at end of file
+| source2 | sample collection | sample1 |
+
+# Datamap Table sheets
+
+`Datamap Table sheets` are used to describe the contents of data files.
+
+In the `Datamap Table sheets`, column headers MUST have the first letter of each word in upper case, with the exception of the referencing label (REF).
+
+The content of the datamap table MUST be placed in an `xlsx table` whose name equals `datamapTable`. Each sheet MUST contain at most one such datamap table. Only cells inside this table are considered as part of the formatted metadata.
+
+`Datamap Table sheets` are structured with fields organized on a per-row basis. The first row MUST be used for column headers. Each body row is an implementation of a `data` node.
+
+## Data column
+
+Every `Datamap Table sheet` MUST contain a `Data` column. Every object in this column MUST correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns. If the annotation of the `Data` node refers not to the complete resource, but a part of it, a `Selector` MAY be added. This Selector MUST be separated from the resource location using a `#`, with no whitespace in between: `location#selector`. If appropriate, the Selector SHOULD be formatted according to IRI fragment selectors specified by [W3](https://www.w3.org/TR/annotation-model/#fragment-selector).
+
+The format of the data resource MAY be further qualified using a `Data Format` column.
+The `Data Format` SHOULD be expressed using a [MIME format](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), most commonly consisting of two parts: a type and a subtype, separated by a slash (/) — with no whitespace between: `type/subtype`. If appropriate, a format from the list composed by [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml)
+SHOULD be picked. Unregistered or niche encoding and file formats MAY be indicated instead via the most appropriate URL.
+
+The format and usage information about the Selector MAY be further qualified using a `Data Selector Format` column. The `Data Selector Format` SHOULD point to a web resource containing instructions about how the Selector is formatted and how it should be interpreted.
+
+## Explication column
+
+Every `Datamap Table sheet` SHOULD contain an `Explication` column. The `Explication` adds explicit meaning to the data node. The value MUST be free text, or an [`Ontology Annotation`](#ontology-annotations).
+
+| Explication | Term Source REF | Term Accession Number |
+|------------------------|-------------------|-------------------------|
+| average value | OBI | [http://…/obo/OBI_0000679](http://purl.obolibrary.org/obo/OBI_0000679) |
+
+## Unit column
+
+Every `Datamap Table sheet` SHOULD contain a `Unit` column. The `Unit` adds a unit of measurement to the data node. The value MUST be free text, or an [`Ontology Annotation`](#ontology-annotations).
+
+| Unit | Term Source REF | Term Accession Number |
+|------------------------|-------------------|-------------------------|
+| milligram per milliliter | UO | [http://…/obo/UO_0000176](http://purl.obolibrary.org/obo/UO_0000176) |
+
+## Object Type column
+
+Every `Datamap Table sheet` SHOULD contain an `Object Type` column. The `Object Type` defines the shape or format in which the data node is represented. The value MUST be free text, or an [`Ontology Annotation`](#ontology-annotations).
+
+| Object Type | Term Source REF | Term Accession Number |
+|------------------------|-------------------|-------------------------|
+| Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) |
+
+## Description column
+
+Every `Datamap Table sheet` SHOULD contain a `Description` column. The `Description` gives additional, human-readable context about the data node. The value MUST be free text.
+
+| Description |
+|------------------------|
+| The average protein concentration for the given gene |
+
+## Generated By column
+
+Every `Datamap Table sheet` SHOULD contain a `Generated By` column. The `Generated By` names the tool that created the data node. The value MUST be free text.
+
+If possible, the value in this column SHOULD correspond to a relevant data resource location, following the [Data Path Annotation](/ARC%20specification.md#data-path-annotation) patterns.
+
+| Generated By |
+|------------------------|
+| GeneStatisticsTool.exe |
+
+## Comments
+
+A `Comment` can be used to provide some additional information. Columns headed with `Comment[]` MAY appear anywhere in the Datamap Table. The comment always refers to the Datamap Table. The value MUST be free text.
+
+| Comment [Answer to everything] |
+|--------------------------------|
+| forty-two |
+
+## Examples
+
+For example, a simple `datamap` table representing a tabular data file might look as follows:
+
+| Data | Explication | Term Source REF | Term Accession Number | Unit | Term Source REF | Term Accession Number | Object Type | Term Source REF | Term Accession Number | Description | Generated By |
+|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------|-------------------|---------------|---------------|---------------|
+| MyData.csv#col=1 | Gene Identifier | NCIT | [http://…/obo/NCIT_C48664](http://purl.obolibrary.org/obo/NCIT_C48664) | | | | String | NCIT | [http://…/obo/NCIT_C45253](http://purl.obolibrary.org/obo/NCIT_C45253) | Short hand identifier of the gene coding for the protein. | GeneStatisticsTool.exe |
+| MyData.csv#col=2 | average value | OBI | [http://…/obo/OBI_0000679](http://purl.obolibrary.org/obo/OBI_0000679) | milligram per milliliter | UO | [http://…/obo/UO_0000176](http://purl.obolibrary.org/obo/UO_0000176) | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) | The average protein concentration for the given gene | GeneStatisticsTool.exe |
+| MyData.csv#col=3 | p-value | OBI | [http://…/obo/OBI_0000175](http://purl.obolibrary.org/obo/OBI_0000175) | | | | Float | NCIT | [http://…/obo/NCIT_C48150](http://purl.obolibrary.org/obo/NCIT_C48150) | p-value of t-test against control. | GeneStatisticsTool.exe |
+
+In this example, the `datamap` table describes a single data file named `MyData.csv`. This file contains three columns: the first column contains gene identifiers, while the other two contain the results of a statistical analysis performed by the tool GeneStatisticsTool.exe.
\ No newline at end of file