bigbio · ypriverol · Sep 14, 2024 · Sep 12, 2024 · Sep 12, 2024 · Sep 12, 2024
diff --git a/docs/README.adoc b/docs/README.adoc
@@ -103,7 +103,7 @@ For example:
 [[identification-scores]]
 === Identification Scores
 
-Every workflow within quantms uses different identification scores to determinate the quality of the identification. `IdScores` in quantms try to capture multiple scores from different workflows such as the `Comet:xcorr` or `DIA-NN:Q.Value`. The identification scores are stored as a key/value pair where the key is the name of the score (is RECOMMENDED to use HUPO-PSI MS ontology) and the value is the score value. This concept is used in the following outputs:
+Each workflow within quantms uses different identification scores to determine the quality of an identification. The `additional_scores` field in quantms captures multiple scores from different workflows, such as `Comet:xcorr` or `DIA-NN:Q.Value`. The identification scores are stored as key/value pairs, where the key is the name of the score (it is RECOMMENDED to use the HUPO-PSI MS ontology) and the value is the score value. This concept is used in the following outputs:
 
 - `["'Comet:xcorr':67.8", "'DIA-NN:Q.Value':0.01"]`
 
@@ -113,6 +113,15 @@ This concept is used in the following outputs:
 - <<feature>>
 - <<peptide>>
 
+[[cv-terms]]
+=== Controlled Vocabulary Terms
+
+The <<psm>> and <<feature>> views use controlled vocabulary (CV) terms to describe the data. CV terms standardize the data and make it easier to integrate with other datasets. Each term is stored as a key/value pair, where the key is the name of the controlled vocabulary term and the value is the term value. For example:
+
+- `["ms level": "2", "deconvoluted data": null]`
+
+The name (key) of the controlled vocabulary term MUST be provided; the value is optional.
+
 [[serialization]]
 == Serialization formats
 
@@ -164,7 +173,18 @@ The structure of the version is as follows `{major release}.{minor update}`: The
 - All views and serialization formats will have a version number in the way: `quantmsio_version: {}`. This will help to identify the version of the specification used to generate the file.
 - Major release changes will be backward incompatible, while minor updates will be backward compatible.
 
-[[psm]]
+[[software]]
+== Software provider
+
+The data within quantms.io is mainly generated by the https://github.com/bigbio/quantms[quantms workflow]. However, the format is open and can be used by any software provider that wants to generate data in this format. The software provider and the version of the software used to generate the data are stored in the project view <<project>> as:
+
+[source,json]
+----
+"software_provider": {
+    "name": "quantms",
+    "version": "1.3.0"
+}
+----
 
 [[project]]
 == Project quantms.io
@@ -190,8 +210,8 @@ The project view is the file that stores the metadata of the entire `quantms.io`
 | `acquisition_properties`  | Properties of the data acquisition methods  | list[key/value]
 | `quantms_files`           | Files related to quantMS analysis           | list[key/value]
 | `quantmsio_version`       | Version of the `quantms.io`                 | String
-| `quantmsversion`         | Version of the quantms workflow             | String
-| `comments`               | Additional comments or notes                | List of Strings
+| `software_provider`       | The software used to generate the data <<software>> | Key/Value
+| `comments`                | Additional comments or notes                | List of Strings
 |===
 
 Key/Value pair object: The key/value pairs are used to store the acquisition properties, and the  quantms files.
@@ -208,21 +228,67 @@ Example of ``AcquisitionProperties``:
 
 === Project files
 
-Recommendations for the file name in the quantms project.
+In the current version <<version>>, the files within a project are optional. Files within a project should be listed in `quantms_files`; for every file, the following information is recorded:
+
+- `file_name`: The name of the file or folder.
+- `is_folder`: A boolean value that indicates whether the entry is a folder.
+- `partition_fields`: The fields used to partition the data in the file. Partitioning speeds up data retrieval and filtering. This field is optional.
+
+NOTE: Parquet files can be stored as folders when the data is partitioned by some fields. For example, a parquet file that is partitioned by the `sample_accession` field will be stored as a folder with the name of the field and the value of the field.
 
 Example of ``quantms_files``:
 
 [source,json]
 ----
-   "quantms_files": [
-        {"psm_file":   ["PXD004683-550e8400-e29b-41d4.1.psm.parquet",
-                        "PXD004683-550e8400-e29b-41d4.2.psm.parquet"
-        ]},
-        {"feature_file": ["PXD004683-958e8400-e29b-41f4.feature.parquet"]},
-        {"differential_file": ["PXD004683-a716.differential.tsv"]},
-        {"absolute_file":     ["PXD004683-e29b-41f4-a716.absolute.tsv"]},
-        {"sdrf_file":         ["PXD004683-e29b-41f4-a716.sdrf.tsv"]}
-   ]
+{
+  "quantms_files": [
+    {
+      "psm_file": [
+        {
+          "file_name": "PXD004683-550e8400-e29b-41d4.1.psm.parquet",
+          "is_folder": false
+        },
+        {
+          "file_name": "PXD004683-550e8400-e29b-41d4.2.psm.parquet",
+          "is_folder": false
+        }
+      ]
+    },
+    {
+      "feature_file": [
+        {
+          "file_name": "PXD004683-958e8400-e29b-41f4.feature.parquet",
+          "is_folder": true,
+          "partition_fields": ["sample_accession"]
+        }
+      ]
+    },
+    {
+      "differential_file": [
+        {
+          "file_name": "PXD004683-a716.differential.tsv",
+          "is_folder": false
+        }
+      ]
+    },
+    {
+      "absolute_file": [
+        {
+          "file_name": "PXD004683-e29b-41f4-a716.absolute.tsv",
+          "is_folder": false
+        }
+      ]
+    },
+    {
+      "sdrf_file": [
+        {
+          "file_name": "PXD004683-e29b-41f4-a716.sdrf.tsv",
+          "is_folder": false
+        }
+      ]
+    }
+  ]
+}
 ----
 
 Example:
@@ -273,13 +339,45 @@ Example:
         {"precursor mass tolerance": "20 ppm"},
         {"fragment mass tolerance": "0.6 Da"}
     ],
-    "quantms_files": [
-        {"feature_file": ["PXD014414.feature.parquet"]},
-        {"sdrf_file": ["PXD014414.sdrf.tsv"]},
-        {"psm_file": ["PXD014414-f4fb88f6.psm.parquet"]},
-        {"differential_file": ["PXD014414-3026e5d5.differential.tsv"]}
-    ],
-    "quantms_version": "1.2.0",
+  "quantms_files": [
+    {
+      "feature_file": [
+        {
+          "file_name": "PXD014414.feature.parquet",
+          "is_folder": false
+        }
+      ]
+    },
+    {
+      "sdrf_file": [
+        {
+          "file_name": "PXD014414.sdrf.tsv",
+          "is_folder": false
+        }
+      ]
+    },
+    {
+      "psm_file": [
+        {
+          "file_name": "PXD014414-f4fb88f6.psm.parquet",
+          "is_folder": false
+        }
+      ]
+    },
+    {
+      "differential_file": [
+        {
+          "file_name": "PXD014414-3026e5d5.differential.tsv",
+          "is_folder": false
+        }
+      ]
+    }
+  ],
+    "software_provider": {
+       "name": "quantms",
+       "version": "1.3.0"
+    },
     "quantmsio_version": "1.0",
     "comments": []
    }
@@ -828,6 +926,11 @@ Some of the fields are shared between the <<feature>> and <<psm>> views, they ca
 | -
 |===
 
+[[diann-scan]]
+===== DIA-NN Scan
+
+The `DIA-NN` scan is a string that contains the scan number of the MS2 spectrum used to identify the peptide. The number is obtained from the `rt` field together with the mzML information.
+
 ==== Format
 
 The feature view can be found in link:feature.avsc[feature.avsc].
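The `quantms_files` layout introduced in the README changes above nests one list of file entries under each file type. A minimal Python sketch of how a reader might flatten that structure is shown here. The helper name `list_project_files`, the `EXAMPLE_PROJECT` document, and the defaults used for missing keys are assumptions made for illustration; they are not part of the specification.

```python
import json

# Hypothetical project fragment that follows the structure shown in the README example.
EXAMPLE_PROJECT = json.loads("""
{
  "quantms_files": [
    {"psm_file": [
      {"file_name": "PXD004683-550e8400-e29b-41d4.1.psm.parquet", "is_folder": false},
      {"file_name": "PXD004683-550e8400-e29b-41d4.2.psm.parquet", "is_folder": false}
    ]},
    {"feature_file": [
      {"file_name": "PXD004683-958e8400-e29b-41f4.feature.parquet",
       "is_folder": true,
       "partition_fields": ["sample_accession"]}
    ]}
  ],
  "software_provider": {"name": "quantms", "version": "1.3.0"}
}
""")


def list_project_files(project):
    """Flatten `quantms_files` into one record per file or folder."""
    records = []
    for group in project.get("quantms_files", []):
        # Each group maps a file type (e.g. "psm_file") to a list of entries.
        for file_type, entries in group.items():
            for entry in entries:
                records.append({
                    "file_type": file_type,
                    "file_name": entry["file_name"],
                    # Assumption: a missing flag is treated as a plain file.
                    "is_folder": bool(entry.get("is_folder", False)),
                    # `partition_fields` is optional, so default to an empty list.
                    "partition_fields": entry.get("partition_fields", []),
                })
    return records


if __name__ == "__main__":
    for record in list_project_files(EXAMPLE_PROJECT):
        print(record)
```

Flattening keeps the file type next to each entry, which makes it straightforward to decide whether an entry should be opened as a single file or as a partitioned folder.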

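The DIA-NN scan paragraph in the README changes says only that the scan number is derived from the `rt` field and the mzML information. One possible way to do this, sketched below in Python, is to pick the MS2 scan whose retention time is closest to the reported `rt`. The nearest-match rule, the function name `nearest_ms2_scan`, and the `(rt_seconds, scan_id)` input shape are all assumptions; the actual procedure used by quantms.io may differ.

```python
from bisect import bisect_left


def nearest_ms2_scan(rt_seconds, ms2_scans):
    """Return the scan id of the MS2 spectrum closest in retention time.

    ms2_scans is a list of (rt_seconds, scan_id) tuples sorted by retention
    time, for example collected from an mzML file beforehand (hypothetical
    input shape, not defined by the specification).
    """
    if not ms2_scans:
        raise ValueError("no MS2 scans available")
    rts = [rt for rt, _ in ms2_scans]
    i = bisect_left(rts, rt_seconds)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = ms2_scans[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda scan: abs(scan[0] - rt_seconds))[1]


if __name__ == "__main__":
    scans = [(10.0, "100"), (12.5, "101"), (15.0, "102")]
    print(nearest_ms2_scan(12.4, scans))
```

Scan ids are kept as strings here because the README describes the DIA-NN scan as a string.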
diff --git a/docs/feature.avsc b/docs/feature.avsc
@@ -7,27 +7,27 @@
 		{"name": "modifications", "type": ["null", {"type": "array", "items": "string"}], "doc": "List of modifications as string array, easy for search and filter"},
 		{"name": "modification_details","type": ["null", {"type": "array", "items": "string"}],
 		"doc": "List of alternative site probabilities for the modification format: read the specification for more details"},
-		{"name": "posterior_error_probability", "type": ["null", "double"], "doc": "Posterior error probability for the given peptide spectrum match"},
-		{"name": "global_qvalue", "type": ["null", "double"], "doc": "Global q-value of the peptide or psm at the level of the experiment"},
+		{"name": "posterior_error_probability", "type": ["null", "float32"], "doc": "Posterior error probability for the given peptide spectrum match"},
+		{"name": "global_qvalue", "type": ["null", "float32"], "doc": "Global q-value of the peptide or psm at the level of the experiment"},
 
 		{"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
-		{"name": "calculated_mz", "type": "double", "doc": "Theoretical peptide mass-to-charge ratio based on identified sequence and modifications"},
-		{"name": "additional_scores", "type": {"type": "array","items": { "type": "struct", "field": { "name": "string", "value": "double"}}} , 
+		{"name": "calculated_mz", "type": "float32", "doc": "Theoretical peptide mass-to-charge ratio based on identified sequence and modifications"},
+		{"name": "additional_scores", "type": {"type": "array","items": { "type": "struct", "field": { "name": "string", "value": "float32"}}} ,
 		"doc": "List of structures, each structure contains two fields: name and value."},
 
 		{"name": "pg_accessions", "type": ["null", {"type": "array","items": "string"}], "doc": "Protein group accessions of all the proteins that the peptide maps to"},
 		{"name": "pg_positions", "type": ["null", {"type": "array","items": "string"}], "doc": "Protein start and end positions written as start_post:end_post"},
 		{"name": "unique", "type": ["null", "int"], "doc": "Unique peptide indicator, if the peptide maps to a single protein, the value is 1, otherwise 0"},
-		{"name": "protein_global_qvalue", "type": ["null", "double"], "doc": "Global q-value of the protein group at the experiment level"},
+		{"name": "protein_global_qvalue", "type": ["null", "float32"], "doc": "Global q-value of the protein group at the experiment level"},
 		{"name": "gene_accessions", "type": ["null", {"type": "array", "items": "string"}], "doc": "Gene accessions, as string array"},
 		{"name": "gene_names", "type": ["null", {"type": "array", "items": "string"}], "doc": "Gene names, as string array"},
 
 		{"name": "precursor_charge", "type": "int", "doc": "Precursor charge"},
-		{"name": "observed_mz", "type": "double", "doc": "Experimental peptide mass-to-charge ratio of identified peptide (in Da)"},
-		{"name": "rt", "type": ["null", "double"], "doc": "MS2 scan’s precursor retention time (in seconds)"},
-		{"name": "predicted_rt", "type": ["null", "double"], "doc": "Predicted retention time of the peptide (in seconds)"},
+		{"name": "observed_mz", "type": "float32", "doc": "Experimental peptide mass-to-charge ratio of identified peptide (in Da)"},
+		{"name": "rt", "type": ["null", "float32"], "doc": "MS2 scan’s precursor retention time (in seconds)"},
+		{"name": "predicted_rt", "type": ["null", "float32"], "doc": "Predicted retention time of the peptide (in seconds)"},
 
-		{"name": "intensity", "type":  ["null","double"], "doc": "The intensity-based abundance of the peptide in the sample"},
+		{"name": "intensity", "type":  ["null","float32"], "doc": "The intensity-based abundance of the peptide in the sample"},
 
 		{"name": "sample_accession", "type":  ["null","string"], "doc": "The sample accession in the SDRF, which column is called source name"},
 		{"name": "condition", "type":  ["null","string"], "doc": "The value for the factor value column in the SDRF, for example, the tissue factor value[organism part]"},
@@ -39,7 +39,36 @@
 
 		{"name": "psm_reference_file_name", "type": ["null","string"], "doc": "The reference file containing the best psm that identified the feature. Note: This file can be different from the file that contains the feature ().ReferenceFile"},
 		{"name": "psm_scan_number", "type": ["null","string"], "doc": "The scan number of the spectrum. The scan number or index of the spectrum in the file"},
-		{"name": "rt_start", "type": ["null", "double"], "doc": "Start of the retention time window for feature"},
-		{"name": "rt_stop", "type": ["null", "double"], "doc": "End of the retention time window for feature"},		
+		{"name": "rt_start", "type": ["null", "float32"], "doc": "Start of the retention time window for feature"},
+		{"name": "rt_stop", "type": ["null", "float32"], "doc": "End of the retention time window for feature"},
+		{
+      "name": "cv_params",
+      "type": [
+        "null",
+        {
+          "type": "array",
+          "items": {
+            "type": "record",
+            "name": "cv_param",
+            "doc": "Controlled vocabulary (CV) parameters providing additional metadata for the scan.",
+            "fields": [
+              {
+                "name": "name",
+                "type": "string",
+                "doc": "Name of the CV term (e.g., from PSI-MS or other ontologies)."
+              },
+              {
+                "name": "value",
+                "type": ["null", "string"],
+                "doc": "Value associated with the CV term."
+              }
+            ]
+          }
+        }
+      ],
+      "default": null,
+      "doc": "Optional list of CV parameters for additional metadata."
+    },
+		{"name": "quantmsio_version", "type": "string", "doc": "The version of quantms.io"}
 	]
 }
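The `cv_params` field added to `feature.avsc` stores a list of name/value records, and the README states that the name MUST be provided while the value is optional. A small Python sketch of how a consumer might enforce that rule while converting the list to a dictionary follows. The function name `cv_params_to_dict` and the choice to raise `ValueError` are illustrative assumptions.

```python
def cv_params_to_dict(cv_params):
    """Convert a list of {"name": ..., "value": ...} records to a dict.

    The name is mandatory; a missing value is kept as None.
    """
    if cv_params is None:
        # The field itself is optional and may be null.
        return {}
    result = {}
    for param in cv_params:
        name = param.get("name")
        if not name:
            raise ValueError("cv_param is missing the mandatory 'name'")
        result[name] = param.get("value")
    return result


if __name__ == "__main__":
    params = [
        {"name": "ms level", "value": "2"},
        {"name": "deconvoluted data", "value": None},
    ]
    print(cv_params_to_dict(params))
```

If the same term name can occur more than once per row, a dictionary would silently keep only the last value, so a list of pairs may be the safer representation in that case.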
diff --git a/docs/peptide.avsc b/docs/peptide.avsc
@@ -7,24 +7,25 @@
 		{"name": "modifications", "type": ["null", {"type": "array", "items": "string"}], "doc": "List of modifications as string array, easy for search and filter"},
 		{"name": "modification_details","type": ["null", {"type": "array", "items": "string"}],
 		"doc": "List of alternative site probabilities for the modification format: read the specification for more details"},
-		{"name": "posterior_error_probability", "type": ["null", "double"], "doc": "Posterior error probability for the given peptide spectrum match"},
-		{"name": "global_qvalue", "type": ["null", "double"], "doc": "Global q-value of the peptide or psm at the level of the experiment"},
+		{"name": "posterior_error_probability", "type": ["null", "float32"], "doc": "Posterior error probability for the given peptide spectrum match"},
+		{"name": "global_qvalue", "type": ["null", "float32"], "doc": "Global q-value of the peptide or psm at the level of the experiment"},
 
 		{"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
-		{"name": "calculated_mz", "type": "double", "doc": "Theoretical peptide mass-to-charge ratio based on identified sequence and modifications"},
-		{"name": "additional_scores", "type": {"type": "array","items": { "type": "struct", "field": { "name": "string", "value": "double"}}} , 
+		{"name": "calculated_mz", "type": "float32", "doc": "Theoretical peptide mass-to-charge ratio based on identified sequence and modifications"},
+		{"name": "additional_scores", "type": {"type": "array","items": { "type": "struct", "field": { "name": "string", "value": "float32"}}} ,
 		"doc": "List of structures, each structure contains two fields: name and value."},
 
 		{"name": "pg_accessions", "type": ["null",{"type": "array","items": "string"}], "doc": "Protein group accessions of all the proteins that the peptide maps to"},
 		{"name": "pg_positions", "type": ["null",{"type": "array","items": "string"}], "doc": "Protein start and end positions written as start_post:end_post"},
 		{"name": "unique", "type": ["null", "int"], "doc": "Unique peptide indicator, if the peptide maps to a single protein, the value is 1, otherwise 0"},
-		{"name": "protein_global_qvalue", "type": ["null", "double"], "doc": "Global q-value of the protein group at the experiment level"},
+		{"name": "protein_global_qvalue", "type": ["null", "float32"], "doc": "Global q-value of the protein group at the experiment level"},
 		{"name": "gene_accessions", "type": ["null", {"type": "array", "items": "string"}], "doc": "Gene accessions, as string array"},
 		{"name": "gene_names", "type": ["null", {"type": "array", "items": "string"}], "doc": "Gene names, as string array"},
 
 		{"name": "precursor_charge", "type": "int", "doc": "Precursor charge"},
-		{"name": "observed_mz", "type": "double", "doc": "Experimental peptide mass-to-charge ratio of identified peptide (in Da)"},
-		{"name": "rt", "type": ["null", "double"], "doc": "MS2 scan’s precursor retention time (in seconds)"},
-		{"name": "predicted_rt", "type": ["null", "double"], "doc": "Predicted retention time of the peptide (in seconds)"},
+		{"name": "observed_mz", "type": "float32", "doc": "Experimental peptide mass-to-charge ratio of identified peptide (in Da)"},
+		{"name": "rt", "type": ["null", "float32"], "doc": "MS2 scan’s precursor retention time (in seconds)"},
+		{"name": "predicted_rt", "type": ["null", "float32"], "doc": "Predicted retention time of the peptide (in seconds)"},
+		{"name": "quantmsio_version", "type": "string", "doc": "The version of quantms.io"}
 	]
 }
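The schema changes above switch numeric fields such as `calculated_mz`, `observed_mz`, and `rt` from `double` to `float32`. Assuming `float32` means IEEE 754 single precision, the relative rounding error is bounded by about 0.06 ppm for values in the normal range. The Python sketch below round-trips a value through single precision and reports the error in ppm; the helper names are hypothetical.

```python
import struct


def as_float32(value):
    """Round a Python float (double precision) to the nearest float32."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def float32_error_ppm(value):
    """Relative rounding error, in parts per million, of storing value as float32."""
    if value == 0:
        raise ValueError("relative error is undefined for zero")
    return abs(as_float32(value) - value) / abs(value) * 1e6


if __name__ == "__main__":
    mz = 1234.56789  # illustrative m/z value, not taken from a real dataset
    print(as_float32(mz), float32_error_ppm(mz))
```

Whether that error is acceptable depends on the mass tolerance used downstream; it is small relative to the 20 ppm precursor tolerance shown in the README example, but it is not zero.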
diff --git a/docs/protein.avsc b/docs/protein.avsc
@@ -4,15 +4,16 @@
 	"fields": [
 		{"name": "quantmsio_version", "type": "string", "doc": "The version of the quantms.io specification"},
 		{"name": "protein_accessions", "type": {"type": "array","items": "string"}},
-		{"name": "abundance","type": ["null", "double"]},
+		{"name": "abundance","type": ["null", "float32"]},
 		{"name": "sample_accession", "type":  "string", "doc": "The sample accession in the SDRF, which column is called source name"},
-		{"name": "global_qvalue","type": ["null", "double"] "doc": "The global qvalue for a given protein or protein groups"},
+		{"name": "global_qvalue","type": ["null", "float32"], "doc": "The global qvalue for a given protein or protein groups"},
 		{"name": "is_decoy","type": ["null", "int"], "doc": "If the protein is decoy"},
 		{"name": "best_id_score", "type": "string", "doc": "The best search engine score for the identification"},
 		{"name": "gene_accessions", "type": ["null", {"type": "array","items": "string"}], "doc": "The gene accessions corresponding to every protein"},
-		{"name": "gene_names", "type": ["null", {"type": "array","items": "string"}] "doc": "The gene names corresponding to every protein"},
+		{"name": "gene_names", "type": ["null", {"type": "array","items": "string"}], "doc": "The gene names corresponding to every protein"},
 		{"name": "number_peptides","type": ["null", "int"], "doc": "The total number of peptides for a give protein"},
 		{"name": "number_psms","type": ["null", "int"], "doc": "The total number of peptide spectrum matches"},
-		{"name": "number_nique_peptides","type": ["null", "int"], "doc": "The total number of unique peptides"},
+		{"name": "number_unique_peptides","type": ["null", "int"], "doc": "The total number of unique peptides"},
+		{"name": "quantmsio_version", "type": "string", "doc": "The version of quantms.io"}
 	]
 }
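In the `protein.avsc` hunk above, `quantmsio_version` appears both as an existing field near the top of `fields` and as a newly added field at the end, which would give the record two fields with the same name. A quick Python check for this kind of duplication is sketched below. It assumes the schema has already been parsed into a dictionary with a top-level `fields` list; the function name is hypothetical.

```python
from collections import Counter


def duplicate_field_names(schema):
    """Return the sorted field names that occur more than once in a record schema."""
    counts = Counter(field["name"] for field in schema.get("fields", []))
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    # Reduced, hypothetical version of the protein schema after the change.
    schema = {
        "fields": [
            {"name": "quantmsio_version", "type": "string"},
            {"name": "protein_accessions", "type": {"type": "array", "items": "string"}},
            {"name": "number_unique_peptides", "type": ["null", "int"]},
            {"name": "quantmsio_version", "type": "string"},
        ]
    }
    print(duplicate_field_names(schema))
```

Running such a check on every schema file before merging would catch repeated names early, since Avro requires field names within a record to be unique.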