Minor refinements about software and cvterms #55

Merged
merged 6 commits into from
Sep 14, 2024
145 changes: 124 additions & 21 deletions docs/README.adoc
@@ -103,7 +103,7 @@ For example:
[[identification-scores]]
=== Identification Scores

Every workflow within quantms uses different identification scores to determinate the quality of the identification. `IdScores` in quantms try to capture multiple scores from different workflows such as the `Comet:xcorr` or `DIA-NN:Q.Value`. The identification scores are stored as a key/value pair where the key is the name of the score (is RECOMMENDED to use HUPO-PSI MS ontology) and the value is the score value. This concept is used in the following outputs:
Every workflow within quantms uses different identification scores to determine the quality of an identification. `additional_scores` in quantms tries to capture multiple scores from different workflows, such as `Comet:xcorr` or `DIA-NN:Q.Value`. The identification scores are stored as key/value pairs, where the key is the name of the score (it is RECOMMENDED to use the HUPO-PSI MS ontology) and the value is the score value. For example:

- `["'Comet:xcorr':67.8", "'DIA-NN:Q.Value':0.01"]`

@@ -113,6 +113,15 @@ This concept is used in the following outputs:
- <<feature>>
- <<peptide>>

[[cv-terms]]
=== Controlled Vocabulary Terms

The <<psm>> and <<feature>> views use controlled vocabulary terms to describe the data. Controlled vocabulary terms standardize the data and make it easier to integrate with other datasets. They are stored as key/value pairs, where the key is the name of the controlled vocabulary term and the value is the term value. For example:

- `["ms level": "2", "deconvoluted data": null]`

The name/key of the controlled vocabulary MUST be provided; the value is optional.
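
The rule above (name required, value optional) can be checked with a few lines of Python. This is only a minimal sketch: it assumes each term is represented as a record with `name` and `value` fields, as in the `cv_params` field of `feature.avsc`, and the function name `validate_cv_params` is hypothetical, not part of quantms.io.

```python
from typing import Optional


def validate_cv_params(cv_params: list[dict[str, Optional[str]]]) -> list[str]:
    """Return a list of problems found; an empty list means all terms are valid.

    Assumption: each CV term is a dict with a 'name' key and an optional 'value' key.
    """
    problems = []
    for i, param in enumerate(cv_params):
        name = param.get("name")
        # The name/key MUST be provided and must not be blank.
        if not isinstance(name, str) or not name.strip():
            problems.append(f"cv_params[{i}]: 'name' MUST be provided")
        # The value is optional, but if present it is expected to be a string.
        value = param.get("value")
        if value is not None and not isinstance(value, str):
            problems.append(f"cv_params[{i}]: 'value' should be a string or null")
    return problems


# Hypothetical usage with the two terms from the example above.
terms = [
    {"name": "ms level", "value": "2"},
    {"name": "deconvoluted data", "value": None},
]
print(validate_cv_params(terms))
```

Returning a list of problems instead of raising on the first one makes it easier to report every invalid term in a file at once.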

[[serialization]]
== Serialization formats

@@ -164,7 +173,18 @@ The structure of the version is as follows `{major release}.{minor update}`: The
- All views and serialization formats will have a version number in the way: `quantmsio_version: {}`. This will help to identify the version of the specification used to generate the file.
- Major release changes will be backward incompatible, while minor updates will be backward compatible.
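
One way to read the compatibility rule is sketched below. It assumes version strings follow the `{major release}.{minor update}` structure described above; the function names `parse_version` and `can_read` are hypothetical, and the interpretation (a reader handles files with the same major release and an equal or older minor update) is an assumption, not something the specification states.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Split a '{major}.{minor}' string into integers.

    Raises ValueError if the string does not have exactly two numeric parts.
    """
    major, minor = version.strip().split(".")
    return int(major), int(minor)


def can_read(reader_version: str, file_version: str) -> bool:
    """Assumed rule: same major release, and the file's minor update is not newer."""
    reader_major, reader_minor = parse_version(reader_version)
    file_major, file_minor = parse_version(file_version)
    return reader_major == file_major and file_minor <= reader_minor


# Hypothetical usage: a reader built for 1.2 opening a file written with 1.0.
print(can_read("1.2", "1.0"))
```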

[[psm]]
[[software]]
== Software provider

The data within quantms.io is mainly generated by the https://github.com/bigbio/quantms[quantms workflow]. However, the format is open and can be used by any software provider that wants to generate data in this format. The software provider and the version of the software used to generate the data are stored in the project view <<project>> as:

[source,json]
----
"software_provider": {
"name": "quantms",
"version": "1.3.0"
}
----

[[project]]
== Project quantms.io
@@ -190,8 +210,8 @@ The project view is the file that stores the metadata of the entire `quantms.io`
| `acquisition_properties` | Properties of the data acquisition methods | list[key/value]
| `quantms_files` | Files related to quantMS analysis | list[key/value]
| `quantmsio_version` | Version of the `quantms.io` | String
| `quantmsversion` | Version of the quantms workflow | String
| `comments` | Additional comments or notes | List of Strings
| `software_provider` | The software used to generate the data <<software>> | Key/Value
| `comments` | Additional comments or notes | List of Strings
|===

Key/Value pair object: The key/value pairs are used to store the acquisition properties, and the quantms files.
@@ -208,21 +228,67 @@ Example of ``AcquisitionProperties``:

=== Project files

Recommendations for the file name in the quantms project.
In the current version <<version>>, the files within a project are optional. Files within a project should be listed in `quantms_files`; for every file, the following information is required:

- file_name: The name of the file or folder.
- is_folder: A boolean value that indicates if the file is a folder or not.
- partition_fields: The fields that are used to partition the data in the file. This is used to optimize the data retrieval and filtering of the data. This field is optional.

NOTE: Parquet files can be stored as folders when the data is partitioned by some fields. For example, a parquet file that is partitioned by the `sample_accession` field will be stored as a folder with the name of the field and the value of the field.

Example of ``quantms_files``:

[source,json]
----
"quantms_files": [
{"psm_file": ["PXD004683-550e8400-e29b-41d4.1.psm.parquet",
"PXD004683-550e8400-e29b-41d4.2.psm.parquet"
]},
{"feature_file": ["PXD004683-958e8400-e29b-41f4.feature.parquet"]},
{"differential_file": ["PXD004683-a716.differential.tsv"]},
{"absolute_file": ["PXD004683-e29b-41f4-a716.absolute.tsv"]},
{"sdrf_file": ["PXD004683-e29b-41f4-a716.sdrf.tsv"]}
]
{
"quantms_files": [
{
"psm_file": [
{
"file_name": "PXD004683-550e8400-e29b-41d4.1.psm.parquet",
"is_folder": false
},
{
"file_name": "PXD004683-550e8400-e29b-41d4.2.psm.parquet",
"is_folder": false
}
]
},
{
"feature_file": [
{
"file_name": "PXD004683-958e8400-e29b-41f4.feature.parquet",
"is_folder": true,
"partition_fields": ["sample_accession"]
}
]
},
{
"differential_file": [
{
"file_name": "PXD004683-a716.differential.tsv",
"is_folder": false
}
]
},
{
"absolute_file": [
{
"file_name": "PXD004683-e29b-41f4-a716.absolute.tsv",
"is_folder": false
}
]
},
{
"sdrf_file": [
{
"file_name": "PXD004683-e29b-41f4-a716.sdrf.tsv",
"is_folder": false
}
]
}
]
}
----
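
A consumer of the project view could flatten `quantms_files` into one row per file and check the required fields along the way. The following is a minimal sketch, assuming the nested layout shown in the example above; the function name `list_project_files` is hypothetical and not part of any quantms.io library.

```python
REQUIRED_FIELDS = ("file_name", "is_folder")


def list_project_files(project: dict) -> list[tuple[str, str, bool, list[str]]]:
    """Flatten `quantms_files` into (category, file_name, is_folder, partition_fields) rows.

    Raises ValueError when an entry lacks one of the required fields.
    """
    rows = []
    for group in project.get("quantms_files", []):
        # Each group maps one category (e.g. "psm_file") to a list of file entries.
        for category, entries in group.items():
            for entry in entries:
                missing = [key for key in REQUIRED_FIELDS if key not in entry]
                if missing:
                    raise ValueError(f"{category}: entry {entry!r} is missing {missing}")
                rows.append((
                    category,
                    entry["file_name"],
                    bool(entry["is_folder"]),
                    # partition_fields is optional, so default to an empty list.
                    list(entry.get("partition_fields", [])),
                ))
    return rows


# Hypothetical usage with a shortened version of the example above.
project = {
    "quantms_files": [
        {"psm_file": [{"file_name": "PXD004683-550e8400-e29b-41d4.1.psm.parquet", "is_folder": False}]},
        {"feature_file": [{
            "file_name": "PXD004683-958e8400-e29b-41f4.feature.parquet",
            "is_folder": True,
            "partition_fields": ["sample_accession"],
        }]},
    ]
}
for row in list_project_files(project):
    print(row)
```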

Example:
@@ -273,13 +339,45 @@ Example:
{"precursor mass tolerance": "20 ppm"},
{"fragment mass tolerance": "0.6 Da"}
],
"quantms_files": [
{"feature_file": ["PXD014414.feature.parquet"]},
{"sdrf_file": ["PXD014414.sdrf.tsv"]},
{"psm_file": ["PXD014414-f4fb88f6.psm.parquet"]},
{"differential_file": ["PXD014414-3026e5d5.differential.tsv"]}
],
"quantms_version": "1.2.0",
"quantms_files": [
{
"feature_file": [
{
"file_name": "PXD014414.feature.parquet",
"is_folder": false
}
]
},
{
"sdrf_file": [
{
"file_name": "PXD014414.sdrf.tsv",
"is_folder": false
}
]
},
{
"psm_file": [
{
"file_name": "PXD014414-f4fb88f6.psm.parquet",
"is_folder": false
}
]
},
{
"differential_file": [
{
"file_name": "PXD014414-3026e5d5.differential.tsv",
"is_folder": false
}
]
}
],
"software_provider": {
"name": "quantms",
"version": "1.3.0"
},
"quantmsio_version": "1.0",
"comments": []
}
@@ -828,6 +926,11 @@ Some of the fields are shared between the <<feature>> and <<psm>> views, they ca
| -
|===

[[diann-scan]]
===== DIANN Scan

The `DIA-NN` scan is a string that contains the scan number of the MS2 used to identify the peptide. We use the `rt` field and the mzML information to get that number.
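
The lookup can be pictured as a nearest-retention-time search. The sketch below is an illustration only: it assumes the MS2 retention times and scan identifiers have already been read from the mzML file into a list sorted by retention time, and it assumes that the closest MS2 spectrum is the one wanted. The text above does not state the actual matching rule, and the function name `nearest_ms2_scan` is hypothetical.

```python
from bisect import bisect_left
from typing import Optional


def nearest_ms2_scan(rt: float, ms2_index: list[tuple[float, str]]) -> Optional[str]:
    """Return the scan of the MS2 spectrum whose retention time is closest to `rt`.

    `ms2_index` holds (rt_in_seconds, scan) pairs sorted by retention time
    (assumed to be extracted from the mzML file beforehand).
    """
    if not ms2_index:
        return None
    rts = [spectrum_rt for spectrum_rt, _ in ms2_index]
    pos = bisect_left(rts, rt)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = ms2_index[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda item: abs(item[0] - rt))[1]


# Hypothetical usage with three MS2 spectra.
index = [(10.0, "1"), (20.0, "2"), (30.0, "3")]
print(nearest_ms2_scan(19.0, index))
```

In practice a tolerance on the retention time difference would probably be needed, so that a feature with no nearby MS2 spectrum is not assigned an unrelated scan.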

==== Format

The feature view can be found in link:feature.avsc[feature.avsc].
51 changes: 40 additions & 11 deletions docs/feature.avsc
@@ -7,27 +7,27 @@
{"name": "modifications", "type": ["null", {"type": "array", "items": "string"}], "doc": "List of modifications as string array, easy for search and filter"},
{"name": "modification_details","type": ["null", {"type": "array", "items": "string"}],
"doc": "List of alternative site probabilities for the modification format: read the specification for more details"},
{"name": "posterior_error_probability", "type": ["null", "double"], "doc": "Posterior error probability for the given peptide spectrum match"},
{"name": "global_qvalue", "type": ["null", "double"], "doc": "Global q-value of the peptide or psm at the level of the experiment"},
{"name": "posterior_error_probability", "type": ["null", "float32"], "doc": "Posterior error probability for the given peptide spectrum match"},
{"name": "global_qvalue", "type": ["null", "float32"], "doc": "Global q-value of the peptide or psm at the level of the experiment"},

{"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
{"name": "calculated_mz", "type": "double", "doc": "Theoretical peptide mass-to-charge ratio based on identified sequence and modifications"},
{"name": "additional_scores", "type": {"type": "array","items": { "type": "struct", "field": { "name": "string", "value": "double"}}} ,
{"name": "calculated_mz", "type": "float32", "doc": "Theoretical peptide mass-to-charge ratio based on identified sequence and modifications"},
{"name": "additional_scores", "type": {"type": "array","items": { "type": "struct", "field": { "name": "string", "value": "float32"}}} ,
"doc": "List of structures, each structure contains two fields: name and value."},

{"name": "pg_accessions", "type": ["null", {"type": "array","items": "string"}], "doc": "Protein group accessions of all the proteins that the peptide maps to"},
{"name": "pg_positions", "type": ["null", {"type": "array","items": "string"}], "doc": "Protein start and end positions written as start_post:end_post"},
{"name": "unique", "type": ["null", "int"], "doc": "Unique peptide indicator, if the peptide maps to a single protein, the value is 1, otherwise 0"},
{"name": "protein_global_qvalue", "type": ["null", "double"], "doc": "Global q-value of the protein group at the experiment level"},
{"name": "protein_global_qvalue", "type": ["null", "float32"], "doc": "Global q-value of the protein group at the experiment level"},
{"name": "gene_accessions", "type": ["null", {"type": "array", "items": "string"}], "doc": "Gene accessions, as string array"},
{"name": "gene_names", "type": ["null", {"type": "array", "items": "string"}], "doc": "Gene names, as string array"},

{"name": "precursor_charge", "type": "int", "doc": "Precursor charge"},
{"name": "observed_mz", "type": "double", "doc": "Experimental peptide mass-to-charge ratio of identified peptide (in Da)"},
{"name": "rt", "type": ["null", "double"], "doc": "MS2 scan’s precursor retention time (in seconds)"},
{"name": "predicted_rt", "type": ["null", "double"], "doc": "Predicted retention time of the peptide (in seconds)"},
{"name": "observed_mz", "type": "float32", "doc": "Experimental peptide mass-to-charge ratio of identified peptide (in Da)"},
{"name": "rt", "type": ["null", "float32"], "doc": "MS2 scan’s precursor retention time (in seconds)"},
{"name": "predicted_rt", "type": ["null", "float32"], "doc": "Predicted retention time of the peptide (in seconds)"},

{"name": "intensity", "type": ["null","double"], "doc": "The intensity-based abundance of the peptide in the sample"},
{"name": "intensity", "type": ["null","float32"], "doc": "The intensity-based abundance of the peptide in the sample"},

{"name": "sample_accession", "type": ["null","string"], "doc": "The sample accession in the SDRF, which column is called source name"},
{"name": "condition", "type": ["null","string"], "doc": "The value for the factor value column in the SDRF, for example, the tissue factor value[organism part]"},
@@ -39,7 +39,36 @@

{"name": "psm_reference_file_name", "type": ["null","string"], "doc": "The reference file containing the best psm that identified the feature. Note: This file can be different from the file that contains the feature ().ReferenceFile"},
{"name": "psm_scan_number", "type": ["null","string"], "doc": "The scan number of the spectrum. The scan number or index of the spectrum in the file"},
{"name": "rt_start", "type": ["null", "double"], "doc": "Start of the retention time window for feature"},
{"name": "rt_stop", "type": ["null", "double"], "doc": "End of the retention time window for feature"},
{"name": "rt_start", "type": ["null", "float32"], "doc": "Start of the retention time window for feature"},
{"name": "rt_stop", "type": ["null", "float32"], "doc": "End of the retention time window for feature"},
{
"name": "cv_params",
"type": [
"null",
{
"type": "array",
"items": {
"type": "record",
"name": "cv_param",
"doc": "Controlled vocabulary (CV) parameters providing additional metadata for the scan.",
"fields": [
{
"name": "name",
"type": "string",
"doc": "Name of the CV term (e.g., from PSI-MS or other ontologies)."
},
{
"name": "value",
"type": "string",
"doc": "Value associated with the CV term."
}
]
}
}
],
"default": null,
"doc": "Optional list of CV parameters for additional metadata."
},
{"name": "quantmsio_version", "type": "string", "doc": "The version of quantms.io"}
]
}
17 changes: 9 additions & 8 deletions docs/peptide.avsc
@@ -7,24 +7,25 @@
{"name": "modifications", "type": ["null", {"type": "array", "items": "string"}], "doc": "List of modifications as string array, easy for search and filter"},
{"name": "modification_details","type": ["null", {"type": "array", "items": "string"}],
"doc": "List of alternative site probabilities for the modification format: read the specification for more details"},
{"name": "posterior_error_probability", "type": ["null", "double"], "doc": "Posterior error probability for the given peptide spectrum match"},
{"name": "global_qvalue", "type": ["null", "double"], "doc": "Global q-value of the peptide or psm at the level of the experiment"},
{"name": "posterior_error_probability", "type": ["null", "float32"], "doc": "Posterior error probability for the given peptide spectrum match"},
{"name": "global_qvalue", "type": ["null", "float32"], "doc": "Global q-value of the peptide or psm at the level of the experiment"},

{"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
{"name": "calculated_mz", "type": "double", "doc": "Theoretical peptide mass-to-charge ratio based on identified sequence and modifications"},
{"name": "additional_scores", "type": {"type": "array","items": { "type": "struct", "field": { "name": "string", "value": "double"}}} ,
{"name": "calculated_mz", "type": "float32", "doc": "Theoretical peptide mass-to-charge ratio based on identified sequence and modifications"},
{"name": "additional_scores", "type": {"type": "array","items": { "type": "struct", "field": { "name": "string", "value": "float32"}}} ,
"doc": "List of structures, each structure contains two fields: name and value."},

{"name": "pg_accessions", "type": ["null",{"type": "array","items": "string"}], "doc": "Protein group accessions of all the proteins that the peptide maps to"},
{"name": "pg_positions", "type": ["null",{"type": "array","items": "string"}], "doc": "Protein start and end positions written as start_post:end_post"},
{"name": "unique", "type": ["null", "int"], "doc": "Unique peptide indicator, if the peptide maps to a single protein, the value is 1, otherwise 0"},
{"name": "protein_global_qvalue", "type": ["null", "double"], "doc": "Global q-value of the protein group at the experiment level"},
{"name": "protein_global_qvalue", "type": ["null", "float32"], "doc": "Global q-value of the protein group at the experiment level"},
{"name": "gene_accessions", "type": ["null", {"type": "array", "items": "string"}], "doc": "Gene accessions, as string array"},
{"name": "gene_names", "type": ["null", {"type": "array", "items": "string"}], "doc": "Gene names, as string array"},

{"name": "precursor_charge", "type": "int", "doc": "Precursor charge"},
{"name": "observed_mz", "type": "double", "doc": "Experimental peptide mass-to-charge ratio of identified peptide (in Da)"},
{"name": "rt", "type": ["null", "double"], "doc": "MS2 scan’s precursor retention time (in seconds)"},
{"name": "predicted_rt", "type": ["null", "double"], "doc": "Predicted retention time of the peptide (in seconds)"},
{"name": "observed_mz", "type": "float32", "doc": "Experimental peptide mass-to-charge ratio of identified peptide (in Da)"},
{"name": "rt", "type": ["null", "float32"], "doc": "MS2 scan’s precursor retention time (in seconds)"},
{"name": "predicted_rt", "type": ["null", "float32"], "doc": "Predicted retention time of the peptide (in seconds)"},
{"name": "quantmsio_version", "type": "string", "doc": "The version of quantms.io"}
]
}
9 changes: 5 additions & 4 deletions docs/protein.avsc
@@ -4,15 +4,16 @@
"fields": [
{"name": "quantmsio_version", "type": "string", "doc": "The version of the quantms.io specification"},
{"name": "protein_accessions", "type": {"type": "array","items": "string"}},
{"name": "abundance","type": ["null", "double"]},
{"name": "abundance","type": ["null", "float32"]},
{"name": "sample_accession", "type": "string", "doc": "The sample accession in the SDRF, which column is called source name"},
{"name": "global_qvalue","type": ["null", "double"] "doc": "The global qvalue for a given protein or protein groups"},
{"name": "global_qvalue","type": ["null", "float32"], "doc": "The global qvalue for a given protein or protein groups"},
{"name": "is_decoy","type": ["null", "int"], "doc": "If the protein is decoy"},
{"name": "best_id_score", "type": "string", "doc": "The best search engine score for the identification"},
{"name": "gene_accessions", "type": ["null", {"type": "array","items": "string"}], "doc": "The gene accessions corresponding to every protein"},
{"name": "gene_names", "type": ["null", {"type": "array","items": "string"}] "doc": "The gene names corresponding to every protein"},
{"name": "gene_names", "type": ["null", {"type": "array","items": "string"}], "doc": "The gene names corresponding to every protein"},
{"name": "number_peptides","type": ["null", "int"], "doc": "The total number of peptides for a given protein"},
{"name": "number_psms","type": ["null", "int"], "doc": "The total number of peptide spectrum matches"},
{"name": "number_nique_peptides","type": ["null", "int"], "doc": "The total number of unique peptides"},
{"name": "number_unique_peptides","type": ["null", "int"], "doc": "The total number of unique peptides"},
{"name": "quantmsio_version", "type": "string", "doc": "The version of quantms.io"}
]
}