AlexsLemonade · sjspielman · Nov 15, 2023 · Nov 8, 2023 · Nov 8, 2023 · Nov 8, 2023
diff --git a/components/dictionary.txt b/components/dictionary.txt
@@ -5,6 +5,7 @@ alevin
 Alevin
 APA
 AnnData
+Aran
 barcode
 barcodes
 basepairs
@@ -31,6 +32,7 @@ Ensembl
 et
 fastp
 FASTQ
+Franzén
 GC
 Genomics
 github
@@ -82,6 +84,7 @@ Stegle
 Stoeckius
 subdiagnosis
 submitter
+submitters
 Tian
 transcriptome
 transcriptomics

diff --git a/docs/getting_started.md b/docs/getting_started.md
@@ -182,7 +182,7 @@ processed_adata.uns["highly_variable_genes"]
 ### Clustering
 
 Cluster assignments obtained from [Graph-based clustering](http://bioconductor.org/books/3.16/OSCA.basic/clustering.html#clustering-graph) is also available in the processed objects.
-Here clustering was performed using the Louvain algorithm with 20 nearest neighbors and Jaccard weighting.
+By default, clustering is performed using the Louvain algorithm with 20 nearest neighbors and Jaccard weighting.
 
 To access the cluster assignments in the `SingleCellExperiment` object, use the following command:
 
@@ -203,6 +203,129 @@ See these resources for more information on clustering:
  - [Quantifying clustering behavior in Orchestrating Single Cell Analysis](https://bioconductor.org/books/release/OSCA.advanced/clustering-redux.html#quantifying-clustering-behavior)
 
 
+### Cell type annotation
+
+Processed objects may contain cell type annotations and associated metadata from one or more of the following sources.
+
+- Submitter-provided annotations (note that these are only present for a subset of libraries).
+- Annotations determined by [`SingleR`](https://bioconductor.org/packages/release/bioc/html/SingleR.html), an automated reference-based method ([Aran _et al._ 2019](https://doi.org/10.1038/s41590-018-0276-y)).
+- Annotations determined by [`CellAssign`](https://github.com/Irrationone/cellassign), an automated marker-gene based method ([Zhang _et al._ 2019](https://doi.org/10.1038/s41592-019-0529-1)).
+
+If at least one method was used for cell type annotation, a supplemental cell type report will be provided with the download.
+This report evaluates cell type annotation results as follows:
+
+- It provides diagnostic plots to assess the quality of cell type annotations.
+- If multiple annotations are present, the report compares different annotations to one another.
+Strong agreement between different annotation methods is a qualitative indicator of robustness.
+
+To determine which methods were used for cell type annotations, use the following command on the processed `SingleCellExperiment` object:
+
+```r
+# show vector of cell type annotation methods used
+# values will be one or more of: `submitter`, `singler`, `cellassign`
+metadata(processed_sce)$celltype_methods
+```
+
+Or, on the processed `AnnData` object:
+
+```python
+# show list of cell type annotation methods used
+# values will be one or more of: `submitter`, `singler`, `cellassign`
+processed_adata.uns["celltype_methods"]
+```
+
+Below we provide instructions on how to access annotations from each cell type annotation method used, if present.
+
+Note that additional information about `SingleR` and `CellAssign` annotation, including their respective reference source and versions, is also available from the processed `SingleCellExperiment` object's metadata and from the processed `AnnData` object's `uns` slot, as described in the {ref}`experiment metadata table<sce_file_contents:SingleCellExperiment experiment metadata>`.
+
+#### Submitter-provided annotations
+
+To access submitter-provided annotations, if available, in the `SingleCellExperiment`, use the following command:
+
+```r
+# submitter-provided annotations for each cell
+processed_sce$submitter_celltype_annotation
+```
+
+To access submitter-provided annotations in the `AnnData` object, use the following command:
+
+```python
+# submitter-provided annotations for each cell
+processed_adata.obs["submitter_celltype_annotation"]
+```
+
+Cells that submitters did not annotate are labeled with `Submitter-excluded`.
+Note that submitter-provided annotations are also present in unfiltered and filtered objects and can be accessed using the same approach shown here for processed objects.
+
+
+#### `SingleR` annotations
+
+`SingleR` annotation uses a reference dataset from the [`celldex` package](https://bioconductor.org/packages/release/data/experiment/html/celldex.html) [[Aran _et al._ (2019)](https://doi.org/10.1038/s41590-018-0276-y)].
+
+
+To access automated `SingleR` annotations as cell type names and/or ontology terms in the processed `SingleCellExperiment` object, use the following command(s):
+
+```r
+# SingleR annotations as cell type names
+processed_sce$singler_celltype_annotation
+
+# Or, SingleR annotations as cell type ontology terms
+processed_sce$singler_celltype_ontology
+```
+
+To access automated `SingleR` annotations as cell type names and/or ontology terms in the processed `AnnData` object, use the following command(s):
+
+```python
+# SingleR annotations as cell type names
+processed_adata.obs["singler_celltype_annotation"]
+
+# Or, SingleR annotations as cell type ontology terms
+processed_adata.obs["singler_celltype_ontology"]
+```
+
+Cells that `SingleR` could not confidently annotate are labeled with `NA`.
+
+You can also access the full object returned by `SingleR` from the `SingleCellExperiment`'s metadata with the following command (note that this information is not provided in the `AnnData` object):
+
+```r
+# SingleR full result
+metadata(processed_sce)$singler_results
+```
+
+#### `CellAssign` annotations
+
+`CellAssign` annotation uses a reference set of marker genes from the [`PanglaoDB` database](https://panglaodb.se/) [[Franzén _et al._ (2019)](https://doi.org/10.1093/database/baz046)], as compiled by the Data Lab for a given tissue group.
+
+To access automated `CellAssign` annotations in the `SingleCellExperiment`, use the following command:
+
+```r
+# CellAssign annotations for each cell
+processed_sce$cellassign_celltype_annotation
+```
+
+To access automated `CellAssign` annotations in the `AnnData` object, use the following command:
+
+```python
+# CellAssign annotations for each cell
+processed_adata.obs["cellassign_celltype_annotation"]
+```
+
+Cells that `CellAssign` could not confidently annotate are labeled with `"other"`.
+
+You can also access the full predictions matrix returned by `CellAssign` from the `SingleCellExperiment`'s metadata with the following command:
+
+```r
+# CellAssign full predictions matrix
+metadata(processed_sce)$cellassign_predictions
+```
+
+#### Additional cell type resources
+
+See these resources for more information on automated cell type annotation:
+
+- [Assigning cell types with `SingleR`](https://bioconductor.org/books/release/SingleRBook/)
+- [Cell type annotation chapter in Orchestrating Single Cell Analysis](https://bioconductor.org/books/3.17/OSCA.basic/cell-type-annotation.html)
+
 ## What if I want to use Seurat?
 
 The files available for download that contain [`SingleCellExperiment` objects](http://bioconductor.org/books/3.13/OSCA.intro/the-singlecellexperiment-class.html) can also be converted into Seurat objects.
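
The access patterns added to `docs/getting_started.md` above can be combined into one small helper. The following is a minimal sketch, not part of the patch: `summarize_celltypes` is a hypothetical name, and the toy object is a stand-in that only mimics the `uns` and `obs` lookups of an `AnnData` object. It assumes unannotated cells carry the documented labels as plain strings; in a real object, `SingleR`'s `NA` may instead be stored as a missing value.

```python
from collections import Counter
from types import SimpleNamespace

# Column names documented for each annotation method
ANNOTATION_COLUMNS = {
    "submitter": "submitter_celltype_annotation",
    "singler": "singler_celltype_annotation",
    "cellassign": "cellassign_celltype_annotation",
}

# Labels documented for cells without an annotation.
# Assumption: these are stored as plain strings.
UNANNOTATED_LABELS = {
    "submitter": "Submitter-excluded",
    "singler": "NA",
    "cellassign": "other",
}


def summarize_celltypes(adata):
    """Count cell type labels for each annotation method that was used."""
    summary = {}
    # Only methods listed in `celltype_methods` have annotation columns
    for method in adata.uns.get("celltype_methods", []):
        labels = [str(label) for label in adata.obs[ANNOTATION_COLUMNS[method]]]
        counts = Counter(labels)
        # Report unannotated cells separately from named cell types
        unannotated = counts.pop(UNANNOTATED_LABELS[method], 0)
        summary[method] = {"counts": dict(counts), "unannotated": unannotated}
    return summary


# Toy stand-in for a processed AnnData object (illustrative values only)
toy_adata = SimpleNamespace(
    uns={"celltype_methods": ["singler", "cellassign"]},
    obs={
        "singler_celltype_annotation": ["T cell", "NA", "B cell", "T cell"],
        "cellassign_celltype_annotation": ["T cell", "other", "other", "T cell"],
    },
)

summary = summarize_celltypes(toy_adata)
for method, result in summary.items():
    print(method, result["counts"], "unannotated:", result["unannotated"])
```

Checking `celltype_methods` first avoids a lookup error on libraries where a given method was not run, which is why the helper never touches a column for a method that is not listed.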

diff --git a/docs/sce_file_contents.md b/docs/sce_file_contents.md
@@ -137,10 +137,15 @@ metadata(sce) # experiment metadata
 | `cluster_nn`        | The nearest neighbor parameter value used for the graph-based clustering. Only present for `processed` objects |
 | `celltype_methods` | If cell type annotation was performed, a vector of the methods used for annotation. May include `"submitter"`, `"singler"` and/or `"cellassign"`. If submitter cell-type annotations are available, this metadata item will be present in all objects. Otherwise, this item will only be in `processed` objects |
 | `singler_results` | If cell typing with `SingleR` was performed, the full result object returned by `SingleR` annotation. Only present for `processed` objects |
-| `singler_reference` | If cell typing with `SingleR` was performed, the name of the [`celldex`](http://bioconductor.org/packages/release/data/experiment/html/celldex.html) reference dataset used for annotation. Only present for `processed` objects |
+| `singler_reference` | If cell typing with `SingleR` was performed, the name of the reference dataset used for annotation. Only present for `processed` objects |
 | `singler_reference_label` | If cell typing with `SingleR` was performed, the name of the label in the reference dataset used for annotation. Only present for `processed` objects |
+| `singler_reference_source`  | If cell typing with `SingleR` was performed, the source of the reference dataset (default is [`celldex`](http://bioconductor.org/packages/release/data/experiment/html/celldex.html)). Only present for `processed` objects |
+| `singler_reference_version`  | If cell typing with `SingleR` was performed, the version of `celldex` used to create the reference dataset source, with periods replaced by dashes (`-`). Only present for `processed` objects |
 | `cellassign_predictions` | If cell typing with `CellAssign` was performed, the full matrix of predictions across cells and cell types. Only present for `processed` objects |
-| `cellassign_reference` | If cell typing with `CellAssign` was performed, the name of the organ/tissue type for which marker genes were obtained from `PanglaoDB`. Only present for `processed` objects |
+| `cellassign_reference` | If cell typing with `CellAssign` was performed, the name of the organ/tissue grouping for which marker genes were obtained. Only present for `processed` objects |
+| `cellassign_reference_source`  | If cell typing with `CellAssign` was performed, the source of the reference dataset (default is [`PanglaoDB`](https://panglaodb.se/)). Only present for `processed` objects |
+| `cellassign_reference_version`  | If cell typing with `CellAssign` was performed, the version of the reference dataset source. For references obtained from `PanglaoDB`, the version scheme is a date in ISO 8601 format. Only present for `processed` objects |
+
 
 ### SingleCellExperiment sample metadata
 
@@ -162,12 +167,12 @@ The following columns are included in the sample metadata data frame for all lib
 | `tissue_location` | Where in the body the tumor sample was located                 |
 | `disease_timing`  | At what stage of disease the sample was obtained, either diagnosis or recurrence |
 | `organism`         | The organism the sample was obtained from (e.g., `Homo_sapiens`) |
-| `development_stage_ontology_term_id` | [`HsapDv` ontology](http://obofoundry.org/ontology/hsapdv.html) term indicating developmental stage. If unavailable, `unknown` is used.  |
-| `sex_ontology_term_id` | [`PATO`](http://obofoundry.org/ontology/pato.html) term referring to the sex of the sample. If unavailable, `unknown` is used. |
-| `organism_ontology_id` | [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) term for organism, e.g. [`NCBITaxon:9606`](http://purl.obolibrary.org/obo/NCBITaxon_9606). |
-| `self_reported_ethnicity_ontology_term_id` | For _Homo sapiens_, a [`Hancestro` term](http://obofoundry.org/ontology/hancestro.html). `multiethnic` indicates more than one ethnicity is reported. `unknown` indicates unavailable ethnicity and `NA` is used for all other organisms.  |
-| `disease_ontology_term_id` | [`MONDO`](http://obofoundry.org/ontology/mondo.html) term indicating disease type. [`PATO:0000461`](http://purl.obolibrary.org/obo/PATO_0000461) indicates normal or healthy tissue. If unavailable, `NA` is used.  |
-| `tissue_ontology_term_id`| [`UBERON`](http://obofoundry.org/ontology/uberon.html) term indicating tissue of origin. If unavailable, `NA` is used. |
+| `development_stage_ontology_term_id` | [`HsapDv` ontology](http://obofoundry.org/ontology/hsapdv.html) term indicating developmental stage. If unavailable, `unknown` is used  |
+| `sex_ontology_term_id` | [`PATO`](http://obofoundry.org/ontology/pato.html) term referring to the sex of the sample. If unavailable, `unknown` is used |
+| `organism_ontology_id` | [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) term for organism, e.g. [`NCBITaxon:9606`](http://purl.obolibrary.org/obo/NCBITaxon_9606) |
+| `self_reported_ethnicity_ontology_term_id` | For _Homo sapiens_, a [`HANCESTRO` term](http://obofoundry.org/ontology/hancestro.html). `multiethnic` indicates more than one ethnicity is reported. `unknown` indicates unavailable ethnicity and `NA` is used for all other organisms  |
+| `disease_ontology_term_id` | [`MONDO`](http://obofoundry.org/ontology/mondo.html) term indicating disease type. [`PATO:0000461`](http://purl.obolibrary.org/obo/PATO_0000461) indicates normal or healthy tissue. If unavailable, `NA` is used  |
+| `tissue_ontology_term_id`| [`UBERON`](http://obofoundry.org/ontology/uberon.html) term indicating tissue of origin. If unavailable, `NA` is used |
 
 For some libraries, the sample metadata may also include additional metadata specific to the disease type and experimental design of the project.
 Examples of this include treatment or outcome.
@@ -368,13 +373,13 @@ The `AnnData` object also includes the following additional cell-level metadata
 | `tissue_location` | Where in the body the tumor sample was located                 |
 | `disease_timing`  | At what stage of disease the sample was obtained, either diagnosis or recurrence |
 | `organism`         | The organism the sample was obtained from (e.g., `Homo_sapiens`) |
-| `development_stage_ontology_term_id` | [`HsapDv` ontology](http://obofoundry.org/ontology/hsapdv.html) term indicating developmental stage. If unavailable, `unknown` is used.  |
-| `sex_ontology_term_id` | [`PATO`](http://obofoundry.org/ontology/pato.html) term referring to the sex of the sample. If unavailable, `unknown` is used. |
-| `organism_ontology_id` | [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) term for organism, e.g. [`NCBITaxon:9606`](http://purl.obolibrary.org/obo/NCBITaxon_9606). |
-| `self_reported_ethnicity_ontology_term_id` | For _Homo sapiens_, a [`HANCESTRO` term](http://obofoundry.org/ontology/hancestro.html). `multiethnic` indicates more than one ethnicity is reported. `unknown` indicates unavailable ethnicity, and `NA` is used for all other organisms.  |
-| `disease_ontology_term_id` | [`Mondo`](http://obofoundry.org/ontology/mondo.html) term indicating disease type. [`PATO:0000461`](http://purl.obolibrary.org/obo/PATO_0000461) indicates normal or healthy tissue. If unavailable, `NA` is used.  |
-| `tissue_ontology_term_id`| [`Uberon`](http://obofoundry.org/ontology/uberon.html) term indicating tissue of origin. If unavailable, `NA` is used. |
-| `is_primary_data` | Set to `FALSE` for all libraries to reflect that all libraries were obtained from external investigators. Required by `CELLxGENE`.             |
+| `development_stage_ontology_term_id` | [`HsapDv` ontology](http://obofoundry.org/ontology/hsapdv.html) term indicating developmental stage. If unavailable, `unknown` is used  |
+| `sex_ontology_term_id` | [`PATO`](http://obofoundry.org/ontology/pato.html) term referring to the sex of the sample. If unavailable, `unknown` is used |
+| `organism_ontology_id` | [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) term for organism, e.g. [`NCBITaxon:9606`](http://purl.obolibrary.org/obo/NCBITaxon_9606) |
+| `self_reported_ethnicity_ontology_term_id` | For _Homo sapiens_, a [`HANCESTRO` term](http://obofoundry.org/ontology/hancestro.html). `multiethnic` indicates more than one ethnicity is reported. `unknown` indicates unavailable ethnicity, and `NA` is used for all other organisms  |
+| `disease_ontology_term_id` | [`Mondo`](http://obofoundry.org/ontology/mondo.html) term indicating disease type. [`PATO:0000461`](http://purl.obolibrary.org/obo/PATO_0000461) indicates normal or healthy tissue. If unavailable, `NA` is used  |
+| `tissue_ontology_term_id`| [`Uberon`](http://obofoundry.org/ontology/uberon.html) term indicating tissue of origin. If unavailable, `NA` is used |
+| `is_primary_data` | Set to `FALSE` for all libraries to reflect that all libraries were obtained from external investigators. Required by `CELLxGENE`             |
 
 
 ### AnnData gene information and metrics
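
The two version fields added to the experiment metadata table in `docs/sce_file_contents.md` use different encodings, so a reader may want to normalize them. The sketch below is not part of the patch. It assumes `singler_reference_version` is a dash-separated `celldex` version and that `cellassign_reference_version` is a calendar date in `YYYY-MM-DD` form when the source is `PanglaoDB`. The function names and the example values are hypothetical.

```python
from datetime import date


def celldex_version_from_metadata(version):
    """Restore the periods that were stored as dashes."""
    return version.replace("-", ".")


def panglao_version_from_metadata(version):
    """Parse an ISO 8601 calendar date (YYYY-MM-DD) into a date object."""
    return date.fromisoformat(version)


# Made-up example values, for illustration only
celldex_version = celldex_version_from_metadata("1-10-1")
panglao_version = panglao_version_from_metadata("2020-03-27")

print(celldex_version)
print(panglao_version.year)
```

Parsing the `PanglaoDB` version into a `date` makes it easy to compare reference versions across libraries, while the `celldex` version is left as a string because it is a package version rather than a date.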