From 38d2a69928ba42600d488c39a556b7187592c55e Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Thu, 16 Jan 2025 11:59:29 -0500 Subject: [PATCH 1/9] add draft for seurat section --- docs/troubleshooting-faq/faq.md | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/docs/troubleshooting-faq/faq.md b/docs/troubleshooting-faq/faq.md index d339e1870..b6cd3e31e 100644 --- a/docs/troubleshooting-faq/faq.md +++ b/docs/troubleshooting-faq/faq.md @@ -1,8 +1,8 @@ # Frequently asked questions -### Why didn't the sample/project I specified when running the [data download script](../getting-started/accessing-resources/getting-access-to-data.md#using-the-download-data-script) download? +### Why didn't the sample/project I specified when running the data download script download? -First, we recommend using the `--dryrun` flag when running the `download-data.py` script to check which files _would_ be downloaded. +First, we recommend using the `--dryrun` flag when running the [data download script](../getting-started/accessing-resources/getting-access-to-data.md#using-the-download-data-script) to check which files _would_ be downloaded. This will confirm that there is nothing wrong with your internet connection and that you are properly [logged into your AWS profile](../technical-setup/environment-setup/configure-aws-cli.md#logging-in-to-a-new-session). If running the script with `--dryrun` states that _only_ the `DATA_USAGE.md` file is being downloaded, this means the data files you are attempting to download do not exist. @@ -117,3 +117,19 @@ However, there may be circumstances when you want to use results from a module w In such cases, you will need to run the module yourself to generate the results. Instructions for [running the module](../contributing-to-analyses/analysis-modules/running-a-module.md), including its software and compute requirements, should be available in the module's main `README.md` file. After running the module, results will generally be stored in `analysis/{module name}/results`, and the module's documentation should describe the contents of result files. + + +### What if I want to use Seurat? + +While [data downloads](../getting-started/accessing-resources/getting-access-to-data.md) are only available in `SingleCellExperiment` and `AnnData` format, `Seurat` versions of all objects (in [v5 assay format](https://satijalab.org/seurat/articles/seurat5_essential_commands)) are also available for use. + +These files are part of the OpenScPCA results, associated with the module [`seurat-conversion`](https://github.com/AlexsLemonade/OpenScPCA-analysis/tree/main/analyses/seurat-conversion) which we wrote to convert the processed `SingleCellExperiment` objects to `Seurat` format. +For more information on obtaining result files, please refer to the documentation for [the `download-results.py` script](../getting-started/accessing-resources/getting-access-to-data.md#accessing-scpca-module-results). + +When working with these `Seurat` objects, please bear in mind the following: + +* They were _not_ processed with a `Seurat` pipeline. +They were processed using the same pipeline as all OpenScPCA objects (e.g., with `Bioconductor`), and then converted to a `Seurat` format + * Notably, they do contain the raw data counts, allowing you to perform normalization, dimension reduction, etc. with `Seurat` directly if you so choose +* To be more consistent with `Seurat` analysis pipelines, gene names in these objects use gene symbols rather than Ensembl ids + From e08e68335516425155e218b7077c37ca30c99b05 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Thu, 16 Jan 2025 12:06:50 -0500 Subject: [PATCH 2/9] Add draft of gene conversion section --- docs/troubleshooting-faq/faq.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/docs/troubleshooting-faq/faq.md b/docs/troubleshooting-faq/faq.md index b6cd3e31e..9f6601d77 100644 --- a/docs/troubleshooting-faq/faq.md +++ b/docs/troubleshooting-faq/faq.md @@ -130,6 +130,19 @@ When working with these `Seurat` objects, please bear in mind the following: * They were _not_ processed with a `Seurat` pipeline. They were processed using the same pipeline as all OpenScPCA objects (e.g., with `Bioconductor`), and then converted to a `Seurat` format - * Notably, they do contain the raw data counts, allowing you to perform normalization, dimension reduction, etc. with `Seurat` directly if you so choose + * Notably, they do contain the raw data counts, allowing you to perform normalization, dimension reduction, etc. with `Seurat` directly if you so choose * To be more consistent with `Seurat` analysis pipelines, gene names in these objects use gene symbols rather than Ensembl ids + +### The ScPCA data objects contain Ensembl ids, but I need gene symbols for my analysis. How should I perform this conversion? + +In an effort to keep this consistent across the OpenScPCA project, we provide functions to convert Ensembl ids to gene symbols in an R package we maintain called [`rOpenScPCA`](https://github.com/AlexsLemonade/rOpenScPCA/). + +This package has two particular functions to support this task: + +* `rOpenScPCA::sce_to_symbols()` + * This function converts row names in a `SingleCellExperiment` object from Ensembl ids to gene symbols +* `rOpenScPCA::ensembl_to_symbol()` + * This function converts a vector of Ensembl ids to a vector of gene symbols + +Please refer to these functions' help menus (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on how to use them. From 7d8c80446af02593becf804ac7eab7acc137bd05 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Thu, 16 Jan 2025 12:15:06 -0500 Subject: [PATCH 3/9] fix bad spacing I noticed --- analyses/hello-clusters/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/analyses/hello-clusters/README.md b/analyses/hello-clusters/README.md index b095a73c8..2f29f007f 100644 --- a/analyses/hello-clusters/README.md +++ b/analyses/hello-clusters/README.md @@ -56,10 +56,10 @@ renv::update("rOpenScPCA") ## Example notebooks 1. The `01_perform-evaluate-clustering.Rmd` notebook shows examples of: - - Performing clustering with `rOpenScPCA::calculate_clusters()` - - Evaluating clustering with `rOpenScPCA::calculate_silhouette()`, `rOpenScPCA::calculate_purity()`, and `rOpenScPCA::calculate_stability()` + - Performing clustering with `rOpenScPCA::calculate_clusters()` + - Evaluating clustering with `rOpenScPCA::calculate_silhouette()`, `rOpenScPCA::calculate_purity()`, and `rOpenScPCA::calculate_stability()` It also contains explanations for how to interpret cluster quality metrics. 2. The `02_compare-clustering-parameters.Rmd` notebook shows examples of: - - Performing clustering across a set of parameterizations with `rOpenScPCA::sweep_clusters()` - - Comparing and visualizing multiple sets of clustering results + - Performing clustering across a set of parameterizations with `rOpenScPCA::sweep_clusters()` + - Comparing and visualizing multiple sets of clustering results From 3f169840ede148c974acb130157f9a06b2624189 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Thu, 16 Jan 2025 12:15:32 -0500 Subject: [PATCH 4/9] add draft of clusters section --- docs/troubleshooting-faq/faq.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/docs/troubleshooting-faq/faq.md b/docs/troubleshooting-faq/faq.md index 9f6601d77..c41f49d70 100644 --- a/docs/troubleshooting-faq/faq.md +++ b/docs/troubleshooting-faq/faq.md @@ -146,3 +146,17 @@ This package has two particular functions to support this task: * This function converts a vector of Ensembl ids to a vector of gene symbols Please refer to these functions' help menus (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on how to use them. + +### I noticed there are cluster assignments in the processed data files. Should I use those or re-cluster the data? + +All ScPCA data objects contain cluster assignments which were [calculated using an automated pipeline](https://scpca.readthedocs.io/en/stable/processing_information.html#processed-gene-expression-data). +Because the clustering parameters used in this automated pipeline were not tailored to any given dataset, we do not recommend relying on these clusters for downstream analysis. +Instead, we strongly recommend re-clustering the data _and_ evaluating your cluster assignments before using them. + +To support clustering analysis and evaluation, we provide several functions in an R package we maintain called [`rOpenScPCA`](https://github.com/AlexsLemonade/rOpenScPCA/) to accomplish the following tasks: + +* Perform graph-based clustering +* Evaluate clustering results with quality control metrics +* Calculate several sets of clustering results across parameter space + +We also provide an OpenScPCA analysis module [`hello-clusters`](https://github.com/AlexsLemonade/OpenScPCA-analysis/tree/main/analyses/hello-clusters) with example notebooks demonstrating how to use clustering functionality in `rOpenScPCA`. From 96443ed5554a0f191539d48be3cd2a9937e488d9 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Thu, 16 Jan 2025 12:18:51 -0500 Subject: [PATCH 5/9] a few text cleanups --- docs/troubleshooting-faq/faq.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/docs/troubleshooting-faq/faq.md b/docs/troubleshooting-faq/faq.md index c41f49d70..f1ef90761 100644 --- a/docs/troubleshooting-faq/faq.md +++ b/docs/troubleshooting-faq/faq.md @@ -129,7 +129,7 @@ For more information on obtaining result files, please refer to the documentatio When working with these `Seurat` objects, please bear in mind the following: * They were _not_ processed with a `Seurat` pipeline. -They were processed using the same pipeline as all OpenScPCA objects (e.g., with `Bioconductor`), and then converted to a `Seurat` format +They were processed using the same pipeline as all OpenScPCA objects were (e.g., with `Bioconductor`), and then converted to `Seurat` format * Notably, they do contain the raw data counts, allowing you to perform normalization, dimension reduction, etc. with `Seurat` directly if you so choose * To be more consistent with `Seurat` analysis pipelines, gene names in these objects use gene symbols rather than Ensembl ids @@ -137,6 +137,7 @@ They were processed using the same pipeline as all OpenScPCA objects (e.g., with ### The ScPCA data objects contain Ensembl ids, but I need gene symbols for my analysis. How should I perform this conversion? In an effort to keep this consistent across the OpenScPCA project, we provide functions to convert Ensembl ids to gene symbols in an R package we maintain called [`rOpenScPCA`](https://github.com/AlexsLemonade/rOpenScPCA/). +Installation instructions are provided in the `rOpenScPCA` GitHub repository. This package has two particular functions to support this task: @@ -145,7 +146,7 @@ This package has two particular functions to support this task: * `rOpenScPCA::ensembl_to_symbol()` * This function converts a vector of Ensembl ids to a vector of gene symbols -Please refer to these functions' help menus (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on how to use them. +Please refer to these functions' help menus (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on their use. ### I noticed there are cluster assignments in the processed data files. Should I use those or re-cluster the data? @@ -156,7 +157,8 @@ Instead, we strongly recommend re-clustering the data _and_ evaluating your clus To support clustering analysis and evaluation, we provide several functions in an R package we maintain called [`rOpenScPCA`](https://github.com/AlexsLemonade/rOpenScPCA/) to accomplish the following tasks: * Perform graph-based clustering -* Evaluate clustering results with quality control metrics -* Calculate several sets of clustering results across parameter space +* Evaluate clustering results with several quality control metrics +* Calculate different sets of clustering results across parameter space in order to identify an optimal clustering scheme We also provide an OpenScPCA analysis module [`hello-clusters`](https://github.com/AlexsLemonade/OpenScPCA-analysis/tree/main/analyses/hello-clusters) with example notebooks demonstrating how to use clustering functionality in `rOpenScPCA`. +This module also provides instructions on how to install `rOpenScPCA`. From e3f7cfcc5a5a0676a6048a733c96f51f82275086 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Thu, 16 Jan 2025 12:23:53 -0500 Subject: [PATCH 6/9] restore script name --- docs/troubleshooting-faq/faq.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/troubleshooting-faq/faq.md b/docs/troubleshooting-faq/faq.md index f1ef90761..5e4a721cd 100644 --- a/docs/troubleshooting-faq/faq.md +++ b/docs/troubleshooting-faq/faq.md @@ -2,7 +2,7 @@ ### Why didn't the sample/project I specified when running the data download script download? -First, we recommend using the `--dryrun` flag when running the [data download script](../getting-started/accessing-resources/getting-access-to-data.md#using-the-download-data-script) to check which files _would_ be downloaded. +First, we recommend using the `--dryrun` flag when running the [`download-data.py` script](../getting-started/accessing-resources/getting-access-to-data.md#using-the-download-data-script) to check which files _would_ be downloaded. This will confirm that there is nothing wrong with your internet connection and that you are properly [logged into your AWS profile](../technical-setup/environment-setup/configure-aws-cli.md#logging-in-to-a-new-session). If running the script with `--dryrun` states that _only_ the `DATA_USAGE.md` file is being downloaded, this means the data files you are attempting to download do not exist. From f12c243b6857a73e2c4e3fe9422793f0587baaa8 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Thu, 16 Jan 2025 13:12:20 -0500 Subject: [PATCH 7/9] Apply suggestions from code review Co-authored-by: Joshua Shapiro --- docs/troubleshooting-faq/faq.md | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/docs/troubleshooting-faq/faq.md b/docs/troubleshooting-faq/faq.md index 5e4a721cd..837c6ec02 100644 --- a/docs/troubleshooting-faq/faq.md +++ b/docs/troubleshooting-faq/faq.md @@ -121,17 +121,18 @@ After running the module, results will generally be stored in `analysis/{module ### What if I want to use Seurat? -While [data downloads](../getting-started/accessing-resources/getting-access-to-data.md) are only available in `SingleCellExperiment` and `AnnData` format, `Seurat` versions of all objects (in [v5 assay format](https://satijalab.org/seurat/articles/seurat5_essential_commands)) are also available for use. +While [data downloads](../getting-started/accessing-resources/getting-access-to-data.md) are only available in `SingleCellExperiment` and `AnnData` format, `Seurat` versions of processed objects (in [v5 assay format](https://satijalab.org/seurat/articles/seurat5_essential_commands)) are also available for use. These files are part of the OpenScPCA results, associated with the module [`seurat-conversion`](https://github.com/AlexsLemonade/OpenScPCA-analysis/tree/main/analyses/seurat-conversion) which we wrote to convert the processed `SingleCellExperiment` objects to `Seurat` format. For more information on obtaining result files, please refer to the documentation for [the `download-results.py` script](../getting-started/accessing-resources/getting-access-to-data.md#accessing-scpca-module-results). When working with these `Seurat` objects, please bear in mind the following: -* They were _not_ processed with a `Seurat` pipeline. -They were processed using the same pipeline as all OpenScPCA objects were (e.g., with `Bioconductor`), and then converted to `Seurat` format - * Notably, they do contain the raw data counts, allowing you to perform normalization, dimension reduction, etc. with `Seurat` directly if you so choose -* To be more consistent with `Seurat` analysis pipelines, gene names in these objects use gene symbols rather than Ensembl ids +* These `Seurat` objects include the same content as the `SingleCellExperiment` objects that they are derived from. +This includes raw and normalized counts, annotations of highly variable genes, PCA and UMAP transformations, as well as cell and feature metadata. + * Note that all calculations were performed using `Bioconductor` packages, so values will differ from the results obtained using `Seurat` functions from the same raw data. + * If `Seurat`-derived values are required, processing steps may need to be repeated. +* To be more consistent with `Seurat` analysis pipelines, these objects use gene symbols rather than Ensembl ids as the row names and primary feature id. ### The ScPCA data objects contain Ensembl ids, but I need gene symbols for my analysis. How should I perform this conversion? @@ -141,12 +142,12 @@ Installation instructions are provided in the `rOpenScPCA` GitHub repository. This package has two particular functions to support this task: -* `rOpenScPCA::sce_to_symbols()` - * This function converts row names in a `SingleCellExperiment` object from Ensembl ids to gene symbols * `rOpenScPCA::ensembl_to_symbol()` * This function converts a vector of Ensembl ids to a vector of gene symbols +* `rOpenScPCA::sce_to_symbols()` + * This function converts row names in a `SingleCellExperiment` object from Ensembl ids to gene symbols -Please refer to these functions' help menus (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on their use. +Please refer to these functions' help pages (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on their use, including options for handling duplicate gene symbols. ### I noticed there are cluster assignments in the processed data files. Should I use those or re-cluster the data? @@ -160,5 +161,5 @@ To support clustering analysis and evaluation, we provide several functions in a * Evaluate clustering results with several quality control metrics * Calculate different sets of clustering results across parameter space in order to identify an optimal clustering scheme -We also provide an OpenScPCA analysis module [`hello-clusters`](https://github.com/AlexsLemonade/OpenScPCA-analysis/tree/main/analyses/hello-clusters) with example notebooks demonstrating how to use clustering functionality in `rOpenScPCA`. +We also provide an OpenScPCA analysis module [`hello-clusters`](https://github.com/AlexsLemonade/OpenScPCA-analysis/tree/main/analyses/hello-clusters) with example notebooks demonstrating how to use the clustering functionality in `rOpenScPCA`. This module also provides instructions on how to install `rOpenScPCA`. From 0ff6ce7aaaa6b3b0a6d7587e75bd32ba12c42d9a Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Thu, 16 Jan 2025 13:54:17 -0500 Subject: [PATCH 8/9] respond to reviews and try out one new phrasing bit --- docs/troubleshooting-faq/faq.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/docs/troubleshooting-faq/faq.md b/docs/troubleshooting-faq/faq.md index 837c6ec02..eb6331800 100644 --- a/docs/troubleshooting-faq/faq.md +++ b/docs/troubleshooting-faq/faq.md @@ -20,7 +20,7 @@ Data files in each release are organized on S3 as: {Release} ├── {Project ID} │ └── {Sample ID} - │  └── {Library files} + │ └── {Library files} ├── bulk_metadata.tsv (if applicable) ├── bulk_quant.tsv (if applicable) └── single_cell_metadata.tsv @@ -129,17 +129,16 @@ For more information on obtaining result files, please refer to the documentatio When working with these `Seurat` objects, please bear in mind the following: * These `Seurat` objects include the same content as the `SingleCellExperiment` objects that they are derived from. -This includes raw and normalized counts, annotations of highly variable genes, PCA and UMAP transformations, as well as cell and feature metadata. +This includes raw and normalized counts, annotations of highly variable genes, PCA and UMAP transformations, as well as cell and feature metadata. * Note that all calculations were performed using `Bioconductor` packages, so values will differ from the results obtained using `Seurat` functions from the same raw data. - * If `Seurat`-derived values are required, processing steps may need to be repeated. + * If your analysis requires fields created from `Seurat` processing pipelines, you will need to repeat those processing steps. * To be more consistent with `Seurat` analysis pipelines, these objects use gene symbols rather than Ensembl ids as the row names and primary feature id. ### The ScPCA data objects contain Ensembl ids, but I need gene symbols for my analysis. How should I perform this conversion? -In an effort to keep this consistent across the OpenScPCA project, we provide functions to convert Ensembl ids to gene symbols in an R package we maintain called [`rOpenScPCA`](https://github.com/AlexsLemonade/rOpenScPCA/). -Installation instructions are provided in the `rOpenScPCA` GitHub repository. - +In an effort to keep this consistent across the OpenScPCA project, we provide functions to convert Ensembl ids to gene symbols in an R package we maintain called `rOpenScPCA`. +Installation instructions are provided in the [`rOpenScPCA` GitHub repository](https://github.com/AlexsLemonade/rOpenScPCA/?tab=readme-ov-file#installation). This package has two particular functions to support this task: * `rOpenScPCA::ensembl_to_symbol()` @@ -147,7 +146,7 @@ This package has two particular functions to support this task: * `rOpenScPCA::sce_to_symbols()` * This function converts row names in a `SingleCellExperiment` object from Ensembl ids to gene symbols -Please refer to these functions' help pages (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on their use, including options for handling duplicate gene symbols. +Please refer to these functions' help pages (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on their use, including options for handling duplicate and/or missing gene symbols. ### I noticed there are cluster assignments in the processed data files. Should I use those or re-cluster the data? From cdee8a46354223ade5364319e69a2e1eafff8dd8 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Fri, 17 Jan 2025 08:46:30 -0500 Subject: [PATCH 9/9] just use or --- docs/troubleshooting-faq/faq.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/troubleshooting-faq/faq.md b/docs/troubleshooting-faq/faq.md index eb6331800..d65360312 100644 --- a/docs/troubleshooting-faq/faq.md +++ b/docs/troubleshooting-faq/faq.md @@ -146,7 +146,7 @@ This package has two particular functions to support this task: * `rOpenScPCA::sce_to_symbols()` * This function converts row names in a `SingleCellExperiment` object from Ensembl ids to gene symbols -Please refer to these functions' help pages (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on their use, including options for handling duplicate and/or missing gene symbols. +Please refer to these functions' help pages (e.g., `?rOpenScPCA::sce_to_symbols`) for additional information on their use, including options for handling duplicate or missing gene symbols. ### I noticed there are cluster assignments in the processed data files. Should I use those or re-cluster the data?