Built on 2023-09-15, v0.1.37, dev

neurobioinfo · Sep 23, 2023 · 18ac1bb · 18ac1bb
1 parent ddb09ef
commit 18ac1bb
Showing 55 changed files with 5,070 additions and 1,101 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/docs/Dataset1.md b/docs/Dataset1.md
diff --git a/docs/HTO.md b/docs/HTO.md
diff --git a/docs/SCRNA.md b/docs/SCRNA.md
diff --git a/docs/index.md b/docs/index.md
@@ -1,22 +1,32 @@
 # Welcome to scRNAbox's documentation!
-ScRNAbox is a single-cell RNA sequencing (scRNAseq) pipeline specifically designed for analyzing data under a High-Performance Computing (HPC) systems using the [Slurm Workload Manager](https://slurm.schedmd.com/). ScRNAbox provides two distinct, yet highly comparable Analysis Tracks:
+ScRNAbox is a single-cell RNA sequencing (scRNAseq) pipeline specifically designed for analyzing data on High-Performance Computing (HPC) systems using the [Slurm Workload Manager](https://slurm.schedmd.com/). The scRNAbox pipeline incorporates nine Analytical Steps into a comprehensive scRNAseq analysis and provides the foundation for further investigations. The nine Analytical Steps are outlined below.
+
+<img src="https://github.com/neurobioinfo/scrnabox/assets/110110777/eccddd8e-4ea2-4c1e-9427-8ba40e6418ba" width="550" height="100">
+
+The scRNAbox pipeline provides two distinct, yet highly comparable Analysis Tracks:
 
 1. **Standard scRNAseq**
 2. **Cell Hashtag scRNAseq**
 
-The Standard Analysis Track is designed for experiments where each sample is captured and sequenced separately, while the Cell Hashtag Analysis Track is designed for multiplexed experiments, whereby samples are tagged with sample-specific barcodes, pooled, and sequenced together. The Cell Hashtag Analysis Track is distinguished by an additional sample demultiplexing Step that assigns cells to their sample-of-origin via the sample-specific barcodes. 
+The **Standard Analysis Track** is designed for experiments where each sample is captured and sequenced separately, while the **Cell Hashtag Analysis Track** is designed for multiplexed experiments, whereby samples are tagged with sample-specific barcodes, pooled, and sequenced together. The Cell Hashtag Analysis Track is distinguished by an additional sample demultiplexing Step that assigns cells to their sample-of-origin via the sample-specific barcodes. 
 
 <img src="https://github.com/neurobioinfo/scrnabox/assets/110110777/3a6df83e-e104-45d2-9b04-fe246642c6a8" height="300"> 
 
-For instructions on how to run each Analytical Step of the [Standard scRNAseq](SCRNA.md) and [Cell Hashtag scRNAseq](HTO.md) Analysis Track please see the respective tutorials. For a demonstration that leverages the datasets used as the application cases in the manuscript please see [Dataset1: Smajic et al.](Dataset1.md) and [Datset2: Stoeckius et al.](Dataset2.md) for the Standard scRNAseq and Cell Hashtag scRNAseq Analysis Track, respectively.
+For a comprehensive description of each Analytical Step, please see [Standard Analysis Track](SCRNA.md) and [Cell Hashtag Analysis Track](HTO.md). <br/>
+
+For a tutorial that leverages the datasets used as the application cases in our pre-print manuscript, please see [Standard Analysis: Midbrain dataset](Dataset1.md) and [Cell Hashtag Analysis: PBMC dataset](Dataset2.md).
+
+ - - - -
 
 ## Contents
 - [Installation](installation.md)
-- [Tutorial:]()
-    - [Standard scRNAseq](SCRNA.md)
-    - [Cell Hashtag scRNAseq](HTO.md)
+- Overview:
+    - [Standard Analysis Track](SCRNA.md)
+    - [Cell Hashtag Analysis Track](HTO.md)
+    - [Execution parameters](reference.md)
+    - [Outputs](outputs.md)
+- Tutorial:
+    - [Standard Analysis Track: Midbrain dataset](Dataset1.md)
+    - [Cell Hashtag Analysis Track: PBMC dataset](Dataset2.md)
     - [Processed Data](PROC.md)
-    - [Dataset1: Smajic et al.](Dataset1.md)
-    - [Datset2: Stoeckius et al.](Dataset2.md)
 - [FAQ](FAQ.md)
-- [Reference](reference.md)
diff --git a/docs/installation.md b/docs/installation.md
@@ -1,4 +1,14 @@
 # Installation
+To use the scRNAbox pipeline, the following must be installed on your High-Performance Computing (HPC) system:
+
+- [scrnabox.slurm](#scrnaboxslurm-installation)
+- [CellRanger](#cellranger-installation)
+- [R and R packages](#r-library-preparation-and-r-package-installation)
+
+ - - - -
+
+### scrnabox.slurm installation
+
 `scrnabox.slurm` is written in bash and can be used with any Slurm system. To download the latest version of `scrnabox.slurm` (v0.1.35) run the following command: 
 ```
 wget https://github.com/neurobioinfo/scrnabox/releases/download/v0.1.35/scrnabox.slurm.zip
@@ -7,50 +17,91 @@ unzip scrnabox.slurm.zip
 
 For a description of the options for running `scrnabox.slurm` run the following command:
 ```
-bash ./scrnabox.slurm/launch_scrnabox.sh -h 
+bash ~/scrnabox.slurm/launch_scrnabox.sh -h 
 ```
 
-`scrnabox.slurm` requires that `cellranger` and `R` are also installed on the HPC system. In addition, the following R packages must be loaded: `'Seurat','ggplot2', 'dplyr', 'foreach', 'doParallel', 'Matrix', 'DoubletFinder','cowplot','clustree'`. Then, install the `'scrnaboxR'` R package by running the following command: 
-```
-devtools::install_github("neurobioinfo/scrnabox/scrnaboxR")
+If `scrnabox.slurm` has been installed properly, the above command should return the following:
 ```
+        mandatory arguments:
+                -d  (--dir)  = Working directory (where all the outputs will be printed) (give full path)
+                --steps  =  Specify what steps, e.g., 2 to run just step 2, 2-4, run steps 2 through 4)
 
-Please note that all R packages must be loaded to a common R library. Shown below is an example of how to load packages into a common library in R.
-```
-R_LIB_PATH=“Path_to_R_library”
-.libPaths(R_LIB_PATH)
-devtools::install_github("neurobioinfo/scrnabox/scrnaboxR")
+        optional arguments:
+                -h  (--help)  = See helps regarding the pipeline options. 
+                --method  = Choose what scRNA method you want to use; use HTO  and SCRNA for for hashtag nad Standard scRNA, respectively. 
+                --nFeature_RNA_L  = Lower threshold of number of unique RNA transcripts for each cell, it filters nFeature_RNA > nFeature_RNA_L.  
+                --nFeature_RNA_U  = Upper threshold of number of unique RNA transcripts for each cell, it filters --nFeature_RNA_U.  
+                --nCount_RNA_L  = Lower threshold for nCount_RNA, it filters nCount_RNA > nCount_RNA_L   
+                --nCount_RNA_U  = Upper threshold for  nCount_RNA, it filters nCount_RNA < nCount_RNA_U  
+                --mitochondria_percent_L  = Lower threshold for the amount of mitochondrial transcript, it is in percent, mitochondria_percent > mitochondria_percent_L. 
+                --mitochondria_percent_U  = Upper threshold for the amount of mitochondrial transcript, it is in percent, mitochondria_percent < mitochondria_percent_U. 
+                --log10GenesPerUMI_U  = Upper threshold for the log number of genes per UMI for each cell, it is in percent,log10GenesPerUMI=log10(nFeature_RNA)/log10(nCount_RNA). mitochondria_percent < log10GenesPerUMI_U. 
+                --log10GenesPerUMI_L  = Lower threshold for the log number of genes per UMI for each cell, log10GenesPerUMI=log10(nFeature_RNA)/log10(nCount_RNA). mitochondria_percent > log10GenesPerUMI_L.  
+                --msd  = you can get the hashtag labels by running the following code 
+                --marker  = Find marker. 
+                --sinfo  = Do you need sample info? 
+                --fta  = FindTransferAnchors 
+                --enrich  = Annotation 
+                --dgelist  = creates a DGEListobject from a table of counts obtained from seurate objects. 
+                --genotype  = Run the genotype contrast. 
+                --celltype  = Run the Genotype-cell contrast. 
+                --cont  = You can directly call the contrast to the pipeline.  
+                --seulist                = You can directly call the list of seurat objects to the pipeline. 
 ```
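As a worked example of the `log10GenesPerUMI` definition shown in the help text above, the sketch below computes the ratio for a single cell. The helper name `log10_genes_per_umi` and the input values are hypothetical and are not part of `scrnabox.slurm`; the ratio of natural logarithms is used because it equals the ratio of base-10 logarithms.

```shell
# Hypothetical helper (not part of scrnabox.slurm):
# prints log10(nFeature_RNA) / log10(nCount_RNA) for one cell.
# Usage: log10_genes_per_umi NFEATURE_RNA NCOUNT_RNA
log10_genes_per_umi() {
    # log(a)/log(b) with natural logs equals log10(a)/log10(b)
    LC_ALL=C awk -v f="$1" -v c="$2" 'BEGIN { printf "%.3f\n", log(f) / log(c) }'
}

# Illustrative values: nFeature_RNA = 1000, nCount_RNA = 10000
log10_genes_per_umi 1000 10000   # prints 0.750
```

A cell with these values would then be compared against the thresholds given with `--log10GenesPerUMI_L` and `--log10GenesPerUMI_U`.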
+ - - - -
 
-Upon installing `scrnabox.slurm`,`cellranger`, `R`, and the required R packages, users can run the pipeline initiation Step and define their desired Analysis Track (SCRNA or HTO for Standard scRNAseq or Cell Hashtag scRNAseq, respectively) using the following command:
-```
-bash ./scrnabox.slurm/launch_scrnabox.sh \
--d ./working_directory \
---steps 0 \
---method SCRNA
-```
+### CellRanger installation
 
-After initiating the pipeline, the structure of the working directory should be as follows:
+For information regarding the installation of `CellRanger`, please visit the 10X Genomics [documentation](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation). If CellRanger is already installed on your HPC system, you may skip the CellRanger installation procedures.
 
-```
-├── working_directory
-    ├── job_info
-        ├── configs
-        ├── logs
-        ├── parameters
-```
-- The `configs/` directory contains the `scrnabox_config.ini` file which allows users to specify their job allocations (memory, threads, and walltime) for each Analytical Step using the Slurm Workload Manager <br /> 
-- The `logs/` directory records the events of each Analytical Step <br />
-- The `parameters/` directory contains adjustable, Step-specific text files which allow users to define the execution parameters for each Analytical Step <br />
+ - - - -
 
-Users must then navigate to the `scrnabox_config.ini` file in `~/working_directory/job_info/configs` to define the location of their R library (`R_LIB_PATH=`), their version of R (`R_VERSION=`), and the location of CellRanger (`MODULECELLRANGER=`). For example: 
+### R library preparation and R package installation
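+Users must first ensure that `R` is installed and available on their HPC system, for example by loading it as a module: 
 
 ```
-MODULECELLRANGER=mugqic/cellranger/5.0.1
-R_VERSION=4.2.1
-R_LIB_PATH=Path_to_R_library
+# load R (the module name and version depend on your HPC system)
+module load r/4.2.1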
+```
+Then, users should create a designated directory on their HPC system where the required R packages will be installed:
+
 ```
+# make common R library
+mkdir R_library
+cd R_library
 
+# open R
+R 
+
+# set common R library path
+R_LIB_PATH="~/R_library"
+.libPaths(R_LIB_PATH)
+
+# load packages (any package missing from this library must be installed into it first)
+library(Seurat)
+library(ggplot2)
+library(dplyr)
+library(foreach)
+library(doParallel)
+library(Matrix)
+library(DoubletFinder)
+library(cowplot)
+library(clustree)
+library(xlsx)
+library(enrichR)
+library(stringi)
+library(limma)
+library(tidyverse)
+library(edgeR)
+library(vctrs)
+library(RColorBrewer)
+library(fossil)
+library(openxlsx)
+library(stringr)
+library(ggpubr)
+devtools::install_github("neurobioinfo/scrnabox/scrnaboxR")
+```
+ - - - -
 Upon completing the installation procedures, users can proceed with the scRNAbox pipeline using either the [Standard scRNAseq Analysis Track](SCRNA.md) or [Cell Hashtag scRNAseq Analysis Track](HTO.md). 
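
Before moving on, it can help to confirm that the command-line tools listed above are reachable from the current shell. The following is a minimal sketch, not part of `scrnabox.slurm`: the function name `check_tools` is hypothetical, and it assumes that R and CellRanger are exposed as the `R` and `cellranger` executables once the relevant modules are loaded.

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them to your HPC system.
check_tools R cellranger || echo "Load or install the missing tools before running the pipeline."
```

This only checks that the executables are on the `PATH`; it does not verify that the required R packages are present in the R library.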
 
 

diff --git a/docs/library_prep.md b/docs/library_prep.md
@@ -0,0 +1,72 @@
+Finally, in preparation for Step 1 (FASTQ pre-processing with CellRanger) users must create `library.csv` and `feature_ref.csv` files for each of their sequencing runs.<br />
+
+#### library.csv
+The `library.csv` file defines the necessary information of the FASTQ files for the experiment, including the gene expression and antibody assays. The structure of the `library.csv` file should be: <br />
+```
+fastqs,sample,library_type
+~/fastqs/,RUN1GEX,Gene Expression
+~/fastqs/,RUN1HTO,Antibody Capture
+```
+- The `fastqs` column defines the path to the directory that contains the FASTQ files for the experiment. <br /> 
+- The `sample` column defines the sample name of the corresponding FASTQ file. Please note that FASTQ files must be named according to standard CellRanger nomenclature. For example, "CTRL1_S1_L001_R1_001.fastq". For more information please visit CellRanger's [documentation](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/fastq-input). <br />
+- The `library_type` column defines the assay type. For the Cell Hashtag Analysis Track, each sequencing run should have a "Gene Expression" and an "Antibody Capture" assay. For more information, please visit CellRanger's [documentation](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis). <br />
+
+For example, if the experiment comprises three sequencing runs the following steps should be taken: <br />
+
+1) Navigate to the working directory and create a `samples_info` folder: <br />
+```
+cd ~/working_directory
+mkdir samples_info
+```
+2) Navigate to the `samples_info` folder and create a folder for each sequencing run: <br />
+```
+cd samples_info
+mkdir run1
+mkdir run2
+mkdir run3
+```
+3) Navigate to the folder for each sequencing run and create the `library.csv` file. <br />
+
+After performing steps 1-3 above, the structure of the samples_info folder for an experiment with three sequencing runs should be:
+```
+├── working_directory
+    ├── samples_info
+        ├── run1
+            ├── library.csv
+        ├── run2
+            ├── library.csv
+        ├── run3
+            ├── library.csv
+```
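
Steps 1-3 above can also be scripted. The following is a minimal sketch, assuming three sequencing runs, a placeholder FASTQ path (`/path/to/fastqs`), and sample names that extend the `RUN1GEX`/`RUN1HTO` pattern from the example above to the other runs. `WORKDIR` and `FASTQ_DIR` are illustrative variable names and should be replaced with your own locations.

```shell
# Assumed locations; override by exporting WORKDIR / FASTQ_DIR beforehand.
WORKDIR="${WORKDIR:-./working_directory}"
FASTQ_DIR="${FASTQ_DIR:-/path/to/fastqs}"   # placeholder path

for i in 1 2 3; do
    run_dir="$WORKDIR/samples_info/run$i"
    mkdir -p "$run_dir"

    # One library.csv per sequencing run; sample names are assumptions.
    cat > "$run_dir/library.csv" <<EOF
fastqs,sample,library_type
$FASTQ_DIR,RUN${i}GEX,Gene Expression
$FASTQ_DIR,RUN${i}HTO,Antibody Capture
EOF
done

# List the files that were created.
find "$WORKDIR/samples_info" -name library.csv | sort
```

The sample names written here must match the names of the actual FASTQ files for each run, so they will usually need to be edited by hand afterwards.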
+#### feature_ref.csv
+The `feature_ref.csv` file defines the necessary information for processing the sample-specific barcodes that will eventually be used to demultiplex the pooled samples. For example, if there are four samples pooled together with four unique barcode identifiers, the structure of the `feature_ref.csv` file should be:
+```
+id,name,read,pattern,sequence,feature_type
+Hash1,B0251_TotalSeqB,R2,5PNNNNNNNNNN(BC),GTCAACTCTTTAGCG,Antibody Capture
+Hash2,B0252_TotalSeqB,R2,5PNNNNNNNNNN(BC),TGATGGCCTATTGGG,Antibody Capture
+Hash3,B0253_TotalSeqB,R2,5PNNNNNNNNNN(BC),TTCCGCCTCTCTTTG,Antibody Capture
+Hash4,B0254_TotalSeqB,R2,5PNNNNNNNNNN(BC),AGTAAGTTCAGCGTA,Antibody Capture
+```
+- The `id` column defines the barcode ID which will be used to track the feature counts. <br /> 
+- The `name` column defines an arbitrary name for the barcode identifier. <br /> 
+- The `read` column defines which RNA sequencing read contains the barcode sequence. This value will be either R1 or R2.<br /> 
+- The `pattern` column defines the pattern of the barcode identifiers. For more information please visit the 10X Genomics [documentation](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis#pattern).<br /> 
+- The `sequence` column defines the nucleotide sequence associated with the barcode identifier.<br /> 
+- The `feature_type` column defines the type of feature used for sample identification. Please ensure that the feature_type in the `feature_ref.csv` file matches a library_type in the `library.csv` file.  <br /> 
+
+For more information regarding the preparation of the `feature_ref.csv`, please see CellRanger's [documentation](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis).
+
+`feature_ref.csv` files can be prepared the same way as the `library.csv` files. After producing the `feature_ref.csv` for each sequencing run, the structure of the samples_info folder for an experiment with three sequencing runs should be:
+```
+├── working_directory
+    ├── samples_info
+        ├── run1
+            ├── library.csv
+            ├── feature_ref.csv
+        ├── run2
+            ├── library.csv
+            ├── feature_ref.csv
+        ├── run3
+            ├── library.csv
+            ├── feature_ref.csv
+```
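
Because every `feature_type` in `feature_ref.csv` must match a `library_type` in `library.csv`, a quick consistency check can catch typos before Step 1 is run. The sketch below is one possible way to do this with `awk`; the function name `check_feature_types` is hypothetical, and it assumes simple comma-separated files without quoted fields, as in the examples above.

```shell
# Hypothetical helper: verify that each feature_type (column 6 of feature_ref.csv)
# also appears as a library_type (column 3 of library.csv).
# Usage: check_feature_types LIBRARY_CSV FEATURE_REF_CSV
check_feature_types() {
    awk -F',' '
        FNR == 1          { next }                 # skip the header of each file
        /^[[:space:]]*$/  { next }                 # skip blank lines
        NR == FNR         { lib[$3] = 1; next }    # first file: collect library types
        !($6 in lib)      { printf "no matching library_type for %s (%s)\n", $1, $6; bad = 1 }
        END               { exit bad + 0 }
    ' "$1" "$2"
}

# Demonstration on temporary copies of the example files shown above.
tmp="$(mktemp -d)"
cat > "$tmp/library.csv" <<'EOF'
fastqs,sample,library_type
~/fastqs/,RUN1GEX,Gene Expression
~/fastqs/,RUN1HTO,Antibody Capture
EOF
cat > "$tmp/feature_ref.csv" <<'EOF'
id,name,read,pattern,sequence,feature_type
Hash1,B0251_TotalSeqB,R2,5PNNNNNNNNNN(BC),GTCAACTCTTTAGCG,Antibody Capture
Hash2,B0252_TotalSeqB,R2,5PNNNNNNNNNN(BC),TGATGGCCTATTGGG,Antibody Capture
EOF

check_feature_types "$tmp/library.csv" "$tmp/feature_ref.csv" && echo "feature types OK"
```

The function exits with a non-zero status and prints the offending barcode IDs when a `feature_type` has no counterpart, so it can be run on each `runN` folder in turn.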