Merge pull request #54 from friendsofstrandseq/dev: 2.2.5 with Dockerfile
Showing 3 changed files with 419 additions and 101 deletions.
# Mosaicatcher workshop

## AFAC

Context: you are working on a fancy project, and at some point Jan suggests generating some Strand-seq data.

It is the first time you have worked with Strand-seq, and you are starting to panic.

But you remember hearing that a comprehensive tool was developed in the lab to process Strand-seq data in a systematic way: ta-dam, MOSAICATCHER.

Prerequisite: I asked you to select a sample to process during today's workshop.

TD: feedback on how they traced back the name of the sample, the associated run/flowcell, the date when it was sequenced ...
So here's the plan for today:
- Small intro (~20-30 min) about MosaiCatcher: the different steps, options, branches, possibilities
- Output examples
- SV trustworthiness
- Web report analysis of RPE-MIXTURE
- Hands on: pipeline install, module load, test data execution
|
||
vim scNOVA_input_user/input_subclonality.txt | ||
|
||
|
||
|
||
|
||
|
||
|
||
Then trigger the pipeline on YOUR data | ||
|
||
Once this is running, web report analysis together with questions | ||
|
||
Then, Strand-scape | ||
Still in beta, some microservices instable, main application for QC and web report consultation | ||
Remove MC trigger, too complex in the backend | ||
Cell selection with username | ||
Still in beta, some microservices instable, main application for QC and web report consultation | ||
Remove MC trigger, too complex in the backend | ||
Cell selection with username | ||
|
||
cp --preserve=timestamps FROM_ TO_ | ||
|
||
snakemake ... | ||
|
||
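Preserving timestamps when copying matters here because snakemake decides what to rerun from file modification times. A throwaway demo (the `/tmp` paths are placeholders, not workshop data):

```bash
# Copying with --preserve=timestamps keeps the source mtime, so snakemake
# (which compares mtimes) will not treat the copy as freshly modified.
echo data > /tmp/from_demo.txt
touch -d '2020-01-01' /tmp/from_demo.txt      # give the source an old mtime
cp --preserve=timestamps /tmp/from_demo.txt /tmp/to_demo.txt
# The copy is NOT newer than the source, i.e. the mtime survived the copy:
[ /tmp/to_demo.txt -nt /tmp/from_demo.txt ] || echo "timestamps preserved"
```

A plain `cp` would stamp the copy with the current time and could trigger unwanted reruns.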
## Technical prerequisites

- SSHFS/SFTP connection to visualise/download/access the files created (WinSCP/FileZilla/Cyberduck)
- Functional terminal connected to the EMBL cluster (if not, follow the SSH key configuration here: https://www.embl.org/internal-information/it-services/hpc-resources-heidelberg/)
- Have a workspace on /g/korbel

## Workshop prerequisites

---

- Pick a sample name to be processed
- Download this MosaiCatcher report: https://oc.embl.de/index.php/s/WBgrzBjyzdYdVJA/download

## EMBL cheatsheet

### connect to seneca

```bash
ssh [email protected]
```

### connect to login nodes

```bash
ssh USERNAME@login0[1,2,3,4].embl.de   # login01 to login04
```
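An SSH config entry can shorten these commands. A sketch, written to a demo path here (the real file would be `~/.ssh/config`, and `USERNAME` is a placeholder for your EMBL account):

```bash
# Hypothetical ~/.ssh/config entry: `ssh login01` then expands to
# USERNAME@login01.embl.de (%h is the matched host alias).
cat > /tmp/ssh_config_demo <<'EOF'
Host login01 login02 login03 login04
    HostName %h.embl.de
    User USERNAME
EOF
cat /tmp/ssh_config_demo
```

With this in place, `ssh login03` is enough to reach the third login node.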
**ℹ️ Important Note**
Snakemake important arguments/options
- `--rerun-triggers`
- `--touch`
## MosaiCatcher important files

- Counts: PARENT_FOLDER/SAMPLE_NAME/counts/SAMPLE_NAME.txt.raw.gz
- Counts statistics: PARENT_FOLDER/SAMPLE_NAME/counts/SAMPLE_NAME.info_raw
- Ashleys predictions: PARENT_FOLDER/SAMPLE_NAME/cell_selection/labels.tsv
- Counts plot: PARENT_FOLDER/SAMPLE_NAME/plots/CountComplete.raw.pdf
- Counts normalised plot: PARENT_FOLDER/SAMPLE_NAME/plots/CountComplete.normalised.pdf
- Phased W/C regions: PARENT_FOLDER/SAMPLE_NAME/strandphaser/strandphaser_phased_haps_merged.txt
- SV calls (stringent): PARENT_FOLDER/SAMPLE_NAME/mosaiclassifier/sv_calls/stringent_filterTRUE.tsv
- SV calls (lenient): PARENT_FOLDER/SAMPLE_NAME/mosaiclassifier/sv_calls/lenient_filterFALSE.tsv
- Plots folder: PARENT_FOLDER/SAMPLE_NAME/plots/
- scNOVA outputs:
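All of these paths follow one pattern. A small helper makes that explicit and saves typing during the workshop (`mc_paths` is hypothetical, not part of the pipeline):

```bash
# Print key MosaiCatcher output paths for a given parent folder and sample.
mc_paths() {
  local parent="$1" sample="$2"
  printf '%s\n' \
    "$parent/$sample/counts/$sample.txt.raw.gz" \
    "$parent/$sample/counts/$sample.info_raw" \
    "$parent/$sample/cell_selection/labels.tsv" \
    "$parent/$sample/mosaiclassifier/sv_calls/stringent_filterTRUE.tsv"
}
mc_paths .tests/data_CHR17 RPE-BM510
```

After a test run, `ls -lh $(mc_paths .tests/data_CHR17 RPE-BM510)` lists the corresponding files.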
## CLI usage of the pipeline

### Quick Start

Notes

- Config definition is crucial: via the command line or via a YAML file, it defines where to stop, which mode, which branch, and which options to use
- Profile
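The same options passed via `--config` on the command line can also live in the YAML given to `--configfile`. A hypothetical minimal file, written to a demo path (keys copied from the command lines used later in this document; the real `.tests/config/simple_config.yaml` contains more keys):

```bash
# Sketch of a config YAML equivalent to the --config flags used below.
cat > /tmp/my_config_demo.yaml <<'EOF'
data_location: .tests/data_CHR17
ashleys_pipeline: True
multistep_normalisation: True
MultiQC: True
EOF
cat /tmp/my_config_demo.yaml
```

Command-line `--config` values override the YAML, which is convenient for switching one option between runs.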
2. Load snakemake

   A. Use module load OR create a dedicated conda environment

   ```bash
   module load snakemake/7.32.4-foss-2022b
   ```
---

**ℹ️ Note**

- Please be careful with your conda/mamba setup: if you applied specific constraints/modifications to your system, this could lead to version discrepancies.
- mamba is usually preferred but might not be installed by default on a shared cluster environment.

---

```bash
conda create -n snakemake -c bioconda -c conda-forge -c defaults -c anaconda snakemake
```

B. Activate the dedicated conda environment

```bash
conda activate snakemake
```

**Reminder:** you will need to verify that this conda environment is activated and provides the right snakemake before each execution (the `which snakemake` command should output something like \<FOLDER>/\<USER>/[ana|mini]conda3/envs/snakemake/bin/snakemake)
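That check can be scripted. A sketch, where the expected path pattern is an assumption based on the reminder above:

```bash
# Return OK only if the given snakemake path comes from the dedicated env.
check_snakemake_path() {
  case "$1" in
    */envs/snakemake/bin/snakemake) echo "OK" ;;
    *) echo "activate the snakemake env first" ;;
  esac
}
check_snakemake_path "$(command -v snakemake || echo none)"
```

Running this before each pipeline launch avoids the classic "wrong snakemake" failure mode.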
3. Run on example data on only one small chromosome (`<disk>` must be replaced by your disk letter/name)

Take a look at the folder structure:

```bash
tree -h .tests/data_CHR17
```

It should be similar to this:

```
Parent_folder
|-- Sample_1
|   `-- fastq
|       |-- Cell_01.1.fastq.gz
|       |-- Cell_01.2.fastq.gz
|       |-- Cell_02.1.fastq.gz
|       |-- Cell_02.2.fastq.gz
|       |-- Cell_03.1.fastq.gz
|       |-- Cell_03.2.fastq.gz
|       |-- Cell_04.1.fastq.gz
|       `-- Cell_04.2.fastq.gz
|
`-- Sample_2
    `-- fastq
        |-- Cell_21.1.fastq.gz
        |-- Cell_21.2.fastq.gz
        |-- Cell_22.1.fastq.gz
        |-- Cell_22.2.fastq.gz
        |-- Cell_23.1.fastq.gz
        |-- Cell_23.2.fastq.gz
        |-- Cell_24.1.fastq.gz
        `-- Cell_24.2.fastq.gz
```

First, use the `--dry-run` option of snakemake to make sure the Graph of Execution is properly connected. (In combination with `--dry-run`, we use the `local/conda` profile, as snakemake still presents a bug when looking for the singularity container.)

```bash
# data_location                : data location
# ashleys_pipeline=True        : download & trigger the ashleys QC upstream module
# ashleys_pipeline_only=True   : stop after ashleys QC - validation purpose
# multistep_normalisation=True : trigger Marco's multistep normalisation
# MultiQC=True                 : trigger samtools stats, FastQC & MultiQC reporting
snakemake \
  --cores 6 \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=True \
    multistep_normalisation=True \
    MultiQC=True \
  --profile workflow/snakemake_profiles/local/conda/ \
  --dry-run   # only check that everything connects well and is ready for computing
```
||
If there is no error message, you are good to go!

```bash
# Snakemake profile if singularity is installed:     workflow/snakemake_profiles/local/conda_singularity/
# Snakemake profile if singularity is NOT installed: workflow/snakemake_profiles/local/conda/
# With singularity, bind your disk into the container: --singularity-args "-B /disk:/disk"
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=True \
    multistep_normalisation=True \
    MultiQC=True \
  --profile workflow/snakemake_profiles/HPC/slurm_EMBL/ \
  --cores 24
```
4. Take a look at the files generated on the example data

```bash
cat .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.info_raw
zcat .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.txt.raw.gz | less
cat .tests/data_CHR17/RPE-BM510/cell_selection/labels.tsv
```
---

Look at the plots:

```
.tests/data_CHR17/RPE-BM510/plots
```

---

**ℹ️ Note**

- Steps 0 - 2 are required only during the first execution
- After the first execution, do not forget to go into the git repository and activate the snakemake environment

---

**ℹ️ Note for 🇪🇺 EMBL users**

- Use the following profile to run on the EMBL cluster: `--profile workflow/snakemake_profiles/HPC/slurm_EMBL`

---
## 🔬 Start running your own analysis

The following commands show an example using local execution (not HPC or cloud).

1. Start running your own Strand-seq analysis

REPORT

```bash
snakemake \
  --cores <N> \
  --config \
    data_location=<INPUT_DATA_FOLDER> \
  --profile workflow/snakemake_profiles/local/conda_singularity/
```
2. Generate report

```bash
# data_location                : data location
# ashleys_pipeline=True        : download & trigger the ashleys QC upstream module
# ashleys_pipeline_only=False  : continue after ashleys QC
# multistep_normalisation=True : trigger Marco's multistep normalisation
# MultiQC=True                 : trigger samtools stats, FastQC & MultiQC reporting
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=False \
    multistep_normalisation=True \
    MultiQC=True \
  --profile workflow/snakemake_profiles/HPC/slurm_EMBL/ \
  --cores 24 \
  --report TEST_DATA_REPORT.zip \
  --report-stylesheet workflow/report/custom-stylesheet.css
```
## System requirements

This workflow is meant to be run on a Unix-based operating system (tested on Ubuntu 18.04 & CentOS 7).

Minimum system requirements vary based on the use case. We highly recommend running it in a server environment with 32+ GB RAM and 12+ cores.

Questions???

- [Conda install instructions](https://conda.io/miniconda.html)
- [Singularity install instructions](https://sylabs.io/guides/3.0/user-guide/quick_start.html#quick-installation-steps)
## Detailed usage

### 🐍 1. Mosaicatcher basic conda environment install

MosaiCatcher leverages snakemake built-in features such as execution within containers and predefined modular conda environments. That is why it is only necessary to create an environment that relies on [snakemake](https://github.com/snakemake/snakemake) (to execute the pipeline) and [pandas](https://github.com/pandas-dev/pandas) (to handle basic configuration). If you plan to generate an HTML web report including plots, it is also necessary to install [imagemagick](https://github.com/ImageMagick/ImageMagick).

If possible, it is also highly recommended to install and use the `mamba` package manager instead of `conda`, which is much more efficient.

SCNOVA

```bash
mkdir -p .tests/data_CHR17/RPE-BM510/scNOVA_input_user
awk 'BEGIN {FS=OFS="\t"} NR==1 {print "Filename", "Subclonality"} NR>1 && $2==1 {sub(/\.sort\.mdup\.bam/, "", $1); print $1, "clone"}' .tests/data_CHR17/RPE-BM510/cell_selection/labels.tsv > .tests/data_CHR17/RPE-BM510/scNOVA_input_user/input_subclonality.txt
```
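To see what that awk one-liner produces, here it is run on a tiny fake `labels.tsv` (the column names and values below are assumptions for the demo, not real workshop data):

```bash
# Fake ashleys labels file: header, one kept cell (prediction 1), one dropped.
printf 'cell\tprediction\nCell_01.sort.mdup.bam\t1\nCell_02.sort.mdup.bam\t0\n' > /tmp/labels_demo.tsv
# Same command as above, pointed at the demo file: it keeps rows where the
# second column is 1, strips the .sort.mdup.bam suffix, and labels every
# remaining cell as one "clone".
awk 'BEGIN {FS=OFS="\t"} NR==1 {print "Filename", "Subclonality"} NR>1 && $2==1 {sub(/\.sort\.mdup\.bam/, "", $1); print $1, "clone"}' /tmp/labels_demo.tsv
```

This prints a two-line TSV: a `Filename`/`Subclonality` header followed by `Cell_01` tagged as `clone`; `Cell_02` is dropped because its prediction is 0.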
```bash
conda install -c conda-forge mamba
mamba create -n snakemake -c bioconda -c conda-forge -c defaults -c anaconda snakemake
```

```bash
# ashleys_pipeline_only=False : continue after ashleys QC
# scNOVA=True                 : trigger the scNOVA downstream module
conda activate mosaicatcher_env
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=False \
    multistep_normalisation=True \
    MultiQC=True \
    scNOVA=True \
  --profile workflow/snakemake_profiles/HPC/slurm_EMBL/ \
  --cores 24
```
### ⤵️ 2. Clone repository & go into workflow directory

After cloning the repo, go into the `workflow` directory, which corresponds to the pipeline entry point.

```bash
git clone --recurse-submodules https://github.com/friendsofstrandseq/mosaicatcher-pipeline.git
cd mosaicatcher-pipeline
```
### ⚙️ 3. MosaiCatcher execution (without preprocessing)