Merge pull request #54 from friendsofstrandseq/dev: 2.2.5 with Dockerfile
Showing 3 changed files with 419 additions and 101 deletions.
# Mosaicatcher workshop

## AFAC

Context: you are working on a fancy project, and at some point Jan suggests generating some Strand-seq data.

It is the first time you have worked with Strand-seq, and you are starting to panic.

But you remember hearing that a comprehensive tool was developed in the lab to process Strand-seq data in a systematic way: ta-dam, MOSAICATCHER.

Prerequisite: I asked you to select a sample to process during today's workshop.

TD: feedback on how they traced back the name of the sample, the associated run/flowcell, the date when it was sequenced ...
So here's the plan for today:
- Small intro (~20-30 min) about MosaiCatcher: the different steps, options, branches, possibilities
- Output examples
- SV trustworthiness
- Web report analysis of RPE-MIXTURE
- Hands on: pipeline install, module load, test data execution
|
||
vim scNOVA_input_user/input_subclonality.txt | ||
|
||
|
||
|
||
|
||
|
||
|
||
Then trigger the pipeline on YOUR data | ||
|
||
Once this is running, web report analysis together with questions | ||
|
||
Then, Strand-scape | ||
Still in beta, some microservices instable, main application for QC and web report consultation | ||
Remove MC trigger, too complex in the backend | ||
Cell selection with username | ||
Still in beta, some microservices instable, main application for QC and web report consultation | ||
Remove MC trigger, too complex in the backend | ||
Cell selection with username | ||
|
||
cp --preserve=timestamps FROM_ TO_ | ||
|
||
snakemake ... | ||
|
||
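Preserving timestamps when copying matters here because snakemake decides what to rerun from file modification times. A throwaway demo (the `/tmp` paths are placeholders, not workshop data):

```bash
# Copying with --preserve=timestamps keeps the source mtime, so snakemake
# (which compares mtimes) will not treat the copy as freshly modified.
echo data > /tmp/from_demo.txt
touch -d '2020-01-01' /tmp/from_demo.txt      # give the source an old mtime
cp --preserve=timestamps /tmp/from_demo.txt /tmp/to_demo.txt
# The copy is NOT newer than the source, i.e. the mtime survived the copy:
[ /tmp/to_demo.txt -nt /tmp/from_demo.txt ] || echo "timestamps preserved"
```

A plain `cp` would stamp the copy with the current time and could trigger unwanted reruns.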
## Technical prerequisites

- SSHFS/SFTP connection to visualise/download/access the files created (WinSCP/FileZilla/Cyberduck)
- Functional terminal connected to the EMBL cluster (if not, follow the SSH key configuration here: https://www.embl.org/internal-information/it-services/hpc-resources-heidelberg/)
- Have a workspace on /g/korbel

## Workshop prerequisites

---

- Pick a sample name to be processed
- Download this MosaiCatcher report: https://oc.embl.de/index.php/s/WBgrzBjyzdYdVJA/download

## EMBL cheatsheet

### connect to seneca

```bash
ssh [email protected]
```

### connect to login nodes

```bash
ssh USERNAME@login0[1,2,3,4].embl.de   # login01 to login04
```
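An SSH config entry can shorten these commands. A sketch, written to a demo path here (the real file would be `~/.ssh/config`, and `USERNAME` is a placeholder for your EMBL account):

```bash
# Hypothetical ~/.ssh/config entry: `ssh login01` then expands to
# USERNAME@login01.embl.de (%h is the matched host alias).
cat > /tmp/ssh_config_demo <<'EOF'
Host login01 login02 login03 login04
    HostName %h.embl.de
    User USERNAME
EOF
cat /tmp/ssh_config_demo
```

With this in place, `ssh login03` is enough to reach the third login node.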
**ℹ️ Important Note**
Snakemake important arguments/options
- `--rerun-triggers`
- `--touch`
## MosaiCatcher important files

- Counts: PARENT_FOLDER/SAMPLE_NAME/counts/SAMPLE_NAME.txt.raw.gz
- Counts statistics: PARENT_FOLDER/SAMPLE_NAME/counts/SAMPLE_NAME.info_raw
- Ashleys predictions: PARENT_FOLDER/SAMPLE_NAME/cell_selection/labels.tsv
- Counts plot: PARENT_FOLDER/SAMPLE_NAME/plots/CountComplete.raw.pdf
- Counts normalised plot: PARENT_FOLDER/SAMPLE_NAME/plots/CountComplete.normalised.pdf
- Phased W/C regions: PARENT_FOLDER/SAMPLE_NAME/strandphaser/strandphaser_phased_haps_merged.txt
- SV calls (stringent): PARENT_FOLDER/SAMPLE_NAME/mosaiclassifier/sv_calls/stringent_filterTRUE.tsv
- SV calls (lenient): PARENT_FOLDER/SAMPLE_NAME/mosaiclassifier/sv_calls/lenient_filterFALSE.tsv
- Plots folder: PARENT_FOLDER/SAMPLE_NAME/plots/
- scNOVA outputs:
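All of these paths follow one pattern. A small helper makes that explicit and saves typing during the workshop (`mc_paths` is hypothetical, not part of the pipeline):

```bash
# Print key MosaiCatcher output paths for a given parent folder and sample.
mc_paths() {
  local parent="$1" sample="$2"
  printf '%s\n' \
    "$parent/$sample/counts/$sample.txt.raw.gz" \
    "$parent/$sample/counts/$sample.info_raw" \
    "$parent/$sample/cell_selection/labels.tsv" \
    "$parent/$sample/mosaiclassifier/sv_calls/stringent_filterTRUE.tsv"
}
mc_paths .tests/data_CHR17 RPE-BM510
```

After a test run, `ls -lh $(mc_paths .tests/data_CHR17 RPE-BM510)` lists the corresponding files.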
## CLI usage of the pipeline

### Quick Start

Notes

- Config definition is crucial: via the command line or via a YAML file, it defines where to stop, which mode, which branch, and which options to use
- Profile
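The same options passed via `--config` on the command line can also live in the YAML given to `--configfile`. A hypothetical minimal file, written to a demo path (keys copied from the command lines used later in this document; the real `.tests/config/simple_config.yaml` contains more keys):

```bash
# Sketch of a config YAML equivalent to the --config flags used below.
cat > /tmp/my_config_demo.yaml <<'EOF'
data_location: .tests/data_CHR17
ashleys_pipeline: True
multistep_normalisation: True
MultiQC: True
EOF
cat /tmp/my_config_demo.yaml
```

Command-line `--config` values override the YAML, which is convenient for switching one option between runs.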
2. Load snakemake

   A. Use module load OR create a dedicated conda environment

   ```bash
   module load snakemake/7.32.4-foss-2022b
   ```
---

**ℹ️ Note**

- Please be careful with your conda/mamba setup: if you applied specific constraints/modifications to your system, this could lead to version discrepancies.
- mamba is usually preferred but might not be installed by default on a shared cluster environment.

---

```bash
conda create -n snakemake -c bioconda -c conda-forge -c defaults -c anaconda snakemake
```

B. Activate the dedicated conda environment

```bash
conda activate snakemake
```

**Reminder:** you will need to verify that this conda environment is activated and provides the right snakemake before each execution (the `which snakemake` command should output something like \<FOLDER>/\<USER>/[ana|mini]conda3/envs/snakemake/bin/snakemake)
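That check can be scripted. A sketch, where the expected path pattern is an assumption based on the reminder above:

```bash
# Return OK only if the given snakemake path comes from the dedicated env.
check_snakemake_path() {
  case "$1" in
    */envs/snakemake/bin/snakemake) echo "OK" ;;
    *) echo "activate the snakemake env first" ;;
  esac
}
check_snakemake_path "$(command -v snakemake || echo none)"
```

Running this before each pipeline launch avoids the classic "wrong snakemake" failure mode.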
3. Run on example data on only one small chromosome (`<disk>` must be replaced by your disk letter/name)

Take a look at the folder structure:

```bash
tree -h .tests/data_CHR17
```

It should be similar to this:

```
Parent_folder
|-- Sample_1
|   `-- fastq
|       |-- Cell_01.1.fastq.gz
|       |-- Cell_01.2.fastq.gz
|       |-- Cell_02.1.fastq.gz
|       |-- Cell_02.2.fastq.gz
|       |-- Cell_03.1.fastq.gz
|       |-- Cell_03.2.fastq.gz
|       |-- Cell_04.1.fastq.gz
|       `-- Cell_04.2.fastq.gz
|
`-- Sample_2
    `-- fastq
        |-- Cell_21.1.fastq.gz
        |-- Cell_21.2.fastq.gz
        |-- Cell_22.1.fastq.gz
        |-- Cell_22.2.fastq.gz
        |-- Cell_23.1.fastq.gz
        |-- Cell_23.2.fastq.gz
        |-- Cell_24.1.fastq.gz
        `-- Cell_24.2.fastq.gz
```

First, use the `--dry-run` option of snakemake to make sure the Graph of Execution is properly connected. (In combination with `--dry-run`, we use the `local/conda` profile, as snakemake still presents a bug when looking for the singularity container.)

```bash
# data_location                : data location
# ashleys_pipeline=True        : download & trigger the ashleys QC upstream module
# ashleys_pipeline_only=True   : stop after ashleys QC - validation purpose
# multistep_normalisation=True : trigger Marco's multistep normalisation
# MultiQC=True                 : trigger samtools stats, FastQC & MultiQC reporting
snakemake \
  --cores 6 \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=True \
    multistep_normalisation=True \
    MultiQC=True \
  --profile workflow/snakemake_profiles/local/conda/ \
  --dry-run   # only check that everything connects well and is ready for computing
```
||
If there is no error message, you are good to go!

```bash
# Snakemake profile if singularity is installed:     workflow/snakemake_profiles/local/conda_singularity/
# Snakemake profile if singularity is NOT installed: workflow/snakemake_profiles/local/conda/
# With singularity, bind your disk into the container: --singularity-args "-B /disk:/disk"
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=True \
    multistep_normalisation=True \
    MultiQC=True \
  --profile workflow/snakemake_profiles/HPC/slurm_EMBL/ \
  --cores 24
```
4. Take a look at the files generated on the example data

```bash
cat .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.info_raw
zcat .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.txt.raw.gz | less
cat .tests/data_CHR17/RPE-BM510/cell_selection/labels.tsv
```
---

Look at the plots:

```
.tests/data_CHR17/RPE-BM510/plots
```

---

**ℹ️ Note**

- Steps 0 - 2 are required only during the first execution
- After the first execution, do not forget to go into the git repository and activate the snakemake environment

---

**ℹ️ Note for 🇪🇺 EMBL users**

- Use the following profile to run on the EMBL cluster: `--profile workflow/snakemake_profiles/HPC/slurm_EMBL`

---
## 🔬 Start running your own analysis

The following commands show an example using local execution (not HPC or cloud).

1. Start running your own Strand-seq analysis

REPORT

```bash
snakemake \
  --cores <N> \
  --config \
    data_location=<INPUT_DATA_FOLDER> \
  --profile workflow/snakemake_profiles/local/conda_singularity/
```
2. Generate report

```bash
# data_location                : data location
# ashleys_pipeline=True        : download & trigger the ashleys QC upstream module
# ashleys_pipeline_only=False  : continue after ashleys QC
# multistep_normalisation=True : trigger Marco's multistep normalisation
# MultiQC=True                 : trigger samtools stats, FastQC & MultiQC reporting
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=False \
    multistep_normalisation=True \
    MultiQC=True \
  --profile workflow/snakemake_profiles/HPC/slurm_EMBL/ \
  --cores 24 \
  --report TEST_DATA_REPORT.zip \
  --report-stylesheet workflow/report/custom-stylesheet.css
```
## System requirements

This workflow is meant to be run on a Unix-based operating system (tested on Ubuntu 18.04 & CentOS 7).

Minimum system requirements vary based on the use case. We highly recommend running it in a server environment with 32+ GB RAM and 12+ cores.

Questions???

- [Conda install instructions](https://conda.io/miniconda.html)
- [Singularity install instructions](https://sylabs.io/guides/3.0/user-guide/quick_start.html#quick-installation-steps)
## Detailed usage

### 🐍 1. Mosaicatcher basic conda environment install

MosaiCatcher leverages snakemake built-in features such as execution within containers and predefined modular conda environments. That is why it is only necessary to create an environment that relies on [snakemake](https://github.com/snakemake/snakemake) (to execute the pipeline) and [pandas](https://github.com/pandas-dev/pandas) (to handle basic configuration). If you plan to generate an HTML web report including plots, it is also necessary to install [imagemagick](https://github.com/ImageMagick/ImageMagick).

If possible, it is also highly recommended to install and use the `mamba` package manager instead of `conda`, which is much more efficient.

SCNOVA

```bash
mkdir -p .tests/data_CHR17/RPE-BM510/scNOVA_input_user
awk 'BEGIN {FS=OFS="\t"} NR==1 {print "Filename", "Subclonality"} NR>1 && $2==1 {sub(/\.sort\.mdup\.bam/, "", $1); print $1, "clone"}' .tests/data_CHR17/RPE-BM510/cell_selection/labels.tsv > .tests/data_CHR17/RPE-BM510/scNOVA_input_user/input_subclonality.txt
```
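To see what that awk one-liner produces, here it is run on a tiny fake `labels.tsv` (the column names and values below are assumptions for the demo, not real workshop data):

```bash
# Fake ashleys labels file: header, one kept cell (prediction 1), one dropped.
printf 'cell\tprediction\nCell_01.sort.mdup.bam\t1\nCell_02.sort.mdup.bam\t0\n' > /tmp/labels_demo.tsv
# Same command as above, pointed at the demo file: it keeps rows where the
# second column is 1, strips the .sort.mdup.bam suffix, and labels every
# remaining cell as one "clone".
awk 'BEGIN {FS=OFS="\t"} NR==1 {print "Filename", "Subclonality"} NR>1 && $2==1 {sub(/\.sort\.mdup\.bam/, "", $1); print $1, "clone"}' /tmp/labels_demo.tsv
```

This prints a two-line TSV: a `Filename`/`Subclonality` header followed by `Cell_01` tagged as `clone`; `Cell_02` is dropped because its prediction is 0.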
```bash
conda install -c conda-forge mamba
mamba create -n snakemake -c bioconda -c conda-forge -c defaults -c anaconda snakemake
```

```bash
# ashleys_pipeline_only=False : continue after ashleys QC
# scNOVA=True                 : trigger the scNOVA downstream module
conda activate mosaicatcher_env
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=False \
    multistep_normalisation=True \
    MultiQC=True \
    scNOVA=True \
  --profile workflow/snakemake_profiles/HPC/slurm_EMBL/ \
  --cores 24
```
### ⤵️ 2. Clone repository & go into workflow directory

After cloning the repo, go into the `workflow` directory, which corresponds to the pipeline entry point.

```bash
git clone --recurse-submodules https://github.com/friendsofstrandseq/mosaicatcher-pipeline.git
cd mosaicatcher-pipeline
```
### ⚙️ 3. MosaiCatcher execution (without preprocessing)