Merge pull request #1 from sanger-tol/dp24_refactoring

Dp24 refactoring
sanger-tol · Aug 21, 2024 · 844c575
2 parents 56760f8 + b443920
commit 844c575

Showing 44 changed files with 780 additions and 2,505 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,15 +2,39 @@
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+Naming based on: [Mythical creatures](https://en.wikipedia.org/wiki/List_of_legendary_creatures_by_type).
 
-## v1.0dev - [date]
+## v1.0.0 - Aquatic Bahamut [21/08/2024]
 
 Initial release of sanger-tol/ear, created with the [nf-core](https://nf-co.re/) template.
+The current pipeline is the minimum viable product (MVP) for ear.
 
-### `Added`
+### Added
+- GFASTATS to generate statistics on the input primary genome.
+- MERQURY_FK to generate kmer graphs and analyses of the primary, haplotype and merged assembly.
+- BLOBTOOLKIT to generate busco files and blobtoolkit dataset/plots.
+- CURATIONPRETEXT to generate pretext plots and pngs.
 
-### `Fixed`
+### Parameters
 
-### `Dependencies`
+| Old parameter   | New parameter |
+| --------------- | ------------- |
+|                 | --mapped      |
 
-### `Deprecated`
+### Software dependencies
+
+| Dependency                  | Old version | New version         |
+| --------------------------- | ----------- | ------------------- |
+| sanger-tol/blobtoolkit*     |             | draft_assemblies    |
+| sanger-tol/curationpretext* |             | 1.0.0 (UNSC Cradle) |
+| GFASTATS                    |             | 1.3.6--hdcf5f25_3   |
+| MERQURY_FK                  |             | 1.2                 |
+| MINIMAP2_ALIGN              |             | 2.28                |
+| SAMTOOLS_MERGE              |             | 1.20--h50ea8bc_0    |
+| SAMTOOLS_SORT               |             | 1.20--h50ea8bc_0    |
+
+- Note: for pipelines, please check their own CHANGELOG file for a full list of software dependencies.
+
+### Dependencies
+The pipeline depends on a number of databases, which are noted in the [README](README.md) and [USAGE](docs/usage.md).
diff --git a/README.md b/README.md
@@ -10,51 +10,74 @@
 
 ## Introduction
 
-**sanger-tol/ear** is a bioinformatics pipeline that ...
+**sanger-tol/ear** is a bioinformatics pipeline that generates the data files required to produce ERGA Assembly Reports. sanger-tol/ear nests two other sanger-tol pipelines (blobtoolkit and curationpretext).
 
-<!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
-
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
-
-1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+1. Read the input YAML file (YAML_INPUT).
+2. Run GFASTATS (GFASTATS).
+3. Run MERQURYFK_MERQURYFK (MERQURYFK).
+4. Run MAIN_MAPPING, long-read single-end/paired-end mapping.
+5. Run GENERATE_SAMPLESHEET, which generates the CSV file required by SANGER_TOL_BTK.
+6. Run SANGER_TOL_BTK, also known as SANGER-TOL/BLOBTOOLKIT, a subpipeline of SANGER-TOL/EAR.
+7. Run SANGER_TOL_CPRETEXT, also known as SANGER-TOL/CURATIONPRETEXT, a subpipeline of SANGER-TOL/EAR.
 
 ## Usage
 
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate):
-
-First, prepare a samplesheet with your input data that looks as follows:
-
-`samplesheet.csv`:
-
-```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+The sanger-tol/ear pipeline requires a number of databases in place in order to run the blobtoolkit pipeline.
+These include:
+   - A blast nt database
+   - A Diamond blast uniprot database
+   - A Diamond blast nr database
+   - An NCBI taxdump
+   - An NCBI rankedlineage.dmp
+
+Next, complete a YAML file containing the following:
+
+```yaml
+# General values for all subpipelines and modules
+assembly_id: <NAME OF ASSEMBLY>
+reference_hap1: <LOCATION OF PRIMARY ASSEMBLY FILE .FA>
+reference_hap2: <LOCATION OF HAPLOTYPE ASSEMBLY FILE .FA>
+reference_haplotigs: <LOCATION OF THE HAPLOTIGS FILE, REMOVED DURING CURATION .FA>
+
+# If a mapped bam already exists use the below + --mapped TRUE on the nextflow command else ignore it and the pipeline will create it.
+mapped_bam: <MAPPED BAM .BAM>
+
+merquryfk:
+  fastk_hist: <THE PATH TO THE .HIST FILE>
+  fastk_ktab: <PATH TO THE DIRECTORY CONTAINING THE KTAB FILES, ENSURE THE HIDDEN FILES ARE HERE TOO>
+
+# Used by both subpipelines
+longread:
+  type: <hifi|clr|ont|illumina>
+  dir: <DIRECTORY OF LONGREAD FILES .FASTA.GZ>
+curationpretext:
+  aligner: <minimap2|BWAMEM>
+  telomere_motif: <TELOMERE MOTIF OF SAMPLE>
+  hic_dir: <DIRECTORY OF HIC READ FILES .CRAM AND .CRAI>
+btk:
+  taxid: 1464561
+  lineages: <CSV LIST OF DATABASES TO USE: "insecta_odb10,diptera_odb10">
+  gca_accession: GCA_0001 <DEFAULT, DO NOT CHANGE UNLESS YOU HAVE A GCA_ACCESSION FOR YOUR SPECIES>
+  nt_database: <DIRECTORY CONTAINING BLAST DB>
+  nt_database_prefix: <BLASTDB PREFIX>
+  diamond_uniprot_database_path: <PATH TO reference_proteomes.dmnd FROM UNIPROT>
+  diamond_nr_database_path: <PATH TO nr.dmnd>
+  ncbi_taxonomy_path: <DIRECTORY CONTAINING THE TAXDUMP>
+  ncbi_rankedlineage_path: <FOLDER CONTAINING THE rankedlineage.dmp FILE>
+  config: <PATH TO ear/conf/sanger-tol-btk.config TO OVERWRITE PROCESS LIMITS>
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-
--->
 
 Now, you can run the pipeline using:
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
-
 ```bash
-nextflow run sanger-tol/ear \
-   -profile <docker/singularity/.../institute> \
-   --input samplesheet.csv \
-   --outdir <OUTDIR>
+nextflow run sanger-tol/ear -profile <singularity,docker> \
+   --input assets/idCulLati1.yaml \
+   --outdir test-truth \
+   --mapped TRUE   # OPTIONAL: only if mapped_bam is set in the YAML
 ```
 
 > [!WARNING]
@@ -65,10 +88,6 @@ nextflow run sanger-tol/ear \
 
 sanger-tol/ear was originally written by DLBPointon.
 
-We thank the following people for their extensive assistance in the development of this pipeline:
-
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
-
 ## Contributions and Support
 
 If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
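
Review note on the README hunk above: the YAML input has several top-level sections, and a missing one will probably only surface once the pipeline is already running. The following is a minimal pre-flight sketch, not part of the pipeline. It assumes that every top-level key shown in the README template is required, and that `mapped_bam` is needed only when `--mapped TRUE` is passed. The function names are hypothetical, and the check looks only at keys in column 0 rather than parsing YAML fully.

```python
import re

# Top-level keys taken from the README template. Treating all of them as
# required is an assumption of this sketch, not something the pipeline states.
REQUIRED_KEYS = [
    "assembly_id",
    "reference_hap1",
    "reference_hap2",
    "reference_haplotigs",
    "merquryfk",
    "longread",
    "curationpretext",
    "btk",
]


def top_level_keys(yaml_text):
    """Return keys that start in column 0 (indented keys and comments are skipped)."""
    keys = []
    for line in yaml_text.splitlines():
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_]*):", line)
        if match:
            keys.append(match.group(1))
    return keys


def missing_keys(yaml_text, mapped=False):
    """List required top-level keys that are absent from the YAML text."""
    required = list(REQUIRED_KEYS)
    if mapped:
        # The README says mapped_bam is only used together with --mapped TRUE.
        required.append("mapped_bam")
    present = set(top_level_keys(yaml_text))
    return [key for key in required if key not in present]
```

Calling `missing_keys(open("assets/idCulLati1.yaml").read(), mapped=True)` before launching Nextflow would return an empty list when every section is present.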

diff --git a/assets/idCulLati1.yaml b/assets/idCulLati1.yaml
@@ -0,0 +1,33 @@
+# General values for all subpipelines and modules
+assembly_id: idCulLati1_ear
+reference_hap1: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/primary.fa
+reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/hap2.fa
+reference_haplotigs: /
+
+# If a mapped bam already exists use the below + --mapped TRUE on the nextflow command else ignore.
+mapped_bam: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/mapped_bam.bam
+
+merquryfk:
+  fastk_hist: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/idCulLati1.k31.hist
+  fastk_ktab: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/
+
+# Used by both subpipelines
+longread:
+  type: hifi
+  dir: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/fasta/
+curationpretext:
+  aligner: minimap2
+  telomere_motif: TTAGG
+  hic_dir: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati2/hic-arima2/
+btk:
+  taxid: 1464561
+  lineages: "insecta_odb10"
+  gca_accession: GCA_0001
+  nt_database: /data/blastdb/Supported/NT/current
+  nt_database_prefix: nt
+  diamond_uniprot_database_path: /lustre/scratch123/tol/resources/uniprot_reference_proteomes/latest/reference_proteomes.dmnd
+  diamond_nr_database_path: /lustre/scratch123/tol/resources/nr/latest/nr.dmnd
+  ncbi_taxonomy_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump/
+  ncbi_rankedlineage_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump/rankedlineage.dmp
+  btk_yaml: /nfs/users/nfs_d/dp24/sanger-tol-ear/assets/btk_draft.yaml
+  config: /nfs/treeoflife-01/teams/tola/users/dp24/ear/conf/sanger-tol-btk.config
diff --git a/assets/real_pdf.yaml b/assets/real_pdf.yaml
@@ -0,0 +1,45 @@
+# SAMPLE INFORMATION
+ToLID: idCulLati1
+Species: Culex laticinctus
+Sex: XX
+Submitter: Michael Paulini
+Affiliation: WSI
+Tags: ERGA-BGE
+
+# SEQUENCING DATA
+DATA:
+  - PacBio HiFi: 51x
+  - Arima v2: 152x
+
+# GENOME PROFILING DATA
+PROFILING:
+  GenomeScope:
+    version: 2.0
+    results_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/
+
+# ASSEMBLY DATA
+ASSEMBLIES:
+  Pre-curation:
+    pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e,  yahs_v1.2a.2|]
+    pri:
+      gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
+      busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
+      merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/
+
+  Curated:
+    pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e,  yahs_v1.2a.2|, TreeVal_v1.1]
+    pri:
+      gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.fa.gfastats
+      busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.insecta_odb10.busco/short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
+      merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.ccs.merquryk/
+      hic_FullMap_png: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal_snapshots/idCulLati1.1_normal_FullMap.png
+      hic_FullMap_link: https://tolqc.cog.sanger.ac.uk/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal.pretext
+      blobplot_cont_png: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_primary_curated_btk_busco.blob.circle.png
+
+# CURATION NOTES
+NOTES:
+  Obs_Haploid_num: 3
+  Obs_Sex: XX
+  Interventions_per_Gb: 430
+  Contamination_notes: "Total length of scaffolds removed: 989,717 (0.1 %)\nScaffolds removed: 1 (0.2 %)\nLargest scaffold removed: (989,717)\nFCS-GX contaminant species (number of scaffolds; total length of scaffolds):\nWolbachia endosymbiont (group B) of Melanostoma mellinum, a-proteobacteria (1; 989,717)"
+  Other_notes: "Chromosomes named by size"
diff --git a/assets/template_pdf.yaml b/assets/template_pdf.yaml
@@ -0,0 +1,45 @@
+# SAMPLE INFORMATION
+ToLID: <SAMPLE_ID>
+Species: <LATIN_NAME>
+Sex: <EXPECTED_SEX>
+Submitter: <CURATOR>
+Affiliation: WSI
+Tags: ERGA-BGE
+
+# SEQUENCING DATA
+DATA:
+  - PacBio HiFi: <PACBIO_COVERAGE>
+  - Arima v2: <ARIMA_COVERAGE>
+
+# GENOME PROFILING DATA
+PROFILING:
+  GenomeScope:
+    version: 2.0
+    results_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/
+
+# ASSEMBLY DATA
+ASSEMBLIES:
+  Pre-curation:
+    pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e,  yahs_v1.2a.2|]
+    pri:
+      gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
+      busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
+      merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/
+
+  Curated:
+    pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e,  yahs_v1.2a.2|, TreeVal_v1.1]
+    pri:
+      gfastats--nstar-report_txt: idCulLati1.1.primary.curated.fa.gfastats
+      busco_short_summary_txt: short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
+      merqury_folder: <POST_CURATION_MERQURY_FOLDER>
+      hic_FullMap_png: <CURATION_PRETEXT_PRETEXT_MAP_PNG>
+      hic_FullMap_link: https://tolqc.cog.sanger.ac.uk/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal.pretext
+      blobplot_cont_png: idCulLati1.1_primary_curated_btk_busco.blob.circle.png
+
+# CURATION NOTES
+NOTES:
+  Obs_Haploid_num: <OBSERVED_HAPLOID_CHROMOSOME_COUNT>
+  Obs_Sex: <OBSERVED_SEX>
+  Interventions_per_Gb: <MANUAL_INTERVENTIONS_PER_GB>
+  Contamination_notes: "Total length of scaffolds removed: 989,717 (0.1 %)\nScaffolds removed: 1 (0.2 %)\nLargest scaffold removed: (989,717)\nFCS-GX contaminant species (number of scaffolds; total length of scaffolds):\nWolbachia endosymbiont (group B) of Melanostoma mellinum, a-proteobacteria (1; 989,717)"
+  Other_notes: "Chromosomes named by size"
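
Review note on `assets/template_pdf.yaml`: the template marks the values to fill in with `<UPPER_CASE>` placeholders. A small check like the one below could flag any that are still unfilled before the report is generated. This is only a sketch: the function name is hypothetical, and the placeholder pattern is inferred from the template rather than defined anywhere in the repository.

```python
import re

# Placeholders in the template look like <SAMPLE_ID> or <PACBIO_COVERAGE>.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*>")


def unfilled_placeholders(text):
    """Return (line_number, placeholder) pairs for every placeholder left in text."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            found.append((lineno, match.group(0)))
    return found
```

The template also carries hard-coded `idCulLati1` / `Culex_laticinctus` paths and contamination notes. This pattern will not catch those, so they still need a manual review.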
diff --git a/assets/test.yaml b/assets/test.yaml
@@ -4,6 +4,7 @@ reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/asccTinyTest_V2/as
 longread:
   type: hifi
   dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/fasta/
+mapped_bam: idCulLati1/mapped_bam.bam
 curationpretext:
   aligner: minimap2
   telomere_motif: TTAGG

diff --git a/conf/base.config b/conf/base.config
@@ -19,6 +19,10 @@ process {
     maxRetries    = 1
     maxErrors     = '-1'
 
+    withName: "SANGER_TOL_CPRETEXT|SANGER_TOL_BTK" {
+        time    = { check_max( 70.h  * task.attempt, 'time'   ) }
+    }
+
     // Process-specific resource requirements
     // NOTE - Please try and re-use the labels below as much as possible.
     //        These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
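
Review note on the `conf/base.config` hunk: the new `time` directive requests `70.h * task.attempt`, so the request grows on each retry. The arithmetic is sketched below, assuming `check_max` behaves as in the nf-core template and caps the request at a configured ceiling. The 240-hour ceiling is an assumed value for illustration, not one taken from this diff.

```python
def requested_time_hours(attempt, base_hours=70, max_hours=240):
    """Hours requested on a given attempt: base * attempt, capped at max_hours.

    max_hours stands in for the pipeline's configured maximum time; 240 is an
    assumed value used only for illustration.
    """
    if attempt < 1:
        raise ValueError("attempt numbering starts at 1")
    return min(base_hours * attempt, max_hours)
```

With `maxRetries = 1`, as in the hunk above, only attempts 1 and 2 occur. Under these assumptions a subpipeline job asks for 70 hours and then 140 hours.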

diff --git a/conf/modules.config b/conf/modules.config
@@ -12,30 +12,42 @@
 
 process {
 
-    publishDir = [
-        path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
-        mode: params.publish_dir_mode,
-        saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
-    ]
+    withName: "GFASTATS|MERQURYFK_MERQURYFK|SANGER_TOL_BTK|SANGER_TOL_CPRETEXT" {
+        publishDir = [
+            path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ]
+    }
 
     withName: GFASTATS {
-        ext.args = '--nstar-report'
+        ext.args            = '--nstar-report'
     }
 
     withName: MERQURYFK_MERQURYFK {
-        ext.args        = "-P."
+        ext.args            = "-P."
     }
 
     withName: SAMTOOLS_SORT {
-        ext.prefix      = { "${meta.id}_sorted"}
+        ext.prefix          = { "${meta.id}_sorted"}
     }
 
     withName: SANGER_TOL_BTK {
-        ext.args            = "--blastx_outext 'txt'"
+        ext.pipeline_name   = "sanger-tol/blobtoolkit"
+        ext.args            = ""
         ext.executor        = "bsub -Is -tty -e test.e -o test.log -n 2 -q oversubscribed -M1400 -R'select[mem>1400] rusage[mem=1400] span[hosts=1]'"
         ext.profiles        = "singularity,sanger"
         ext.get_versions    = "lsid | head -n1 | cut -d ',' -f 1"
         ext.version         = "draft_assemblies"
     }
 
+    withName: SANGER_TOL_CPRETEXT {
+        ext.pipeline_name   = "sanger-tol/curationpretext"
+        ext.args            = ""
+        ext.executor        = "bsub -Is -tty -e test.e -o test.log -n 2 -q oversubscribed -M1400 -R'select[mem>1400] rusage[mem=1400] span[hosts=1]'"
+        ext.profiles        = "singularity,sanger"
+        ext.get_versions    = "lsid | head -n1 | cut -d ',' -f 1"
+        ext.version         = "1.0.0"
+    }
+
 }
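
Review note on the `publishDir` path in `conf/modules.config`: the output sub-directory is the last `:`-separated part of the process name, cut at the first `_` and lower-cased. The Python sketch below mirrors that expression to show which directory names result. The `EAR:` workflow prefix is a hypothetical fully-qualified name used only for illustration.

```python
def tokenize(text, sep):
    """Mimic Groovy's String.tokenize: split on sep and drop empty tokens."""
    return [token for token in text.split(sep) if token]


def publish_subdir(process_name):
    """Mirror task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()."""
    simple_name = tokenize(process_name, ":")[-1]
    return tokenize(simple_name, "_")[0].lower()


if __name__ == "__main__":
    for name in ("EAR:GFASTATS", "EAR:MERQURYFK_MERQURYFK",
                 "EAR:SANGER_TOL_BTK", "EAR:SANGER_TOL_CPRETEXT"):
        print(name, "->", publish_subdir(name))
```

Under this reading, `SANGER_TOL_BTK` and `SANGER_TOL_CPRETEXT` both resolve to `sanger`, so the two subpipelines would publish into the same `${params.outdir}/sanger` directory. It may be worth confirming that this is intended.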
diff --git a/conf/sanger-tol-btk.config b/conf/sanger-tol-btk.config
@@ -0,0 +1,7 @@
+process {
+    withLabel:RUN_BLASTN:BLASTN_TAXON {
+        cpus   = { check_max( 12    * task.attempt, 'cpus'    ) }
+        memory = { check_max( 10.GB * task.attempt, 'memory'  ) }
+        time   = { check_max( 16.h  * task.attempt, 'time'    ) }
+    }
+}