diff --git a/CHANGELOG.md b/CHANGELOG.md index c3d90bc..3173f7c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,15 +2,39 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +Naming based on: [Mythical creatures](https://en.wikipedia.org/wiki/List_of_legendary_creatures_by_type). -## v1.0dev - [date] +## v1.0.0 - Aquatic Bahamut [21/08/2024] Initial release of sanger-tol/ear, created with the [nf-core](https://nf-co.re/) template. +The current pipeline represents the MVP for ear. -### `Added` +### Added +GFASTATS to generate statistics on the input primary genome. +MERQURY_FK to generate kmer graphs and analyses of the primary, haplotype and merged assembly. +BLOBTOOLKIT to generate busco files and blobtoolkit dataset/plots. +CURATIONPRETEXT to generate pretext plots and pngs. -### `Fixed` +### Parameters -### `Dependencies` +| Old parameter | New parameter | +| --------------- | ------------- | +| | --mapped | -### `Deprecated` +### Software dependencies + +| Dependency | Old version | New version | +| ----------- | ------------- | ------------- | +| sanger-tol/blobtoolkit* | | draft_assemblies | +| sanger-tol/curationpretext* | | 1.0.0 (UNSC Cradle) | +| GFASTATS | | 1.3.6--hdcf5f25_3 | +| MERQURY_FK | | 1.2 | +| MINIMAP2_ALIGN | | 2.28 | +| SAMTOOLS_MERGE | | 1.20--h50ea8bc_0 | +| SAMTOOLS_SORT | | 1.20--h50ea8bc_0 | +| + +- Note: for pipelines, please check their own CHANGELOG file for a full list of software dependencies. + +### Dependencies +The pipeline depends on a number of databases which are noted in [README](README.md) and [USAGE](docs/usage.md). diff --git a/README.md b/README.md index 506512d..652eba6 100644 --- a/README.md +++ b/README.md @@ -10,51 +10,74 @@ ## Introduction -**sanger-tol/ear** is a bioinformatics pipeline that ... 
+**sanger-tol/ear** is a bioinformatics pipeline that generates the data files required for the generation of ERGA Assembly Reports. Sanger-tol/ear nests two other sanger-tol pipelines (blobtoolkit and curationpretext). - - - - - -1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)) -2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/)) +1. Read the input yaml file (YAML_INPUT) +2. Run GFASTATS (GFASTATS) +3. Run MERQURYFK_MERQURYFK (MERQURYFK) +4. Run MAIN_MAPPING, longread single-end/paired-end mapping +5. Run GENERATE_SAMPLESHEET, generate a csv file required for SANGER_TOL_BTK. +6. Run SANGER_TOL_BTK, also known as SANGER-TOL/BLOBTOOLKIT, a subpipeline for SANGER-TOL/EAR. +7. Run SANGER_TOL_CPRETEXT, also known as SANGER-TOL/CURATIONPRETEXT, a subpipeline for SANGER-TOL/EAR. ## Usage > [!NOTE] > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data. - Now, you can run the pipeline using: - - ```bash -nextflow run sanger-tol/ear \ - -profile \ - --input samplesheet.csv \ - --outdir +nextflow run sanger-tol/ear -profile \\ + --input assets/idCulLati1.yaml \\ + --mapped TRUE \\ # OPTIONAL + --outdir test-truth ``` > [!WARNING] @@ -65,10 +88,6 @@ nextflow run sanger-tol/ear \ sanger-tol/ear was originally written by DLBPointon. -We thank the following people for their extensive assistance in the development of this pipeline: - - - ## Contributions and Support If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md). 
diff --git a/assets/idCulLati1.yaml b/assets/idCulLati1.yaml new file mode 100644 index 0000000..ea48cc2 --- /dev/null +++ b/assets/idCulLati1.yaml @@ -0,0 +1,33 @@ +# General Vales for all subpiplines and modules +assembly_id: idCulLati1_ear +reference_hap1: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/primary.fa +reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/hap2.fa +reference_haplotigs: / + +# If a mapped bam already exists use the below + --mapped TRUE on the nextflow command else ignore. +mapped_bam: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/mapped_bam.bam + +merquryfk: + fastk_hist: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/idCulLati1.k31.hist + fastk_ktab: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/ + +# Used by both subpipelines +longread: + type: hifi + dir: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/fasta/ +curationpretext: + aligner: minimap2 + telomere_motif: TTAGG + hic_dir: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati2/hic-arima2/ +btk: + taxid: 1464561 + lineages: "insecta_odb10" + gca_accession: GCA_0001 + nt_database: /data/blastdb/Supported/NT/current + nt_database_prefix: nt + diamond_uniprot_database_path: /lustre/scratch123/tol/resources/uniprot_reference_proteomes/latest/reference_proteomes.dmnd + diamond_nr_database_path: /lustre/scratch123/tol/resources/nr/latest/nr.dmnd + ncbi_taxonomy_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump/ + ncbi_rankedlineage_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump/rankedlineage.dmp + btk_yaml: /nfs/users/nfs_d/dp24/sanger-tol-ear/assets/btk_draft.yaml + config: /nfs/treeoflife-01/teams/tola/users/dp24/ear/conf/sanger-tol-btk.config diff --git a/assets/real_pdf.yaml b/assets/real_pdf.yaml new file mode 100644 index 
0000000..8f8d4a0 --- /dev/null +++ b/assets/real_pdf.yaml @@ -0,0 +1,45 @@ +# SAMPLE INFORMATION +ToLID: idCulLati1 +Species: Culex laticinctus +Sex: XX +Submitter: Michael Paulini +Affiliation: WSI +Tags: ERGA-BGE + +# SEQUENCING DATA +DATA: + - PacBio HiFi: 51x + - Arima v2: 152x + +# GENOME PROFILING DATA +PROFILING: + GenomeScope: + version: 2.0 + results_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/ + +# ASSEMBLY DATA +ASSEMBLIES: + Pre-curation: + pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|] + pri: + gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats + busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt + merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/ + + Curated: + pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1] + pri: + gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.fa.gfastats + busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.insecta_odb10.busco/short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt + merqury_folder: 
/lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.ccs.merquryk/ + hic_FullMap_png: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal_snapshots/idCulLati1.1_normal_FullMap.png + hic_FullMap_link: https://tolqc.cog.sanger.ac.uk/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal.pretext + blobplot_cont_png: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_primary_curated_btk_busco.blob.circle.png + +# CURATION NOTES +NOTES: + Obs_Haploid_num: 3 + Obs_Sex: XX + Interventions_per_Gb: 430 + Contamination_notes: "Total length of scaffolds removed: 989,717 (0.1 %)\nScaffolds removed: 1 (0.2 %)\nLargest scaffold removed: (989,717)\nFCS-GX contaminant species (number of scaffolds; total length of scaffolds):\nWolbachia endosymbiont (group B) of Melanostoma mellinum, a-proteobacteria (1; 989,717)" + Other_notes: "Chromosomes named by size" diff --git a/assets/template_pdf.yaml b/assets/template_pdf.yaml new file mode 100644 index 0000000..3779c19 --- /dev/null +++ b/assets/template_pdf.yaml @@ -0,0 +1,45 @@ +# SAMPLE INFORMATION +ToLID: +Species: +Sex: +Submitter: +Affiliation: WSI +Tags: ERGA-BGE + +# SEQUENCING DATA +DATA: + - PacBio HiFi: + - Arima v2: + +# GENOME PROFILING DATA +PROFILING: + GenomeScope: + version: 2.0 + results_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/ + +# ASSEMBLY DATA +ASSEMBLIES: + Pre-curation: + pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|] + pri: + gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats + busco_short_summary_txt: 
/lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt + merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/ + + Curated: + pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1] + pri: + gfastats--nstar-report_txt: idCulLati1.1.primary.curated.fa.gfastats + busco_short_summary_txt: short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt + merqury_folder: + hic_FullMap_png: + hic_FullMap_link: https://tolqc.cog.sanger.ac.uk/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal.pretext + blobplot_cont_png: idCulLati1.1_primary_curated_btk_busco.blob.circle.png + +# CURATION NOTES +NOTES: + Obs_Haploid_num: + Obs_Sex: + Interventions_per_Gb: + Contamination_notes: "Total length of scaffolds removed: 989,717 (0.1 %)\nScaffolds removed: 1 (0.2 %)\nLargest scaffold removed: (989,717)\nFCS-GX contaminant species (number of scaffolds; total length of scaffolds):\nWolbachia endosymbiont (group B) of Melanostoma mellinum, a-proteobacteria (1; 989,717)" + Other_notes: "Chromosomes named by size" diff --git a/assets/test.yaml b/assets/test.yaml index d4da164..6a5299a 100755 --- a/assets/test.yaml +++ b/assets/test.yaml @@ -4,6 +4,7 @@ reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/asccTinyTest_V2/as longread: type: hifi dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/fasta/ +mapped_bam: idCulLati1/mapped_bam.bam curationpretext: aligner: minimap2 telomere_motif: TTAGG diff --git a/conf/base.config b/conf/base.config index 4136c84..e609a9e 100644 
--- a/conf/base.config +++ b/conf/base.config @@ -19,6 +19,10 @@ process { maxRetries = 1 maxErrors = '-1' + withName: "SANGER_TOL_CPRETEXT|SANGER_TOL_BTK" { + time = { check_max( 70.h * task.attempt, 'time' ) } + } + // Process-specific resource requirements // NOTE - Please try and re-use the labels below as much as possible. // These labels are used and recognised by default in DSL2 files hosted on nf-core/modules. diff --git a/conf/modules.config b/conf/modules.config index a96a69f..137b892 100644 --- a/conf/modules.config +++ b/conf/modules.config @@ -12,30 +12,42 @@ process { - publishDir = [ - path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" }, - mode: params.publish_dir_mode, - saveAs: { filename -> filename.equals('versions.yml') ? null : filename } - ] + withName: "GFASTATS|MERQURYFK_MERQURYFK|SANGER_TOL_BTK|SANGER_TOL_CPRETEXT" { + publishDir = [ + path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } withName: GFASTATS { - ext.args = '--nstar-report' + ext.args = '--nstar-report' } withName: MERQURYFK_MERQURYFK { - ext.args = "-P." + ext.args = "-P." 
} withName: SAMTOOLS_SORT { - ext.prefix = { "${meta.id}_sorted"} + ext.prefix = { "${meta.id}_sorted"} } withName: SANGER_TOL_BTK { - ext.args = "--blastx_outext 'txt'" + ext.pipeline_name = "sanger-tol/blobtoolkit" + ext.args = "" ext.executor = "bsub -Is -tty -e test.e -o test.log -n 2 -q oversubscribed -M1400 -R'select[mem>1400] rusage[mem=1400] span[hosts=1]'" ext.profiles = "singularity,sanger" ext.get_versions = "lsid | head -n1 | cut -d ',' -f 1" ext.version = "draft_assemblies" } + withName: SANGER_TOL_CPRETEXT { + ext.pipeline_name = "sanger-tol/curationpretext" + ext.args = "" + ext.executor = "bsub -Is -tty -e test.e -o test.log -n 2 -q oversubscribed -M1400 -R'select[mem>1400] rusage[mem=1400] span[hosts=1]'" + ext.profiles = "singularity,sanger" + ext.get_versions = "lsid | head -n1 | cut -d ',' -f 1" + ext.version = "1.0.0" + } + } diff --git a/conf/sanger-tol-btk.config b/conf/sanger-tol-btk.config new file mode 100644 index 0000000..247dbbd --- /dev/null +++ b/conf/sanger-tol-btk.config @@ -0,0 +1,7 @@ +process { + withLabel:RUN_BLASTN:BLASTN_TAXON { + cpus = { check_max( 12 * task.attempt, 'cpus' ) } + memory = { check_max( 10.GB * task.attempt, 'memory' ) } + time = { check_max( 16.h * task.attempt, 'time' ) } + } +} \ No newline at end of file diff --git a/docs/output.md b/docs/output.md index 335ec21..f5a9c8b 100644 --- a/docs/output.md +++ b/docs/output.md @@ -6,54 +6,80 @@ This document describes the output produced by the pipeline. Most of the plots a The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. 
- - ## Pipeline overview The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: -- [FastQC](#fastqc) - Raw read QC -- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline +- [GFASTATS](#gfastats) - Collect statistics on the curated primary assembly +- [MERQURYFK](#merquryfk) - Generate kmer plots for the curated assembly using previous run information +- [SANGER_TOL_BTK](#sanger_tol_btk) - Run Blobtoolkit to generate plots and short_summary.txt from BUSCO. +- [SANGER_TOL_CPRETEXT](#sanger_tol_cpretext) - Run Curationpretext to generate Pretext files and accessory tracks. - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution -### FastQC +### GFASTATS + +
+Output files + +- `gfastats/` + - `*.assembly.summary`: Assembly metrics of the input primary file. + - `*_fasta.gz`: GZipped primary assembly file. + +
+ +[GFASTATS](https://github.com/vgl-hub/gfastats) is a single fast and exhaustive tool for summary statistics and simultaneous *fa* (fasta, fastq, gfa [.gz]) genome assembly file manipulation. + +### MERQURYFK
Output files -- `fastqc/` - - `*_fastqc.html`: FastQC report containing quality metrics. - - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images. +- `merquryfk/` + - `*.completeness.stats`: + - `*{"primary","haplotype",""}_only.bed`: + - `*{"primary","haplotype",""}.qv`: + - `*.spectra-asm.{fl,ln,st}.png`: + - `*{"primary","haplotype"}.spectra-cn.{fl,ln,st}.png`:
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). +[MERQURYFK](https://github.com/thegenemyers/MERQURY.FK) is a FastK based version of Merqury. + +Merqury is a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. + + +## SANGER_TOL_BTK -![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png) +
+Output files + +- `sanger/*_blobtoolkit_out/` + - `blobtoolkit/plots/*png`: Blobtoolkit plots + - `blobtoolkit/{ASSEMBLY_NAME}/*.json.gz`: Blobtoolkit dataset for use in BTK_viewer. + - `busco/*_odb10/*.{tsv,tar.gz,json,txt}`: Busco output + - `multiqc/`: MultiQC plots/data and report.html. + - [`pipeline_info`](#pipeline-information) + +
-![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png) +[SANGER_TOL_BTK](https://pipelines.tol.sanger.ac.uk/blobtoolkit) is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes. -:::note -The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality. -::: -### MultiQC +## SANGER_TOL_CPRETEXT
Output files -- `multiqc/` - - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser. - - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline. - - `multiqc_plots/`: directory containing static images from the report in various formats. +- `sanger/*_curationpretext_out/` + - `accessory_files/*.{bigWig,bed,bedgraph}`: Track files describing telomere, gap, and coverage data across the genome. + - `pretext_maps_raw`: Pre-accessory file ingestion pretext files. + - `pretext_maps_processed`: Post-accessory file ingestion pretext files, i.e. the final output. + - [`pipeline_info`](#pipeline-information)
-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. +[SANGER_TOL_CPRETEXT](https://pipelines.tol.sanger.ac.uk/curationpretext) is a bioinformatics pipeline typically used in conjunction with [TreeVal](https://pipelines.tol.sanger.ac.uk/treeval) to generate pretext maps (and optionally telomeric, gap, coverage, and repeat density plots which can be ingested into pretext) for the manual curation of high quality genomes. -Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see . ### Pipeline information diff --git a/docs/usage.md b/docs/usage.md index 42521d3..b703d3e 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -6,60 +6,179 @@ -## Samplesheet input +## Yaml input -You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. +You will need to create a yaml with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. ```bash --input '[path to samplesheet file]' ``` -### Multiple runs of the same sample +The structure of this file should be as follows: -The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. 
Below is an example for the same sample sequenced across 3 lanes: +```yaml +# General values for all subpipelines and modules +assembly_id: +reference_hap1: +reference_hap2: +reference_haplotigs: + +# If a mapped bam already exists, use the below + --mapped TRUE on the nextflow command; otherwise ignore it and the pipeline will create it. +mapped_bam: + +merquryfk: + fastk_hist: + fastk_ktab: