Merge pull request #1 from sanger-tol/dp24_refactoring
Dp24 refactoring
DLBPointon authored Aug 21, 2024
2 parents 56760f8 + b443920 commit 844c575
Showing 44 changed files with 780 additions and 2,505 deletions.
34 changes: 29 additions & 5 deletions CHANGELOG.md
@@ -2,15 +2,39 @@

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
Naming based on: [Mythical creatures](https://en.wikipedia.org/wiki/List_of_legendary_creatures_by_type).

## v1.0dev - [date]
## v1.0.0 - Aquatic Bahamut [21/08/2024]

Initial release of sanger-tol/ear, created with the [nf-core](https://nf-co.re/) template.
The current pipeline is the MVP for ear.

### `Added`
### Added
GFASTATS to generate statistics on the input primary genome.
MERQURY_FK to generate kmer graphs and analyses of the primary, haplotype and merged assembly.
BLOBTOOLKIT to generate busco files and blobtoolkit dataset/plots.
CURATIONPRETEXT to generate pretext plots and pngs.

### `Fixed`
### Parameters

### `Dependencies`
| Old parameter | New parameter |
| --------------- | ------------- |
| | --mapped |

### `Deprecated`
### Software dependencies

| Dependency | Old version | New version |
| ----------- | ------------- | ------------- |
| sanger-tol/blobtoolkit* | | draft_assemblies |
| sanger-tol/curationpretext* | | 1.0.0 (UNSC Cradle) |
| GFASTATS | | 1.3.6--hdcf5f25_3 |
| MERQURY_FK | | 1.2 |
| MINIMAP2_ALIGN | | 2.28 |
| SAMTOOLS_MERGE | | 1.20--h50ea8bc_0 |
| SAMTOOLS_SORT | | 1.20--h50ea8bc_0 |

- Note: for pipelines, please check their own CHANGELOG file for a full list of software dependencies.

### Dependencies
The pipeline depends on a number of databases which are noted in [README](README.md) and [USAGE](docs/usage.md).
91 changes: 55 additions & 36 deletions README.md
@@ -10,51 +10,74 @@

## Introduction

**sanger-tol/ear** is a bioinformatics pipeline that ...
**sanger-tol/ear** is a bioinformatics pipeline that generates the data files required to produce ERGA Assembly Reports. sanger-tol/ear nests two other sanger-tol pipelines (blobtoolkit and curationpretext).

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
1. Read the input yaml file (YAML_INPUT)
2. Run GFASTATS (GFASTATS)
3. Run MERQURYFK_MERQURYFK (MERQURYFK)
4. Run MAIN_MAPPING, long-read single-end/paired-end mapping.
5. Run GENERATE_SAMPLESHEET, which generates the CSV file required by SANGER_TOL_BTK.
6. Run SANGER_TOL_BTK, also known as SANGER-TOL/BLOBTOOLKIT, a subpipeline of SANGER-TOL/EAR.
7. Run SANGER_TOL_CPRETEXT, also known as SANGER-TOL/CURATIONPRETEXT, a subpipeline of SANGER-TOL/EAR.

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):
First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:
```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
The sanger-tol/ear pipeline requires a number of databases in place in order to run the blobtoolkit pipeline.
These include:
- A blast nt database
- A Diamond blast uniprot database
- A Diamond blast nr database
- An NCBI taxdump
- An NCBI rankedlineage.dmp

Next, complete a YAML file containing the following:

```yaml
# General values for all subpipelines and modules
assembly_id: <NAME OF ASSEMBLY>
reference_hap1: <LOCATION OF PRIMARY ASSEMBLY FILE .FA>
reference_hap2: <LOCATION OF HAPLOTYPE ASSEMBLY FILE .FA>
reference_haplotigs: <LOCATION OF THE HAPLOTIGS FILE, REMOVED DURING CURATION .FA>

# If a mapped BAM already exists, set it below and add --mapped TRUE to the nextflow command; otherwise omit it and the pipeline will create one.
mapped_bam: <MAPPED BAM .BAM>

merquryfk:
fastk_hist: <THE PATH TO THE .HIST FILE>
fastk_ktab: <PATH TO THE DIRECTORY CONTAINING THE KTAB FILES, ENSURE THE HIDDEN FILES ARE HERE TOO>

# Used by both subpipelines
longread:
type: <hifi|clr|ont|illumina>
dir: <DIRECTORY OF LONGREAD FILES .FASTA.GZ>
curationpretext:
aligner: <minimap2|BWAMEM>
telomere_motif: <TELOMERE MOTIF OF SAMPLE>
hic_dir: <DIRECTORY OF HIC READ FILES .CRAM AND .CRAI>
btk:
taxid: 1464561
lineages: <CSV LIST OF DATABASES TO USE: "insecta_odb10,diptera_odb10">
gca_accession: GCA_0001 <DEFAULT, DO NOT CHANGE UNLESS YOU HAVE A GCA_ACCESSION FOR YOUR SPECIES>
nt_database: <DIRECTORY CONTAINING BLAST DB>
nt_database_prefix: <BLASTDB PREFIX>
diamond_uniprot_database_path: <PATH TO reference_proteomes.dmnd FROM UNIPROT>
diamond_nr_database_path: <PATH TO nr.dmnd>
ncbi_taxonomy_path: <DIRECTORY CONTAINING THE TAXDUMP>
ncbi_rankedlineage_path: <FOLDER CONTAINING THE rankedlineage.dmp FILE>
config: <PATH TO ear/conf/sanger-tol-btk.config TO OVERWRITE PROCESS LIMITS>
```
Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-->
Now, you can run the pipeline using:
<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run sanger-tol/ear \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
# --mapped TRUE is optional; pass it only when mapped_bam is set in the YAML
nextflow run sanger-tol/ear -profile <singularity,docker> \
--input assets/idCulLati1.yaml \
--mapped TRUE \
--outdir test-truth
```
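
Before launching, it can help to confirm that the input YAML contains the sections listed in the README. The following is a minimal sketch, not part of the pipeline. It assumes the file has already been parsed into a Python dict (for example by a YAML library), that `merquryfk`, `longread`, `curationpretext` and `btk` are nested mappings, and that `mapped_bam` is a top-level key; the indentation is not visible in the listing, so this layout is an assumption. The names `REQUIRED_KEYS` and `missing_keys` are hypothetical.

```python
# Hypothetical helper: report README-listed keys missing from a parsed input YAML.
# The nesting below is an assumption; adjust it to match your actual file.
REQUIRED_KEYS = {
    "assembly_id": None,
    "reference_hap1": None,
    "reference_hap2": None,
    "reference_haplotigs": None,
    "merquryfk": ["fastk_hist", "fastk_ktab"],
    "longread": ["type", "dir"],
    "curationpretext": ["aligner", "telomere_motif", "hic_dir"],
    "btk": ["taxid", "lineages", "nt_database", "nt_database_prefix"],
}


def missing_keys(config, mapped=False):
    """Return dotted names of required keys that are absent from `config`."""
    missing = []
    for key, children in REQUIRED_KEYS.items():
        if key not in config:
            missing.append(key)
            continue
        # An empty section parses as None, so treat it as an empty mapping.
        for child in children or []:
            if child not in (config[key] or {}):
                missing.append(f"{key}.{child}")
    # mapped_bam is only needed when the run passes --mapped TRUE.
    if mapped and "mapped_bam" not in config:
        missing.append("mapped_bam")
    return missing
```

An empty list means every key in the sketch was found; anything else names the entries to fill in before calling `nextflow run`.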

> [!WARNING]
@@ -65,10 +88,6 @@ nextflow run sanger-tol/ear \

sanger-tol/ear was originally written by DLBPointon.

We thank the following people for their extensive assistance in the development of this pipeline:

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->

## Contributions and Support

If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
33 changes: 33 additions & 0 deletions assets/idCulLati1.yaml
@@ -0,0 +1,33 @@
# General values for all subpipelines and modules
assembly_id: idCulLati1_ear
reference_hap1: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/primary.fa
reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/hap2.fa
reference_haplotigs: /

# If a mapped bam already exists use the below + --mapped TRUE on the nextflow command else ignore.
mapped_bam: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/mapped_bam.bam

merquryfk:
fastk_hist: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/idCulLati1.k31.hist
fastk_ktab: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/

# Used by both subpipelines
longread:
type: hifi
dir: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/fasta/
curationpretext:
aligner: minimap2
telomere_motif: TTAGG
hic_dir: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati2/hic-arima2/
btk:
taxid: 1464561
lineages: "insecta_odb10"
gca_accession: GCA_0001
nt_database: /data/blastdb/Supported/NT/current
nt_database_prefix: nt
diamond_uniprot_database_path: /lustre/scratch123/tol/resources/uniprot_reference_proteomes/latest/reference_proteomes.dmnd
diamond_nr_database_path: /lustre/scratch123/tol/resources/nr/latest/nr.dmnd
ncbi_taxonomy_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump/
ncbi_rankedlineage_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump/rankedlineage.dmp
btk_yaml: /nfs/users/nfs_d/dp24/sanger-tol-ear/assets/btk_draft.yaml
config: /nfs/treeoflife-01/teams/tola/users/dp24/ear/conf/sanger-tol-btk.config
45 changes: 45 additions & 0 deletions assets/real_pdf.yaml
@@ -0,0 +1,45 @@
# SAMPLE INFORMATION
ToLID: idCulLati1
Species: Culex laticinctus
Sex: XX
Submitter: Michael Paulini
Affiliation: WSI
Tags: ERGA-BGE

# SEQUENCING DATA
DATA:
- PacBio HiFi: 51x
- Arima v2: 152x

# GENOME PROFILING DATA
PROFILING:
GenomeScope:
version: 2.0
results_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/

# ASSEMBLY DATA
ASSEMBLIES:
Pre-curation:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pri:
gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/

Curated:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pri:
gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.fa.gfastats
busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.insecta_odb10.busco/short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.ccs.merquryk/
hic_FullMap_png: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal_snapshots/idCulLati1.1_normal_FullMap.png
hic_FullMap_link: https://tolqc.cog.sanger.ac.uk/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal.pretext
blobplot_cont_png: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_primary_curated_btk_busco.blob.circle.png

# CURATION NOTES
NOTES:
Obs_Haploid_num: 3
Obs_Sex: XX
Interventions_per_Gb: 430
Contamination_notes: "Total length of scaffolds removed: 989,717 (0.1 %)\nScaffolds removed: 1 (0.2 %)\nLargest scaffold removed: (989,717)\nFCS-GX contaminant species (number of scaffolds; total length of scaffolds):\nWolbachia endosymbiont (group B) of Melanostoma mellinum, a-proteobacteria (1; 989,717)"
Other_notes: "Chromosomes named by size"
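
The `pipeline:` lists in this file appear to pack a tool name, a version and its arguments into each entry as `tool_vVERSION|ARGS`. The sketch below splits one entry under that assumed convention; the format is inferred from the entries shown here rather than from any documented schema, and `parse_pipeline_entry` is a hypothetical helper name.

```python
def parse_pipeline_entry(entry):
    """Split an entry such as 'purge_dups_v1.2.5|-e' into (tool, version, args).

    Assumes the 'tool_vVERSION|ARGS' convention seen in the YAML above.
    """
    name_version, _, args = entry.strip().partition("|")
    # Split on the last '_v' so tool names containing underscores stay intact.
    tool, _, version = name_version.rpartition("_v")
    if not tool:
        # No '_v' marker: treat the whole token as the tool name.
        tool, version = name_version, ""
    return tool, version, args.strip()


for entry in ["hifiasm_v0.19.8-r603|--primary", "purge_dups_v1.2.5|-e",
              "yahs_v1.2a.2|", "TreeVal_v1.1"]:
    print(parse_pipeline_entry(entry))
```

Entries with nothing after the `|`, or with no `|` at all, yield an empty argument string.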
45 changes: 45 additions & 0 deletions assets/template_pdf.yaml
@@ -0,0 +1,45 @@
# SAMPLE INFORMATION
ToLID: <SAMPLE_ID>
Species: <LATIN_NAME>
Sex: <EXPECTED_SEX>
Submitter: <CURATOR>
Affiliation: WSI
Tags: ERGA-BGE

# SEQUENCING DATA
DATA:
- PacBio HiFi: <PACBIO_COVERAGE>
- Arima v2: <ARIMA_COVERAGE>

# GENOME PROFILING DATA
PROFILING:
GenomeScope:
version: 2.0
results_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/

# ASSEMBLY DATA
ASSEMBLIES:
Pre-curation:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pri:
gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/

Curated:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pri:
gfastats--nstar-report_txt: idCulLati1.1.primary.curated.fa.gfastats
busco_short_summary_txt: short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
merqury_folder: <POST_CURATION_MERQURY_FOLDER>
hic_FullMap_png: <CURATION_PRETEXT_PRETEXT_MAP_PNG>
hic_FullMap_link: https://tolqc.cog.sanger.ac.uk/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal.pretext
blobplot_cont_png: idCulLati1.1_primary_curated_btk_busco.blob.circle.png

# CURATION NOTES
NOTES:
Obs_Haploid_num: <OBSERVED_HAPLOID_CHROMOSOME_COUNT>
Obs_Sex: <OBSERVED_SEX>
Interventions_per_Gb: <MANUAL_INTERVENTIONS_PER_GB>
Contamination_notes: "Total length of scaffolds removed: 989,717 (0.1 %)\nScaffolds removed: 1 (0.2 %)\nLargest scaffold removed: (989,717)\nFCS-GX contaminant species (number of scaffolds; total length of scaffolds):\nWolbachia endosymbiont (group B) of Melanostoma mellinum, a-proteobacteria (1; 989,717)"
Other_notes: "Chromosomes named by size"
1 change: 1 addition & 0 deletions assets/test.yaml
@@ -4,6 +4,7 @@ reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/asccTinyTest_V2/as
longread:
type: hifi
dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/fasta/
mapped_bam: idCulLati1/mapped_bam.bam
curationpretext:
aligner: minimap2
telomere_motif: TTAGG
4 changes: 4 additions & 0 deletions conf/base.config
@@ -19,6 +19,10 @@ process {
maxRetries = 1
maxErrors = '-1'

withName: "SANGER_TOL_CPRETEXT|SANGER_TOL_BTK" {
time = { check_max( 70.h * task.attempt, 'time' ) }
}

// Process-specific resource requirements
// NOTE - Please try and re-use the labels below as much as possible.
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
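
The `withName` block added in this hunk requests `70.h * task.attempt` for the two nested pipelines. The sketch below shows how that request grows across retries, assuming `check_max` behaves as in the nf-core template, that is, it returns the smaller of the requested value and a configured maximum. The 240-hour ceiling is an illustrative value, not one taken from this repository, and `requested_hours` is a hypothetical name.

```python
def requested_hours(base_hours, attempt, max_hours):
    """Hours requested for a given attempt, capped at max_hours."""
    return min(base_hours * attempt, max_hours)


# With maxRetries = 1 a task runs at most twice (attempt 1, then attempt 2).
for attempt in (1, 2):
    print(f"attempt {attempt}: {requested_hours(70, attempt, 240)} h")
```

Under these assumptions the first attempt asks for 70 hours and the single retry for 140 hours; a lower ceiling would clip the retry to that ceiling.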
30 changes: 21 additions & 9 deletions conf/modules.config
@@ -12,30 +12,42 @@

process {

publishDir = [
path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
withName: "GFASTATS|MERQURYFK_MERQURYFK|SANGER_TOL_BTK|SANGER_TOL_CPRETEXT" {
publishDir = [
path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: GFASTATS {
ext.args = '--nstar-report'
ext.args = '--nstar-report'
}

withName: MERQURYFK_MERQURYFK {
ext.args = "-P."
ext.args = "-P."
}

withName: SAMTOOLS_SORT {
ext.prefix = { "${meta.id}_sorted"}
ext.prefix = { "${meta.id}_sorted"}
}

withName: SANGER_TOL_BTK {
ext.args = "--blastx_outext 'txt'"
ext.pipeline_name = "sanger-tol/blobtoolkit"
ext.args = ""
ext.executor = "bsub -Is -tty -e test.e -o test.log -n 2 -q oversubscribed -M1400 -R'select[mem>1400] rusage[mem=1400] span[hosts=1]'"
ext.profiles = "singularity,sanger"
ext.get_versions = "lsid | head -n1 | cut -d ',' -f 1"
ext.version = "draft_assemblies"
}

withName: SANGER_TOL_CPRETEXT {
ext.pipeline_name = "sanger-tol/curationpretext"
ext.args = ""
ext.executor = "bsub -Is -tty -e test.e -o test.log -n 2 -q oversubscribed -M1400 -R'select[mem>1400] rusage[mem=1400] span[hosts=1]'"
ext.profiles = "singularity,sanger"
ext.get_versions = "lsid | head -n1 | cut -d ',' -f 1"
ext.version = "1.0.0"
}

}
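
The `publishDir` path in this config derives the output subdirectory from the process name via `tokenize(':')[-1].tokenize('_')[0].toLowerCase()`. The sketch below mirrors that expression in Python to show which directory each listed process would publish to. It assumes Groovy's `tokenize` semantics (split on the delimiter and drop empty tokens); the workflow prefix `SANGERTOL_EAR:EAR:` in the example is hypothetical, and so is the helper name `publish_subdir`.

```python
def publish_subdir(process_name):
    """Mirror tokenize(':')[-1].tokenize('_')[0].toLowerCase() from the config."""
    # Keep only the simple process name after the last ':'.
    simple_name = [t for t in process_name.split(":") if t][-1]
    # Take the first '_'-separated token and lower-case it.
    first_token = [t for t in simple_name.split("_") if t][0]
    return first_token.lower()


for name in ("GFASTATS", "MERQURYFK_MERQURYFK",
             "SANGERTOL_EAR:EAR:SANGER_TOL_BTK", "SANGER_TOL_CPRETEXT"):
    print(name, "->", publish_subdir(name))
```

Under this reading, `SANGER_TOL_BTK` and `SANGER_TOL_CPRETEXT` both resolve to `sanger`, so their outputs would land in the same directory under `--outdir`.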
7 changes: 7 additions & 0 deletions conf/sanger-tol-btk.config
@@ -0,0 +1,7 @@
process {
withLabel:RUN_BLASTN:BLASTN_TAXON {
cpus = { check_max( 12 * task.attempt, 'cpus' ) }
memory = { check_max( 10.GB * task.attempt, 'memory' ) }
time = { check_max( 16.h * task.attempt, 'time' ) }
}
}