Add disambiguate tool to QC portion of the weave pipeline #33

Merged: 9 commits, Jan 16, 2024
4 changes: 2 additions & 2 deletions README.md
@@ -2,9 +2,9 @@

<h1>weave 🔬</h1>

**_An awesome metagenomic and metatranscriptomics pipeline_**
**_An awesome demultiplexing and quality control pipeline_**

[![tests](https://github.com/OpenOmics/weave/workflows/tests/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/main.yaml) [![docs](https://github.com/OpenOmics/weave/workflows/docs/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/docs.yml) [![GitHub issues](https://img.shields.io/github/issues/OpenOmics/weave?color=brightgreen)](https://github.com/OpenOmics/weave/issues) [![GitHub license](https://img.shields.io/github/license/OpenOmics/weave)](https://github.com/OpenOmics/weave/blob/main/LICENSE)
[![tests](https://github.com/OpenOmics/weave/workflows/tests/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/main.yaml) [![docs](https://github.com/OpenOmics/weave/actions/workflows/doc.yml/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/docs.yml) [![GitHub issues](https://img.shields.io/github/issues/OpenOmics/weave?color=brightgreen)](https://github.com/OpenOmics/weave/issues) [![GitHub license](https://img.shields.io/github/license/OpenOmics/weave)](https://github.com/OpenOmics/weave/blob/main/LICENSE)

<i>
This is the home of the pipeline, weave. Its long-term goals: to provide accurate quantification, taxonomic classification, and functional profiling of assembled (bacteria and archaea) metagenomes!
6 changes: 3 additions & 3 deletions config/bigsky.json
@@ -3,17 +3,17 @@
"mounts": {
"kaiju": {
"to": "/opt/kaiju",
"from": "/gs1/RTS/OpenOmics/references/Dmux/kaiju/kaiju_db_nr_euk_2023-05-10",
"from": "/gs1/RTS/OpenOmics/references/weave/kaiju/kaiju_db_nr_euk_2023-05-10",
"mode": "ro"
},
"kraken2" : {
"to": "/opt/kraken2",
"from": "/gs1/RTS/OpenOmics/references/Dmux/kraken2/k2_pluspfp_20230605",
"from": "/gs1/RTS/OpenOmics/references/weave/kraken2/k2_pluspfp_20230605",
"mode": "ro"
},
"fastq_screen" : {
"to": "/fdb/fastq_screen/FastQ_Screen_Genomes",
"from": "/gs1/RTS/OpenOmics/references/Dmux/FastQ_Screen_Genomes",
"from": "/gs1/RTS/OpenOmics/references/weave/FastQ_Screen_Genomes",
"mode": "ro"
}
}
4 changes: 2 additions & 2 deletions config/biowulf.json
@@ -3,12 +3,12 @@
"mounts": {
"kaiju": {
"to": "/opt/kaiju",
"from": "/data/OpenOmics/references/Dmux/kaiju/kaiju_db_nr_euk_2023-05-10",
"from": "/data/OpenOmics/references/weave/kaiju/kaiju_db_nr_euk_2023-05-10",
"mode": "ro"
},
"kraken2" : {
"to": "/opt/kraken2",
"from": "/data/OpenOmics/references/Dmux/kraken2/k2_pluspfp_20230605",
"from": "/data/OpenOmics/references/weave/kraken2/k2_pluspfp_20230605",
"mode": "ro"
},
"fastq_screen" : {
2 changes: 2 additions & 0 deletions config/remote.json
@@ -1,6 +1,8 @@
{
"bcl2fastq": "docker://umccr/bcl2fastq:latest",
"weave": "docker://rroutsong/weave_ngsqc:0.0.1",
"bclconvert": "docker://rroutsong/weave_bclconvert:0.0.3",
"disambiguate": "docker://quay.io/biocontainers/ngs-disambiguate:2018.05.03--hf393df8_7",
"kraken": "https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_16gb_20231009.tar.gz",
"kaiju": "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_nr_euk_2023-05-10.tgz",
"fastq_screen": "filelist://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/genome_locations.txt"
5 changes: 4 additions & 1 deletion docker/NGS_QC/Dockerfile
@@ -3,10 +3,13 @@ ENV TZ=America/New_York
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update; apt-get upgrade -y bash && \
apt-get install -y fastqc multiqc fastp kraken2 bowtie2 libgd-graph-perl git curl libz-dev build-essential
apt-get install -y fastqc python-is-python3 rna-star bwa hisat2 python3-pip samtools multiqc fastp kraken2 bowtie2 libgd-graph-perl git curl libz-dev build-essential
RUN mkdir /opt2; curl -S -s -L https://github.com/StevenWingett/FastQ-Screen/archive/refs/tags/v0.15.3.tar.gz --output /opt2/v0.15.3.tar.gz
RUN cd /opt2 && tar xvf v0.15.3.tar.gz
RUN cd /opt2 && git clone https://github.com/bioinformatics-centre/kaiju.git
RUN cd /opt2/kaiju/src; make
RUN cd /opt2 && git clone https://github.com/AstraZeneca-NGS/disambiguate.git
RUN pip install pysam
RUN chmod +x /opt2/disambiguate/disambiguate.py; ln -sf /opt2/disambiguate/disambiguate.py /usr/bin/disambiguate.py
ENV PATH="${PATH}:/opt2/FastQ-Screen-0.15.3:/opt2/kaiju/bin"
ADD docker/NGS_QC/fastq_screen.conf /etc/fastq_screen.conf
38 changes: 38 additions & 0 deletions scripts/config.py
@@ -158,4 +158,42 @@ def get_bigsky_seq_dirs():
"seq": get_biowulf_seq_dirs(),
"profile": Path(Path(__file__).parent.parent, "utils", "profiles", "biowulf").resolve(),
}
}


GENOME_CONFIGS = {
"biowulf": {
"hg19": "/vf/users/OpenOmics/references/genomes/human/hg19/GRCh37.primary_assembly.genome.fa.gz",
"grch37": "/vf/users/OpenOmics/references/genomes/human/GRCh37/GRCh37.primary_assembly.genome.fa.gz",
"hg38": "/vf/users/OpenOmics/references/genomes/human/hg38/GRCh38.p14.genome.fa.gz",
"grch38": "/vf/users/OpenOmics/references/genomes/human/GRCh38/GRCh38.p14.genome.fa.gz",
"mm9": "/vf/users/OpenOmics/references/genomes/mouse/mm9/mm9.fa.gz",
"grcm37": "/vf/users/OpenOmics/references/genomes/mouse/GRCm37/mm9.fa.gz",
"mm10": "/vf/users/OpenOmics/references/genomes/mouse/mm10/GRCm38.p4.genome.fa.gz",
"grcm38": "/vf/users/OpenOmics/references/genomes/mouse/GRCm38/GRCm38.p4.genome.fa.gz",
"mm39": "/vf/users/OpenOmics/references/genomes/mouse/mm39/GRCm39.genome.fa.gz",
"grcm39": "/vf/users/OpenOmics/references/genomes/mouse/GRCm39/GRCm3s9.genome.fa.gz",
"rhemac10": "/vf/users/OpenOmics/references/genomes/rhesus/rhe10/rheMac10.fa.gz",
"mmul10": "/vf/users/OpenOmics/references/genomes/rhesus/mmul10/rheMac10.fa.gz",
"agm": "/vf/users/OpenOmics/references/genomes/agm/1.1/GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.fna.gz",
"mesaur": "/vf/users/OpenOmics/references/genomes/mesaur/2.0/GCF_017639785.1_BCM_Maur_2.0_genomic.fna.gz",
"cynomac": "/vf/users/OpenOmics/references/genomes/cynomac/v2/GCF_012559485.2_MFA1912RKSv2_genomic.fna.gz",
},
"bigsky": {
"hg19": "/gs1/RTS/OpenOmics/references/genomes/human/hg19/GRCh37.primary_assembly.genome.fa.gz",
"grch37": "/gs1/RTS/OpenOmics/references/genomes/human/GRCh37/GRCh37.primary_assembly.genome.fa.gz",
"hg38": "/gs1/RTS/OpenOmics/references/genomes/human/hg38/GRCh38.p14.genome.fa.gz",
"grch38": "/gs1/RTS/OpenOmics/references/genomes/human/GRCh38/GRCh38.p14.genome.fa.gz",
"mm9": "/gs1/RTS/OpenOmics/references/genomes/mouse/mm9/mm9.fa.gz",
"grcm37": "/gs1/RTS/OpenOmics/references/genomes/mouse/GRCm37/mm9.fa.gz",
"mm10": "/gs1/RTS/OpenOmics/references/genomes/mouse/mm10/GRCm38.p4.genome.fa.gz",
"grcm38": "/gs1/RTS/OpenOmics/references/genomes/mouse/GRCm38/GRCm38.p4.genome.fa.gz",
"mm39": "/gs1/RTS/OpenOmics/references/genomes/mouse/mm39/GRCm39.genome.fa.gz",
"grcm39": "/gs1/RTS/OpenOmics/references/genomes/mouse/GRCm39/GRCm39.genome.fa.gz",
"rhemac10": "/gs1/RTS/OpenOmics/references/genomes/rhesus/rhe10/rheMac10.fa.gz",
"mmul10": "/gs1/RTS/OpenOmics/references/genomes/rhesus/mmul10/rheMac10.fa.gz",
"agm": "/gs1/RTS/OpenOmics/references/genomes/agm/1.1/GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.fna.gz",
"mesaur": "/gs1/RTS/OpenOmics/references/genomes/mesaur/2.0/GCF_017639785.1_BCM_Maur_2.0_genomic.fna.gz",
"cynomac": "/gs1/RTS/OpenOmics/references/genomes/cynomac/v2/GCF_012559485.2_MFA1912RKSv2_genomic.fna.gz",
},
}
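
For reference, the alias lookup that the rest of this PR relies on is just a per-server dictionary access. A minimal sketch, assuming the scripts package is importable and the pipeline is running on biowulf or bigsky (both helpers already exist in scripts/config.py):

    from scripts.config import GENOME_CONFIGS, get_current_server

    server_genomes = GENOME_CONFIGS[get_current_server()]  # e.g. the "biowulf" table above
    hg38_fasta = server_genomes["hg38"]
    # -> "/vf/users/OpenOmics/references/genomes/human/hg38/GRCh38.p14.genome.fa.gz"
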
30 changes: 28 additions & 2 deletions scripts/files.py
@@ -3,12 +3,12 @@
# ~~~~~~~~~~~~~~~
# file system helper functions for the Dmux software package
# ~~~~~~~~~~~~~~~
from pathlib import Path
import xml.etree.ElementTree as ET
from pathlib import Path
from os import access as check_access, R_OK, W_OK
from functools import partial
from .samplesheet import IllumniaSampleSheet
from .config import get_current_server, LABKEY_CONFIGS, DIRECTORY_CONFIGS
from .config import get_current_server, GENOME_CONFIGS, DIRECTORY_CONFIGS


def get_all_seq_dirs(top_dir, server):
@@ -48,6 +48,32 @@ def valid_run_output(output_directory, dry_run=False):
return output_directory


def valid_fasta(suspect):
server_genomes = GENOME_CONFIGS[get_current_server()]
is_valid = False
if suspect.lower() in server_genomes:
is_valid = True
suspect = server_genomes[suspect.lower()]
else:
the_suspect = Path(suspect)
exts = the_suspect.suffixes
if any([
'.fna' in exts,
'.fa' in exts,
'.fasta' in exts,
'.fna' in exts and '.gz' in exts,
'.fa' in exts and '.gz' in exts,
'.fasta' in exts and '.gz' in exts,
]):
is_valid = True
suspect = str(Path(suspect).absolute())

if not is_valid:
raise ValueError(f"'{suspect}' is not a recognized genome alias or FASTA file")

return suspect


def get_all_staged_dirs(top_dir, server):
return list(filter(partial(is_dir_staged, server), get_all_seq_dirs(top_dir, server)))

42 changes: 40 additions & 2 deletions scripts/utils.py
@@ -8,14 +8,15 @@
import os
import yaml
import sys
import textwrap
from argparse import ArgumentTypeError
from dateutil.parser import parse as date_parser
from subprocess import Popen, PIPE, STDOUT
from pathlib import Path, PurePath

# ~~~ internals ~~~
from .files import parse_samplesheet, mk_or_pass_dirs
from .config import SNAKEFILE, DIRECTORY_CONFIGS, get_current_server, get_resource_config
from .config import SNAKEFILE, DIRECTORY_CONFIGS, GENOME_CONFIGS, get_current_server, get_resource_config


host = get_current_server()
@@ -40,6 +41,19 @@ def default(self, obj):
return super().default(obj)


def get_alias_table():
return textwrap.dedent("""Genome short name alias table:
+----------------+-------------------------------------------+
| Organism       | Genomes supported (aka)                   |
+----------------+-------------------------------------------+
| HUMAN          | hg19(grch37) / hg38(grch38)               |
+----------------+-------------------------------------------+
| MOUSE          | mm9(grcm37) / mm10(grcm38) / mm39(grcm39) |
+----------------+-------------------------------------------+
| RHESUS MACAQUE | RHEMAC10(mmul10)                          |
+----------------+-------------------------------------------+""")


def valid_runid(id_to_check):
'''
Given an input ID, get its validity against the run id format:
@@ -233,6 +247,9 @@ def exec_pipeline(configs, dry_run=False, local=False):
if not bclcon_log_dir.exists():
bclcon_log_dir.mkdir(mode=0o755, parents=True)
extra_to_mount.append(str(bclcon_log_dir) + ":" + "/var/log/bcl-convert:rw")
if this_config.get('disambiguate', False):
extra_to_mount.append(Path(this_config['host_genome']).parent)
extra_to_mount.append(Path(this_config['pathogen_genome']).parent)
singularity_binds = get_mounts(*extra_to_mount)
config_file = Path(this_config['out_to'], '.config', f'config_job_{str(i)}.json').absolute()
json.dump(this_config, open(config_file, 'w'), cls=PathJSONEncoder, indent=4)
@@ -275,4 +292,25 @@ def is_bclconvert(samplesheet):
check = False
if samplesheet.instrument in BCLCONVERT_INSTRUMENTS or samplesheet.platform in BCLCONVERT_PLATFORMS:
check = True
return check
return check


def valid_host_pathogen_genomes(host, pathogen):
g1, g2 = False, False
genomes = GENOME_CONFIGS[get_current_server()]

if Path(host).absolute().exists():
g1 = True
host = str(Path(host).absolute())

if Path(pathogen).absolute().exists():
g2 = True
pathogen = str(Path(pathogen).absolute())

if not all([g1, g2]):
if not g1:
raise ValueError('Host genome does not exist on the file system.')
if not g2:
raise ValueError('Pathogen genome does not exist on the file system.')

return host, pathogen
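
Roughly how this validator slots in behind the CLI flags (paths are illustrative; --host/--pathogen are already alias-resolved by files.valid_fasta before this runs):

    from scripts.utils import valid_host_pathogen_genomes

    host, pathogen = valid_host_pathogen_genomes(
        "/vf/users/OpenOmics/references/genomes/human/hg38/GRCh38.p14.genome.fa.gz",
        "/refs/my_pathogen.fa.gz",  # hypothetical pathogen reference
    )
    # both FASTAs must exist on disk; a missing file raises a ValueError naming which genome is absent
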
49 changes: 33 additions & 16 deletions weave
@@ -33,6 +33,22 @@ def run(args):
exec_config['bcl_files'].append(bcls)
exec_config['demux_data'].append(files.check_if_demuxed(rundir))

# ~~~ disambiguate genome configuration ~~~
if all([args.host, args.pathogen]):
if 'disambiguate' not in exec_config:
exec_config['disambiguate'] = []
if 'host_genome' not in exec_config:
exec_config['host_genome'] = []
if 'pathogen_genome' not in exec_config:
exec_config['pathogen_genome'] = []

utils.valid_host_pathogen_genomes(args.host, args.pathogen)
exec_config['disambiguate'].append(True)
exec_config['host_genome'].append(args.host)
exec_config['pathogen_genome'].append(args.pathogen)
else:
assert(not any([args.host, args.pathogen])), 'Must specify both a host and a pathogen genome!'

# ~~~ QC/QA configuration ~~~
exec_config['bclconvert'].append(utils.is_bclconvert(sample_sheet))
exec_config['run_ids'].append(rundir.name)
@@ -67,41 +83,42 @@ def unlock_dir(sub_args):

# ~~~~ argument parsing commands ~~~~
if __name__ == '__main__':
main_parser = argparse.ArgumentParser(prog='weave')
main_parser = argparse.ArgumentParser(prog='weave', formatter_class=argparse.RawTextHelpFormatter, epilog=utils.get_alias_table())
sub_parsers = main_parser.add_subparsers(help='run subcommand help')

parser_run = sub_parsers.add_parser('run')
parser_run = sub_parsers.add_parser('run', formatter_class=argparse.RawTextHelpFormatter, epilog=utils.get_alias_table())
parser_run.add_argument('rundir', metavar='<run directory>', nargs="+", type=utils.valid_run_input,
help='Full & complete run id (format YYMMDD_INSTRUMENTID_TIME_FLOWCELLID) or absolute paths')
help='Full & complete run id (format YYMMDD_INSTRUMENTID_TIME_FLOWCELLID) or absolute paths.')
parser_run.add_argument('-s', '--seq_dir', metavar='<sequencing directory>', default=None, type=str,
help='Root directory for sequencing data (defaults for biowulf/bigsky/locus), must contain directories ' + \
'matching run ids, if not using full paths.')
parser_run.add_argument('-o', '--output', metavar='<output directory>', default=None, type=str,
help='Top-level output directory for demultiplexing data (defaults to input directory + runid + "_demux")')
help='Top-level output directory for demultiplexing data (defaults to input directory + runid + "_demux").')
parser_run.add_argument('-d', '--dry-run', action='store_true',
help='Dry run the demultiplexing workflow')
help='Dry run the demultiplexing workflow.')
parser_run.add_argument('-n', '--noqc', action='store_false',
help='Do not run the QC/QA portion of the workflow (Default is on)')
help='Do not run the QC/QA portion of the workflow (Default is on).')
parser_run.add_argument('--sheetname', metavar='Sample Sheet Filename',
help='Name of the sample sheet file to look for (default is SampleSheet.csv)')
help='Name of the sample sheet file to look for (default is SampleSheet.csv).')
parser_run.add_argument('-l', '--local', action='store_true',
help='Execute pipeline locally without a dispatching executor')
# force endedness flags
endedness = parser_run.add_mutually_exclusive_group()
endedness.add_argument('--single_end', default=None, action='store_true',
help='Declare endedness of run (in cases in which it is not detectable from sample sheet), mutually exclusive to paired_end')
endedness.add_argument('--paired_end', default=None, action='store_true',
help='Declare endedness of run (in cases in which it is not detectable from sample sheet), mutually exclusive to single_end')
help='Execute pipeline locally without a dispatching executor.')

# disambiguate arguments
parser_run.add_argument('-t', '--host', type=files.valid_fasta, default=None,
help='Full path to host genome for disambiguate to use or short name genome alias.')

parser_run.add_argument('-p', '--pathogen', type=files.valid_fasta, default=None,
help='Full path to pathogen/graft/parasite genome for disambiguate to use or short name genome alias.')

parser_cache = sub_parsers.add_parser('cache')
parser_cache.add_argument('cachedir', metavar='<cache directory>', type=cache.valid_dir,
help='Relative or absolute path to directory for cache storage')
help='Relative or absolute path to directory for cache storage.')
parser_cache.add_argument('-l', '--local', action='store_true',
help='Execute pipeline locally without a dispatching executor')

parser_unlock = sub_parsers.add_parser('unlock')
parser_unlock.add_argument('unlockdir', metavar='<directory to unlock>', type=cache.valid_dir,
help='Relative or absolute path to directory for cache storage')
help='Full path to directory to unlock.')

parser_run.set_defaults(func = run)
parser_cache.set_defaults(func = get_cache)
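
With the new flags wired up, a disambiguate-enabled invocation would look roughly like the line below (run id and pathogen FASTA are placeholders; --host takes either a short-name alias from the epilog table or a full FASTA path, and the two flags must be given together):

    weave run 240110_A01234_0012_AHXXXXXXX --host hg38 --pathogen /refs/my_pathogen.fa.gz
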
9 changes: 9 additions & 0 deletions workflow/Snakefile
@@ -128,6 +128,15 @@ else:
if config["runqc"]:
all_outputs.extend(qa_qc_outputs)

if config.get('disambiguate', False):
all_outputs.extend(flatten([
expand("{out_to}/{project}/{sids}/disambiguate/{sids}.ambiguousSpeciesA.bam", **demux_expand_args),
expand("{out_to}/{project}/{sids}/disambiguate/{sids}.ambiguousSpeciesB.bam", **demux_expand_args),
expand("{out_to}/{project}/{sids}/disambiguate/{sids}.disambiguatedSpeciesA.bam", **demux_expand_args),
expand("{out_to}/{project}/{sids}/disambiguate/{sids}.disambiguatedSpeciesB.bam", **demux_expand_args),
expand("{out_to}/{project}/{sids}/disambiguate/{sids}_summary.txt", **demux_expand_args),
]))


rule all:
input:
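
For context on the new targets: each expand() call above fans the disambiguate outputs out per sample, underneath the demultiplexed project directory. A rough illustration with made-up values for demux_expand_args (the real dictionary is assembled earlier in the Snakefile, and expand() is provided by Snakemake itself):

    # hypothetical values, for illustration only
    demux_expand_args = dict(out_to="/scratch/240110_demux", project="ProjectA", sids=["S1", "S2"])
    expand("{out_to}/{project}/{sids}/disambiguate/{sids}_summary.txt", **demux_expand_args)
    # -> ["/scratch/240110_demux/ProjectA/S1/disambiguate/S1_summary.txt",
    #     "/scratch/240110_demux/ProjectA/S2/disambiguate/S2_summary.txt"]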