Merge branch 'support-assemblies' into 'master'
Support processing of assemblies + clean up folders after correct execution

See merge request tron/covigator-ngs-pipeline!5
Pablo Riesgo Ferreiro committed Jun 21, 2021
2 parents a0f51be + c80c422 commit 4020625
Showing 9 changed files with 933 additions and 270 deletions.
4 changes: 3 additions & 1 deletion .gitlab-ci.yml
@@ -3,7 +3,9 @@ image: openjdk:8u292-jre-buster

before_script:
- java -version
- apt-get update && apt-get --assume-yes install wget make procps
- apt-get update && apt-get --assume-yes install wget make procps software-properties-common
#- apt-get --assume-yes install python3 python3-pip
#- pip3 install biopython==1.76
- wget -qO- https://get.nextflow.io | bash && cp nextflow /usr/local/bin/nextflow
- nextflow help
- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
10 changes: 8 additions & 2 deletions Makefile
@@ -8,6 +8,7 @@ clean:

test:
nextflow main.nf --help
nextflow main.nf -profile test,conda --initialize
nextflow main.nf -profile test,conda --name ERR4145453 \
--output output/test1 \
--fastq1 test_data/ERR4145453_1.fastq.gz \
@@ -20,6 +21,10 @@ test:
--fastq1 test_data/ERR4145453_1.fastq.gz \
--fastq2 test_data/ERR4145453_2.fastq.gz \
--keep_intermediate
nextflow main.nf -profile test,conda --name hCoV-19_NTXX \
--output output/test4 \
--fasta test_data/hCoV-19_NTXX.fasta
#python3 -m unittest bin/test_assembly_variant_caller.py

check:
test -s output/test1/ERR4145453/ERR4145453.bcftools.normalized.annotated.vcf.gz || { echo "Missing test 1 VCF output file!"; exit 1; }
@@ -38,5 +43,6 @@ check:
test -s output/test3/ERR4145453/ERR4145453.gatk.normalized.annotated.vcf.gz || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.lofreq.normalized.annotated.vcf.gz || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.ivar.tsv || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.fastp_stats.json || { echo "Missing test 2 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.fastp_stats.html || { echo "Missing test 2 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.fastp_stats.json || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.fastp_stats.html || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test4/hCoV-19_NTXX/hCoV-19_NTXX.assembly.normalized.annotated.vcf.gz || { echo "Missing test 4 VCF output file!"; exit 1; }
49 changes: 39 additions & 10 deletions README.md
@@ -1,10 +1,11 @@
# Covigator NGS pipeline
# Covigator pipeline

[![DOI](https://zenodo.org/badge/374669617.svg)](https://zenodo.org/badge/latestdoi/374669617)

The Covigator NGS pipeline process SARS-CoV-2 FASTQ files into analysis ready VCF files. The pipeline is implemented in the Nextflow framework (Di Tommaso, 2017).
The Covigator pipeline processes SARS-CoV-2 FASTQ or FASTA files into annotated and normalized, analysis-ready VCF files.
The pipeline is implemented in the Nextflow framework (Di Tommaso, 2017).

The pipeline includes the following steps:
When FASTQ files are provided, the pipeline includes the following steps:
- **Trimming**. `fastp` is used to trim reads with default values.
- **Alignment**. `BWA mem` is used for the alignment of single or paired end samples.
- **BAM preprocessing**. BAM files are prepared and duplicate reads are marked using GATK and Picard tools.
@@ -13,7 +14,20 @@ The pipeline includes the following steps:
- **Variant normalization**. `bcftools norm` and `vt` tools are employed to left align indels, trim variant calls and remove variant duplicates.
- **Variant consequence annotation**. `SnpEff` is employed to annotate the consequences of variants.

The alignment, BAM preprocessing and variant normalization pipelines were implemented in additional Nextflow pipelines within the TronFlow initiative.
Both single-end and paired-end FASTQ files are supported.

When a FASTA file with a single assembly sequence is provided, the pipeline includes the following steps:
- **Variant calling**. A global pairwise alignment is performed against the reference sequence to call SNVs and
indels (a usage sketch is shown below). Indels longer than 50 bp and indels at the beginning or end of the assembly
sequence are excluded. Any mutation where either the reference or the assembly contains an N is excluded.
- **Variant normalization**. `bcftools norm` and `vt` tools are employed to left align indels, trim variant calls and remove variant duplicates.
- **Variant consequence annotation**. `SnpEff` is employed to annotate the consequences of variants.

The FASTA file is expected to contain a single assembly sequence.
Bear in mind that only clonal variants can be called on the assembly.
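
As an illustration only, the assembly variant caller shipped in `bin/assembly_variant_caller.py` can also be used
programmatically. The snippet below is a minimal sketch with toy sequences and assumes that `bin/` is on the Python path:
```
from assembly_variant_caller import AssemblyVariantCaller  # assumes bin/ is on the Python path

# toy sequences for illustration only; real inputs are complete SARS-CoV-2 assemblies
reference = "ACGTACGTACGT"
assembly = "ACGTCCGTACGT"  # single mismatch at 0-based position 4

caller = AssemblyVariantCaller()
variants = caller.call_variants(sequence=assembly, reference=reference)
for variant in variants:
    # to_vcf_line() reports a 1-based position against the MN908947.3 reference
    print("\t".join(variant.to_vcf_line()))
```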

The alignment, BAM preprocessing and variant normalization steps are implemented as separate Nextflow pipelines
within the TronFlow initiative.
The full details are available in their respective repositories:
- https://github.com/TRON-Bioinformatics/tronflow-bwa (https://doi.org/10.5281/zenodo.4722852)
- https://github.com/TRON-Bioinformatics/tronflow-bam-preprocessing (https://doi.org/10.5281/zenodo.4810918)
@@ -44,17 +58,27 @@ variants with a VAF < 20 % are considered `LOW_FREQUENCY` and variants with a VA

## How to run it

If you are going to use the conda environments, first initialize them by running:
```
nextflow main.nf -profile conda --initialize
```

This will create the necessary conda environments under `work/conda`.
This initialization is required under every work folder when using more than one variant caller.

Then run the application as follows:
```
$ nextflow run tron-bioinformatics/covigator-ngs-pipeline -profile conda --help
Usage:
nextflow run tron-bioinformatics/covigator-ngs-pipeline -profile conda --help
Input:
* --fastq1: the first input FASTQ file
* --fastq1: the first input FASTQ file (not compatible with --fasta)
* --fasta: the FASTA file containing the assembly sequence (not compatible with --fastq1)
* --name: the sample name, output files will be named after this name
* --reference: the reference genome FASTA file, *.fai, *.dict and bwa indexes are required.
* --gff: the GFFv3 gene annotations file
* --gff: the GFFv3 gene annotations file (only optional with --fastq1)
* --output: the folder where to publish output
Optional input:
@@ -63,12 +87,17 @@ Optional input:
* --min_mapping_quality: minimum mapping quality to take a read into account (default: 20)
* --low_frequency_variant_threshold: VAF threshold to mark a variant as low frequency (default: 0.2)
* --subclonal_variant_threshold: VAF superior threshold to mark a variant as subclonal (default: 0.8)
* --strand_bias_threshold: threshold for the strand bias test Phred score (default: 20)
* --memory: the amount of memory used by each job (default: 3g)
* --cpus: the number of CPUs used by each job (default: 1)
* --initialize: start the initialization of the conda environments
* --skip_lofreq: skips calling variants with LoFreq
* --skip_gatk: skips calling variants with GATK
* --skip_bcftools: skips calling variants with BCFtools
* --skip_ivar: skips calling variants with iVar
Output:
* Output a normalized, phased and annotated VCF file for each of BCFtools, GATK and LoFreq
* Output a normalized, phased and annotated VCF file for each of BCFtools, GATK and LoFreq when FASTQ files are
provided, or a single VCF obtained from a global alignment when a FASTA file is provided
* Output a TSV file output from iVar
```

@@ -95,9 +124,9 @@ A workaround to this situation is to clone the tronflow dependencies and let cov
For instance:
```
cd /covigator/dependencies
git clone --branch v1.4.0 https://github.com/TRON-Bioinformatics/tronflow-bwa.git
git clone --branch v1.4.1 https://github.com/TRON-Bioinformatics/tronflow-bwa.git
git clone --branch v1.5.0 https://github.com/TRON-Bioinformatics/tronflow-bam-preprocessing.git
git clone --branch v1.1.0 https://github.com/TRON-Bioinformatics/tronflow-variant-normalization.git
git clone --branch v1.1.1 https://github.com/TRON-Bioinformatics/tronflow-variant-normalization.git
```

And then use the following parameters:
127 changes: 127 additions & 0 deletions bin/assembly_variant_caller.py
@@ -0,0 +1,127 @@
#!/usr/bin/env python
import os
from argparse import ArgumentParser
from dataclasses import dataclass
from Bio import Align, SeqIO
from Bio.Align import PairwiseAlignment
from typing import List

CHROMOSOME = "MN908947.3"


@dataclass
class Variant:
position: int
reference: str
alternate: str

def to_vcf_line(self):
# transform 0-based position to 1-based position
return CHROMOSOME, str(self.position + 1), ".", self.reference, self.alternate, ".", "PASS", "."


class AssemblyVariantCaller:

def call_variants(self, sequence: str, reference: str) -> List[Variant]:
alignment = self._run_alignment(sequence=sequence, reference=reference)
variants = self._call_mutations(alignment)
return variants

def _run_alignment(self, sequence: str, reference: str) -> PairwiseAlignment:
aligner = Align.PairwiseAligner()
aligner.mode = 'global'
aligner.match = 2
aligner.mismatch = -1
aligner.open_gap_score = -3
aligner.extend_gap_score = -0.1
aligner.target_end_gap_score = 0.0
aligner.query_end_gap_score = 0.0
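        # zero end-gap scores make the alignment effectively semi-global:
        # gaps at either end of the reference or the assembly are not penalized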
alignments = aligner.align(reference, sequence)
return alignments[0]

def _call_mutations(self, alignment: PairwiseAlignment) -> List[Variant]:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
# MN908947.3 9924 . C T 228 .
# DP=139;VDB=0.784386;SGB=-0.693147;RPB=0.696296;MQB=1;MQSB=1;BQB=0.740741;MQ0F=0;AC=1;AN=1;DP4=2,0,123,12;MQ=60
# GT:PL 1:255,0
alternate = alignment.query
reference = alignment.target

variants = []
prev_ref_end = None
prev_alt_end = None
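        # alignment.aligned holds the coordinates of the aligned blocks as two tuples of
        # (start, end) pairs, one for the target (reference) and one for the query (assembly);
        # a jump between consecutive blocks on either side corresponds to an indel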
for (ref_start, ref_end), (alt_start, alt_end) in zip(alignment.aligned[0], alignment.aligned[1]):
# calls indels
# NOTE: it does not call indels at beginning and end of sequence
if prev_ref_end is not None and prev_ref_end != ref_start:
# deletion
if ref_start - prev_ref_end <= 50: # skips deletions longer than 50 bp
ref = reference[prev_ref_end - 1: ref_start]
if 'N' not in ref: # do not call deletions with Ns
variants.append(Variant(
position=prev_ref_end - 1,
reference=ref,
alternate=reference[prev_ref_end - 1]))
elif prev_ref_end is not None and prev_alt_end != alt_start:
# insertion
if alt_start - prev_alt_end <= 50: # skips insertions longer than 50 bp
ref = reference[prev_ref_end - 1]
alt = alternate[prev_alt_end:alt_start]
if ref != 'N' and 'N' not in alt: # do not call insertions with Ns
variants.append(Variant(
position=prev_ref_end - 1,
reference=ref,
alternate=ref + alt))

# calls SNVs
for pos, ref, alt in zip(
range(ref_start, ref_end), reference[ref_start: ref_end], alternate[alt_start: alt_end]):
# contiguous SNVs are reported separately
if ref != alt and ref != 'N' and alt != 'N': # do not call SNVs on Ns
variants.append(Variant(position=pos, reference=ref, alternate=alt))

prev_ref_end = ref_end
prev_alt_end = alt_end

return variants


def write_vcf(mutations, output_vcf):
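    # writes a minimal VCFv4.0 file with a fixed header for the SARS-CoV-2 reference MN908947.3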
with open(output_vcf, "w") as vcf_out:
header = (
"##fileformat=VCFv4.0",
"##FILTER=<ID=PASS,Description=\"All filters passed\">",
"##contig=<ID=MN908947.3>",
"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
)
for row in header:
vcf_out.write(row + "\n")
for row in mutations:
vcf_out.write("\t".join(row.to_vcf_line()) + "\n")


def main():
    parser = ArgumentParser(description="Assembly variant caller: calls variants from a single assembly sequence against a reference")
parser.add_argument("--fasta", dest="fasta",
help="The fasta file with the query sequence. Only one sequence is expected",
required=True)
parser.add_argument("--reference", dest="reference",
help="The fasta file with the reference sequence. Only one sequence is expected",
required=True)
parser.add_argument("--output-vcf", dest="output_vcf",
help="The path to the output VCF",
required=True)
args = parser.parse_args()

assert os.path.exists(args.fasta), "Fasta file {} does not exist!".format(args.fasta)
    assert os.path.exists(args.reference), "Reference file {} does not exist!".format(args.reference)

query = next(SeqIO.parse(args.fasta, "fasta"))
reference = next(SeqIO.parse(args.reference, "fasta"))
variant_caller = AssemblyVariantCaller()
    # convert Bio.Seq sequences to plain strings, matching the declared str signature of call_variants()
    variants = variant_caller.call_variants(sequence=str(query.seq), reference=str(reference.seq))
write_vcf(mutations=variants, output_vcf=args.output_vcf)


if __name__ == '__main__':
main()
51 changes: 51 additions & 0 deletions bin/test_assembly_variant_caller.py
@@ -0,0 +1,51 @@
from unittest import TestCase

from .assembly_variant_caller import AssemblyVariantCaller


class TestAssemblyVariantCaller(TestCase):

def test_assembly_variant_caller(self):
caller = AssemblyVariantCaller()
# no mutations
variants = caller.call_variants(sequence="ACGTACGT", reference="ACGTACGT")
self.assertEqual(len(variants), 0)
# SNV
variants = caller.call_variants(sequence="ACGTCCGT", reference="ACGTACGT")
self.assertEqual(len(variants), 1)
snv = variants[0]
self.assertEqual(snv.reference, "A")
self.assertEqual(snv.alternate, "C")
self.assertEqual(snv.position, 4)
# deletion
variants = caller.call_variants(
reference="CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGGAGGCAAAGTGACG",
sequence="CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGGCAAAGTGACG")
self.assertEqual(len(variants), 1)
snv = variants[0]
self.assertEqual(snv.reference, "TGGA")
self.assertEqual(snv.alternate, "T")
self.assertEqual(snv.position, 44)
# insertion
variants = caller.call_variants(
sequence= "CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGCCCGAGGCAAAGTGACG",
reference="CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGGAGGCAAAGTGACG")
self.assertEqual(len(variants), 1)
snv = variants[0]
self.assertEqual(snv.reference, "G")
self.assertEqual(snv.alternate, "GCCC")
self.assertEqual(snv.position, 45)
# another insertion
variants = caller.call_variants(
sequence= "CTGGTGTGAGTCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGCCCGAGGCAAAGTGACG",
reference="CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGGAGGCAAAGTGACG")
self.assertEqual(len(variants), 2)
snv = variants[1]
self.assertEqual(snv.reference, "G")
self.assertEqual(snv.alternate, "GCCC")
self.assertEqual(snv.position, 45)
snv = variants[0]
self.assertEqual(snv.reference, "G")
self.assertEqual(snv.alternate, "GT")
self.assertEqual(snv.position, 9)

4 changes: 3 additions & 1 deletion environment.yml
@@ -1,11 +1,13 @@
# You can use this file to create a conda environment for this pipeline:
# conda env create -f environment.yml
name: covigator-ngs-pipeline
name: covigator-pipeline
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- conda-forge::python=3.8.5
- conda-forge::biopython=1.76
- bioconda::nextflow=21.04.0
- bioconda::bcftools=1.12
- bioconda::lofreq=2.1.5