Merge branch 'support-assemblies' into 'master'
Support processing of assemblies + clean up folders after correct execution

See merge request tron/covigator-ngs-pipeline!5
Pablo Riesgo Ferreiro committed Jun 21, 2021
2 parents a0f51be + c80c422 commit 4020625
Showing 9 changed files with 933 additions and 270 deletions.
4 changes: 3 additions & 1 deletion .gitlab-ci.yml
@@ -3,7 +3,9 @@ image: openjdk:8u292-jre-buster

before_script:
- java -version
- apt-get update && apt-get --assume-yes install wget make procps
- apt-get update && apt-get --assume-yes install wget make procps software-properties-common
#- apt-get --assume-yes install python3 python3-pip
#- pip3 install biopython==1.76
- wget -qO- https://get.nextflow.io | bash && cp nextflow /usr/local/bin/nextflow
- nextflow help
- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
10 changes: 8 additions & 2 deletions Makefile
@@ -8,6 +8,7 @@ clean:

test:
nextflow main.nf --help
nextflow main.nf -profile test,conda --initialize
nextflow main.nf -profile test,conda --name ERR4145453 \
--output output/test1 \
--fastq1 test_data/ERR4145453_1.fastq.gz \
@@ -20,6 +21,10 @@ test:
--fastq1 test_data/ERR4145453_1.fastq.gz \
--fastq2 test_data/ERR4145453_2.fastq.gz \
--keep_intermediate
nextflow main.nf -profile test,conda --name hCoV-19_NTXX \
--output output/test4 \
--fasta test_data/hCoV-19_NTXX.fasta
#python3 -m unittest bin/test_assembly_variant_caller.py

check:
test -s output/test1/ERR4145453/ERR4145453.bcftools.normalized.annotated.vcf.gz || { echo "Missing test 1 VCF output file!"; exit 1; }
@@ -38,5 +43,6 @@ check:
test -s output/test3/ERR4145453/ERR4145453.gatk.normalized.annotated.vcf.gz || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.lofreq.normalized.annotated.vcf.gz || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.ivar.tsv || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.fastp_stats.json || { echo "Missing test 2 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.fastp_stats.html || { echo "Missing test 2 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.fastp_stats.json || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test3/ERR4145453/ERR4145453.fastp_stats.html || { echo "Missing test 3 VCF output file!"; exit 1; }
test -s output/test4/hCoV-19_NTXX/hCoV-19_NTXX.assembly.normalized.annotated.vcf.gz || { echo "Missing test 4 VCF output file!"; exit 1; }
49 changes: 39 additions & 10 deletions README.md
@@ -1,10 +1,11 @@
# Covigator NGS pipeline
# Covigator pipeline

[![DOI](https://zenodo.org/badge/374669617.svg)](https://zenodo.org/badge/latestdoi/374669617)

The Covigator NGS pipeline process SARS-CoV-2 FASTQ files into analysis ready VCF files. The pipeline is implemented in the Nextflow framework (Di Tommaso, 2017).
The Covigator pipeline processes SARS-CoV-2 FASTQ or FASTA files into annotated and normalized, analysis-ready VCF files.
The pipeline is implemented in the Nextflow framework (Di Tommaso, 2017).

The pipeline includes the following steps:
When FASTQ files are provided, the pipeline includes the following steps:
- **Trimming**. `fastp` is used to trim reads with default values.
- **Alignment**. `BWA mem` is used for the alignment of single or paired end samples.
- **BAM preprocessing**. BAM files are prepared and duplicate reads are marked using GATK and Picard tools.
@@ -13,7 +14,20 @@ The pipeline includes the following steps:
- **Variant normalization**. `bcftools norm` and `vt` tools are employed to left align indels, trim variant calls and remove variant duplicates.
- **Variant consequence annotation**. `SnpEff` is employed to annotate the consequences of variants.

The alignment, BAM preprocessing and variant normalization pipelines were implemented in additional Nextflow pipelines within the TronFlow initiative.
Both single-end and paired-end FASTQ files are supported.

When a FASTA file with a single assembly sequence is provided, the pipeline includes the following steps:
- **Variant calling**. A global pairwise alignment is performed against the reference sequence to call SNVs and
indels (a usage sketch is shown below). Indels longer than 50 bp and indels at the beginning or end of the assembly
sequence are excluded. Any mutation where either the reference or the assembly contains an N is excluded.
- **Variant normalization**. `bcftools norm` and `vt` tools are employed to left align indels, trim variant calls and remove variant duplicates.
- **Variant consequence annotation**. `SnpEff` is employed to annotate the consequences of variants.

The FASTA file is expected to contain a single assembly sequence.
Bear in mind that only clonal variants can be called on the assembly.
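
As an illustration only, the assembly variant caller shipped in `bin/assembly_variant_caller.py` can also be used
programmatically. The snippet below is a minimal sketch with toy sequences and assumes that `bin/` is on the Python path:
```
from assembly_variant_caller import AssemblyVariantCaller  # assumes bin/ is on the Python path

# toy sequences for illustration only; real inputs are complete SARS-CoV-2 assemblies
reference = "ACGTACGTACGT"
assembly = "ACGTCCGTACGT"  # single mismatch at 0-based position 4

caller = AssemblyVariantCaller()
variants = caller.call_variants(sequence=assembly, reference=reference)
for variant in variants:
    # to_vcf_line() reports a 1-based position against the MN908947.3 reference
    print("\t".join(variant.to_vcf_line()))
```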

The alignment, BAM preprocessing and variant normalization steps are implemented as separate Nextflow pipelines
within the TronFlow initiative.
The full details are available in their respective repositories:
- https://github.com/TRON-Bioinformatics/tronflow-bwa (https://doi.org/10.5281/zenodo.4722852)
- https://github.com/TRON-Bioinformatics/tronflow-bam-preprocessing (https://doi.org/10.5281/zenodo.4810918)
@@ -44,17 +58,27 @@ variants with a VAF < 20 % are considered `LOW_FREQUENCY` and variants with a VA

## How to run it

If you are going to use the conda environments, first initialize them by running:
```
nextflow main.nf -profile conda --initialize
```

This will create the necessary conda environments under `work/conda`.
This initialization is required under every work folder when using more than one variant caller.

Then run the application as follows:
```
$ nextflow run tron-bioinformatics/covigator-ngs-pipeline -profile conda --help
Usage:
nextflow run tron-bioinformatics/covigator-ngs-pipeline -profile conda --help
Input:
* --fastq1: the first input FASTQ file
* --fastq1: the first input FASTQ file (not compatible with --fasta)
* --fasta: the FASTA file containing the assembly sequence (not compatible with --fastq1)
* --name: the sample name, output files will be named after this name
* --reference: the reference genome FASTA file, *.fai, *.dict and bwa indexes are required.
* --gff: the GFFv3 gene annotations file
* --gff: the GFFv3 gene annotations file (only optional with --fastq1)
* --output: the folder where to publish output
Optional input:
@@ -63,12 +87,17 @@ Optional input:
* --min_mapping_quality: minimum mapping quality to take a read into account (default: 20)
* --low_frequency_variant_threshold: VAF threshold to mark a variant as low frequency (default: 0.2)
* --subclonal_variant_threshold: VAF superior threshold to mark a variant as subclonal (default: 0.8)
* --strand_bias_threshold: threshold for the strand bias test Phred score (default: 20)
* --memory: the amount of memory used by each job (default: 3g)
* --cpus: the number of CPUs used by each job (default: 1)
* --initialize: start the initialization of the conda environments
* --skip_lofreq: skips calling variants with LoFreq
* --skip_gatk: skips calling variants with GATK
* --skip_bcftools: skips calling variants with BCFtools
* --skip_ivar: skips calling variants with iVar
Output:
* Output a normalized, phased and annotated VCF file for each of BCFtools, GATK and LoFreq
* Output a normalized, phased and annotated VCF file for each of BCFtools, GATK and LoFreq when FASTQ files are
provided, or a single VCF obtained from a global alignment when a FASTA file is provided
* Output a TSV file output from iVar
```

@@ -95,9 +124,9 @@ A workaround to this situation is to clone the tronflow dependencies and let cov
For instance:
```
cd /covigator/dependencies
git clone --branch v1.4.0 https://github.com/TRON-Bioinformatics/tronflow-bwa.git
git clone --branch v1.4.1 https://github.com/TRON-Bioinformatics/tronflow-bwa.git
git clone --branch v1.5.0 https://github.com/TRON-Bioinformatics/tronflow-bam-preprocessing.git
git clone --branch v1.1.0 https://github.com/TRON-Bioinformatics/tronflow-variant-normalization.git
git clone --branch v1.1.1 https://github.com/TRON-Bioinformatics/tronflow-variant-normalization.git
```

And then use the following parameters:
127 changes: 127 additions & 0 deletions bin/assembly_variant_caller.py
@@ -0,0 +1,127 @@
#!/usr/bin/env python
import os
from argparse import ArgumentParser
from dataclasses import dataclass
from Bio import Align, SeqIO
from Bio.Align import PairwiseAlignment
from typing import List

CHROMOSOME = "MN908947.3"


@dataclass
class Variant:
position: int
reference: str
alternate: str

def to_vcf_line(self):
# transform 0-based position to 1-based position
return CHROMOSOME, str(self.position + 1), ".", self.reference, self.alternate, ".", "PASS", "."


class AssemblyVariantCaller:

def call_variants(self, sequence: str, reference: str) -> List[Variant]:
alignment = self._run_alignment(sequence=sequence, reference=reference)
variants = self._call_mutations(alignment)
return variants

def _run_alignment(self, sequence: str, reference: str) -> PairwiseAlignment:
aligner = Align.PairwiseAligner()
aligner.mode = 'global'
aligner.match = 2
aligner.mismatch = -1
aligner.open_gap_score = -3
aligner.extend_gap_score = -0.1
aligner.target_end_gap_score = 0.0
aligner.query_end_gap_score = 0.0
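        # zero end-gap scores make the alignment effectively semi-global:
        # gaps at either end of the reference or the assembly are not penalized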
alignments = aligner.align(reference, sequence)
return alignments[0]

def _call_mutations(self, alignment: PairwiseAlignment) -> List[Variant]:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
# MN908947.3 9924 . C T 228 .
# DP=139;VDB=0.784386;SGB=-0.693147;RPB=0.696296;MQB=1;MQSB=1;BQB=0.740741;MQ0F=0;AC=1;AN=1;DP4=2,0,123,12;MQ=60
# GT:PL 1:255,0
alternate = alignment.query
reference = alignment.target

variants = []
prev_ref_end = None
prev_alt_end = None
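        # alignment.aligned holds the coordinates of the aligned blocks as two tuples of
        # (start, end) pairs, one for the target (reference) and one for the query (assembly);
        # a jump between consecutive blocks on either side corresponds to an indel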
for (ref_start, ref_end), (alt_start, alt_end) in zip(alignment.aligned[0], alignment.aligned[1]):
# calls indels
# NOTE: it does not call indels at beginning and end of sequence
if prev_ref_end is not None and prev_ref_end != ref_start:
# deletion
if ref_start - prev_ref_end <= 50: # skips deletions longer than 50 bp
ref = reference[prev_ref_end - 1: ref_start]
if 'N' not in ref: # do not call deletions with Ns
variants.append(Variant(
position=prev_ref_end - 1,
reference=ref,
alternate=reference[prev_ref_end - 1]))
elif prev_ref_end is not None and prev_alt_end != alt_start:
# insertion
if alt_start - prev_alt_end <= 50: # skips insertions longer than 50 bp
ref = reference[prev_ref_end - 1]
alt = alternate[prev_alt_end:alt_start]
if ref != 'N' and 'N' not in alt: # do not call insertions with Ns
variants.append(Variant(
position=prev_ref_end - 1,
reference=ref,
alternate=ref + alt))

# calls SNVs
for pos, ref, alt in zip(
range(ref_start, ref_end), reference[ref_start: ref_end], alternate[alt_start: alt_end]):
# contiguous SNVs are reported separately
if ref != alt and ref != 'N' and alt != 'N': # do not call SNVs on Ns
variants.append(Variant(position=pos, reference=ref, alternate=alt))

prev_ref_end = ref_end
prev_alt_end = alt_end

return variants


def write_vcf(mutations, output_vcf):
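    # writes a minimal VCFv4.0 file with a fixed header for the SARS-CoV-2 reference MN908947.3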
with open(output_vcf, "w") as vcf_out:
header = (
"##fileformat=VCFv4.0",
"##FILTER=<ID=PASS,Description=\"All filters passed\">",
"##contig=<ID=MN908947.3>",
"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
)
for row in header:
vcf_out.write(row + "\n")
for row in mutations:
vcf_out.write("\t".join(row.to_vcf_line()) + "\n")


def main():
    parser = ArgumentParser(description="Assembly variant caller: calls variants from a single assembly sequence against a reference")
parser.add_argument("--fasta", dest="fasta",
help="The fasta file with the query sequence. Only one sequence is expected",
required=True)
parser.add_argument("--reference", dest="reference",
help="The fasta file with the reference sequence. Only one sequence is expected",
required=True)
parser.add_argument("--output-vcf", dest="output_vcf",
help="The path to the output VCF",
required=True)
args = parser.parse_args()

assert os.path.exists(args.fasta), "Fasta file {} does not exist!".format(args.fasta)
    assert os.path.exists(args.reference), "Reference file {} does not exist!".format(args.reference)

query = next(SeqIO.parse(args.fasta, "fasta"))
reference = next(SeqIO.parse(args.reference, "fasta"))
variant_caller = AssemblyVariantCaller()
    # convert Bio.Seq sequences to plain strings, matching the declared str signature of call_variants()
    variants = variant_caller.call_variants(sequence=str(query.seq), reference=str(reference.seq))
write_vcf(mutations=variants, output_vcf=args.output_vcf)


if __name__ == '__main__':
main()
51 changes: 51 additions & 0 deletions bin/test_assembly_variant_caller.py
@@ -0,0 +1,51 @@
from unittest import TestCase

from .assembly_variant_caller import AssemblyVariantCaller


class TestAssemblyVariantCaller(TestCase):

def test_assembly_variant_caller(self):
caller = AssemblyVariantCaller()
# no mutations
variants = caller.call_variants(sequence="ACGTACGT", reference="ACGTACGT")
self.assertEqual(len(variants), 0)
# SNV
variants = caller.call_variants(sequence="ACGTCCGT", reference="ACGTACGT")
self.assertEqual(len(variants), 1)
snv = variants[0]
self.assertEqual(snv.reference, "A")
self.assertEqual(snv.alternate, "C")
self.assertEqual(snv.position, 4)
# deletion
variants = caller.call_variants(
reference="CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGGAGGCAAAGTGACG",
sequence="CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGGCAAAGTGACG")
self.assertEqual(len(variants), 1)
snv = variants[0]
self.assertEqual(snv.reference, "TGGA")
self.assertEqual(snv.alternate, "T")
self.assertEqual(snv.position, 44)
# insertion
variants = caller.call_variants(
sequence= "CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGCCCGAGGCAAAGTGACG",
reference="CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGGAGGCAAAGTGACG")
self.assertEqual(len(variants), 1)
snv = variants[0]
self.assertEqual(snv.reference, "G")
self.assertEqual(snv.alternate, "GCCC")
self.assertEqual(snv.position, 45)
# another insertion
variants = caller.call_variants(
sequence= "CTGGTGTGAGTCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGCCCGAGGCAAAGTGACG",
reference="CTGGTGTGAGCCTGGTCACCAGGGTGGTAGGACAGACCCTCCTCTGGAGGCAAAGTGACG")
self.assertEqual(len(variants), 2)
snv = variants[1]
self.assertEqual(snv.reference, "G")
self.assertEqual(snv.alternate, "GCCC")
self.assertEqual(snv.position, 45)
snv = variants[0]
self.assertEqual(snv.reference, "G")
self.assertEqual(snv.alternate, "GT")
self.assertEqual(snv.position, 9)

4 changes: 3 additions & 1 deletion environment.yml
@@ -1,11 +1,13 @@
# You can use this file to create a conda environment for this pipeline:
# conda env create -f environment.yml
name: covigator-ngs-pipeline
name: covigator-pipeline
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- conda-forge::python=3.8.5
- conda-forge::biopython=1.76
- bioconda::nextflow=21.04.0
- bioconda::bcftools=1.12
- bioconda::lofreq=2.1.5