COVID-19 variant calling pipeline

This repository contains workflows for processing SARS-CoV-2 data.

The FASTQ-based workflows produce variant calls, assembled genomes, and lineage assignments from raw sequencing reads. The FASTQ-based workflows are able to process reads originating from PacBio, Oxford Nanopore (single-end), and Illumina (paired-end) sequencing data.

An assembly-based workflow which calculates pseudo-variant sites and assigns lineage using assembled SARS-CoV-2 genomes is also included.

Docker image definitions can be explored in DNAstack's image repository, or on Dockerhub.

Workflows

FASTQ-based workflows

These workflows can be used to process FASTQ files into variant calls, assembled genomes, and lineage metadata.

Choose the workflow that corresponds to your sequencing data type.

The inputs and outputs for each of the FASTQ-based workflows is outlined in detail in the repository for that workflow, linked below (PacBio, Illumina, Oxford Nanopore). In addition to the output files specified in those repositories, each of the FASTQ-based workflows also outputs a file containing lineage metadata calculated using Pangolin for the assembled genome that is produced during workflow execution.

PacBio

This workflow uses PacBio's CoSA pipeline to process Pacific Biosciences SARS-CoV-2 long read HiFi data.

Illumina

This workflow uses the SIGNAL pipeline to process Illumina paired-end SARS-CoV-2 sequencing data.

Oxford Nanopore

This workflow uses the Connor lab's implementation of the ARTIC pipeline to process Oxford Nanopore single-ended SARS-CoV-2 sequencing data.

Assembly-based workflow

The variants from assembly workflow can be used to determine 'pseudo-variant' sites when no raw sequencing data is available. This workflow aligns the provided assembled SARS-CoV-2 genome to the reference genome, then uses snp-sites to determine sites that differ from the reference. Variant sites are output in VCF format. Viral lineage is assigned using Pangolin, as in the FASTQ-based workflows.

N.B. that since base quality information is not available when using an assembly alone to call variants, these variant sites cannot be filtered based on quality and should be used for exploratory analysis only. In addition, indels cannot be called using this method. Prefer the FASTQ-based workflows when raw sequencing data is available.

Workflow inputs

Input	Description
`accession`	Sample ID
`assembly`	Assembled SARS-CoV-2 genome
`reference_genome`	The SARS-CoV-2 reference genome
`reference_genome_id`	[`MN908947.3`]
`container_registry`	Registry that hosts workflow containers. All containers are hosted in DNAstack's Dockerhub [`dnastack`]

Workflow outputs

Output	Description
`vcf`, `vcf_index`	Pseudo-variant calls and index in VCF format
`lineage_metadata`	Lineage assignment and associated metadata (tool versions etc.) output by `Pangolin`

Running workflows

Required software

Docker
Cromwell & Java (8+) OR miniwdl & python3

Running using Cromwell

From the root of the repository, run:

java -jar /path/to/cromwell.jar run /path/to/workflow.wdl -i /path/to/inputs.json

Output and execution files will be located in the cromwell-executions directory. When the workflow finishes successfully, it will output JSON (to stdout) specifying the full path to each output file.

Running using miniwdl

This command assumes you have miniwdl available on your command line. If miniwdl is not available, try installing using pip install miniwdl.

miniwdl run /path/to/workflow.wdl -i /path/to/inputs.json

Output and execution files will be located in a dated directory (e.g. named 20200704_073415_main). When the workflow finishes successfully, it will output JSON (to stdout) specifying the full path to each output file.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
workflows		workflows
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

COVID-19 variant calling pipeline

Workflows

FASTQ-based workflows

PacBio

Illumina

Oxford Nanopore

Assembly-based workflow

Workflow inputs

Workflow outputs

Running workflows

Required software

Running using Cromwell

Running using miniwdl

About

Releases

Packages

Contributors 2

Languages

License

DNAstack/covid-processing-pipeline

Folders and files

Latest commit

History

Repository files navigation

COVID-19 variant calling pipeline

Workflows

FASTQ-based workflows

PacBio

Illumina

Oxford Nanopore

Assembly-based workflow

Workflow inputs

Workflow outputs

Running workflows

Required software

Running using Cromwell

Running using miniwdl

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages