This repository has been archived by the owner on May 30, 2024. It is now read-only.

Workflows and containers for processing COVID-19 data.

DNAstack/covid-processing-pipeline

COVID-19 variant calling pipeline

This repository contains workflows for processing SARS-CoV-2 data.

The FASTQ-based workflows produce variant calls, assembled genomes, and lineage assignments from raw sequencing reads. They can process reads from PacBio, Oxford Nanopore (single-end), and Illumina (paired-end) sequencing platforms.

An assembly-based workflow, which calculates pseudo-variant sites and assigns lineage from assembled SARS-CoV-2 genomes, is also included.

Docker image definitions can be explored in DNAstack's image repository, or on Dockerhub.

Workflows

FASTQ-based workflows

These workflows can be used to process FASTQ files into variant calls, assembled genomes, and lineage metadata.

Choose the workflow that corresponds to your sequencing data type.

The inputs and outputs for each FASTQ-based workflow are described in detail in that workflow's repository, linked below (PacBio, Illumina, Oxford Nanopore). In addition to the output files specified in those repositories, each FASTQ-based workflow outputs a file of lineage metadata, calculated using Pangolin for the genome assembled during workflow execution.

PacBio

This workflow uses PacBio's CoSA pipeline to process Pacific Biosciences SARS-CoV-2 long-read HiFi data.

Illumina

This workflow uses the SIGNAL pipeline to process Illumina paired-end SARS-CoV-2 sequencing data.

Oxford Nanopore

This workflow uses the Connor lab's implementation of the ARTIC pipeline to process Oxford Nanopore single-end SARS-CoV-2 sequencing data.

Assembly-based workflow

The variants-from-assembly workflow can be used to determine 'pseudo-variant' sites when no raw sequencing data is available. This workflow aligns the provided assembled SARS-CoV-2 genome to the reference genome, then uses snp-sites to determine sites that differ from the reference. Variant sites are output in VCF format. Viral lineage is assigned using Pangolin, as in the FASTQ-based workflows.

Note that because base quality information is not available when calling variants from an assembly alone, these variant sites cannot be filtered by quality and should be used for exploratory analysis only. In addition, indels cannot be called using this method. Prefer the FASTQ-based workflows when raw sequencing data is available.
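To make the idea of a pseudo-variant site concrete, here is a minimal Python sketch that compares two rows of a pairwise alignment and reports substitutions in reference coordinates. It is an illustration only, not a reimplementation of snp-sites or of this workflow; the function name `pseudo_variant_sites` and the choice to skip `N` bases are assumptions made for the example.

```python
def pseudo_variant_sites(ref_aln: str, sample_aln: str):
    """Return (reference_position, ref_base, alt_base) for each substitution.

    Both arguments are rows of the same pairwise alignment, so they must
    have equal length; '-' marks an alignment gap.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have the same length")

    sites = []
    ref_pos = 0  # 1-based coordinate in the ungapped reference
    for ref_base, alt_base in zip(ref_aln.upper(), sample_aln.upper()):
        if ref_base != "-":
            ref_pos += 1
        # Gaps on either side would be indels, which this approach skips.
        # Skipping ambiguous N bases is an assumption of this sketch.
        if "-" in (ref_base, alt_base) or "N" in (ref_base, alt_base):
            continue
        if ref_base != alt_base:
            sites.append((ref_pos, ref_base, alt_base))
    return sites


if __name__ == "__main__":
    # Toy alignment, not real SARS-CoV-2 sequence.
    print(pseudo_variant_sites("ACGT-ACGT", "ACTTGAC-A"))
```

Because only the two aligned sequences are compared, no read depth or base quality enters the result, which is why such sites cannot be quality-filtered.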

Workflow inputs

| Input | Description |
|---|---|
| accession | Sample ID |
| assembly | Assembled SARS-CoV-2 genome |
| reference_genome | The SARS-CoV-2 reference genome |
| reference_genome_id | [MN908947.3] |
| container_registry | Registry that hosts workflow containers. All containers are hosted in DNAstack's Dockerhub [dnastack] |
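As a sketch of how these inputs might be assembled into an inputs file, the Python below builds a JSON document with workflow-namespaced keys, the form both Cromwell and miniwdl accept. The workflow name `variants_from_assembly`, the sample ID, and the file paths are hypothetical placeholders; the name must match the workflow declared in the WDL file you run. The bracketed values in the table are treated here as defaults, which is an assumption.

```python
import json


def build_inputs(workflow_name, accession, assembly, reference_genome,
                 reference_genome_id="MN908947.3",
                 container_registry="dnastack"):
    """Return an inputs dict keyed as '<workflow_name>.<input_name>'."""
    values = {
        "accession": accession,
        "assembly": assembly,
        "reference_genome": reference_genome,
        "reference_genome_id": reference_genome_id,
        "container_registry": container_registry,
    }
    return {f"{workflow_name}.{name}": value for name, value in values.items()}


if __name__ == "__main__":
    # Hypothetical workflow name, sample ID, and paths; replace with your own.
    inputs = build_inputs(
        "variants_from_assembly",
        accession="sample_001",
        assembly="/path/to/sample_001.fasta",
        reference_genome="/path/to/MN908947.3.fasta",
    )
    print(json.dumps(inputs, indent=2))
```

Redirecting the printed JSON to a file gives an `inputs.json` that can be passed with `-i`.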

Workflow outputs

| Output | Description |
|---|---|
| vcf, vcf_index | Pseudo-variant calls and index in VCF format |
| lineage_metadata | Lineage assignment and associated metadata (tool versions etc.) output by Pangolin |
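For quick exploratory inspection of the `vcf` output, a few lines of Python are enough to list the called sites. This is a minimal sketch that reads only the fixed CHROM, POS, REF, and ALT columns defined by the VCF format; the function name `read_vcf_sites` is hypothetical, and it assumes the file is either plain text or gzip/bgzip-compressed with a `.gz` suffix.

```python
import gzip


def read_vcf_sites(path):
    """Return a list of (chrom, pos, ref, alt) tuples from a VCF file."""
    # bgzip output is gzip-compatible, so gzip.open can read it.
    opener = gzip.open if str(path).endswith(".gz") else open
    sites = []
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip meta-information lines, the column header, and blanks.
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            sites.append((chrom, int(pos), ref, alt))
    return sites
```

For anything beyond a quick look, a dedicated VCF library is likely a better choice than hand-rolled parsing.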

Running workflows

Required software

Running using Cromwell

From the root of the repository, run:

java -jar /path/to/cromwell.jar run /path/to/workflow.wdl -i /path/to/inputs.json

Output and execution files will be located in the cromwell-executions directory. When the workflow finishes successfully, it will output JSON (to stdout) specifying the full path to each output file.

Running using miniwdl

This command assumes miniwdl is available on your command line. If it is not, try installing it with pip install miniwdl.

miniwdl run /path/to/workflow.wdl -i /path/to/inputs.json

Output and execution files will be located in a dated directory (e.g. named 20200704_073415_main). When the workflow finishes successfully, it will output JSON (to stdout) specifying the full path to each output file.
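When launching runs from a script, the two invocations above can be wrapped in a small helper. The Python sketch below only assembles the argument list for each engine, using exactly the commands shown in this section; the function name `build_command` and its parameters are hypothetical.

```python
import shlex


def build_command(engine, workflow, inputs, cromwell_jar=None):
    """Return the argument list for running a workflow with the given engine."""
    if engine == "cromwell":
        if cromwell_jar is None:
            raise ValueError("cromwell_jar is required for the cromwell engine")
        return ["java", "-jar", cromwell_jar, "run", workflow, "-i", inputs]
    if engine == "miniwdl":
        return ["miniwdl", "run", workflow, "-i", inputs]
    raise ValueError(f"unknown engine: {engine!r}")


if __name__ == "__main__":
    cmd = build_command("miniwdl", "/path/to/workflow.wdl", "/path/to/inputs.json")
    # Print the command in shell-quoted form for review before running it.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to execute the workflow.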
