The pipeline is designed to perform joint variant calling on large Plasmodium
Whole Genome Sequencing (WGS) datasets. It follows GATK
best practices and the
MalariaGEN Pf6 data-generating
methods.
Briefly, raw reads are first mapped to the human GRCh38 reference genome to
remove host reads, with the remaining reads being mapped to the Pf3D7 reference
genome (PlasmoDB_44). The mapped reads are then processed using GATK
's
MarkDuplicates
and BaseRecalibrator
tools. Following this, analysis-ready
mapped reads for each isolate are used to generate per-sample calls
(HaplotypeCaller
/GVCF mode). These per-sample calls are combined and run
through a joint-call step (GenotypeGVCFs
) to obtain unfiltered multi-sample
VCFs. A machine learning-based variant filtration strategy (VQSR via GATK
's
VariantRecalibrator
), or/and hard-filteration strategy, can then be used to
retain high-quality variants.
The main
branch of the repository is employed for Plasmodium falciparum.
However, the pipeline is expected to work with other Plasmodium species,
provided the corresponding configuration and reference files are given (See
nextflow.config). A separate vivax
branch with configuration and reference
files for Plasmodium vivax under development and can be checked with the
link.
The pipeline has been tested on MacOS and Linux system. The software
dependencies (including the version numbers of used software) are defined within the
env/nf.yaml
Conda recipe (for Nextflow engine) and the env/snp_call_nf.yaml
Conda recipe (for the pipeline itself). Installtion instruction is listed below.
The estimated time of instation is about 5-20 minutes.
- Change to a working folder that is large enough to store the snp call result files. Git clone the pipeline and change directory to the pipeline folder
cd YOUR_WORKING_DIR # replace `YOUR_WORKING_DIR` with your real path
git clone https://github.com/bguo068/snp_call_nf.git
cd snp_call_nf
- Install conda from here if you have not
- Install the
nf
and thesnp_call_nf
conda environments
# NOTE: The nextflow engine and the pipeline may need different version of java.
# We use two different Conda environments to address the conflict.
# install nextflow
conda env create -f env/nf.yaml
# install snp_call_nf
conda env create -f env/snp_call_nf.yaml
- Link the reference files (internal users) or prepare them by yourself (external users)
- Link the ref files on IGS server
ln -s /local/projects-t3/toconnor_grp/bing.guo/ref/* ref/
or
ln -s /local/projects-t2/CVD/Takala-Harrison/Cambodia_Bing/ref/* ref/
- Link the ref files on Rosalind, the reference file can be linked by running
ln -s /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref/* ref/
- Prepare ref file by yourself:
cd ref
conda activate snp_call_nf
python3 prep_ref_files.py
conda deactivate
cd ..
- Run the pipeline
- Test it on HPC (local):
conda activate nf; nextflow main.nf
- This will use a tiny test dataset from
test_data
folder - To test on a larger test data, please run
cd ena_data; ena_data/download_ena_data.sh
(it may take hours to download all the real data from ENA). Once downloaded, change directory to project folder (wheremain.nf
is located) by runcd ..
, and the pipeline can be run withconda activate nf; nextflow main.nf --fq_map ena_data/ena_fastq_map.tsv
- This will use a tiny test dataset from
- Test it on SGE server:
conda activate nf; nextflow main.nf -profile sge
- This will use a small dataset from
test_data
folder
- This will use a small dataset from
- You will need to edit
fastq_map.tsv
file to include the raw reads(fastq.gz
files) of your own samples.
- Test it on HPC (local):
-
Split chromosomes to better parallelize joint call:
- by default, the genome is split by chromosomes
- you can specify cmd line option
--split intervals
to split the chromosome into more intervals.
-
Enable
vqsr
variant filtering. By default,vqsr
is not enabled. To enable this option, you can specify--vqsr true
to the nextflow command line
- Main input file is
./fastq_map.tsv
- Five columns delimited by tab: string, interger, string, interger, string
HostId
is the index of host genomes from 0, seeparams.host
in nextflow.config fileMateId
can be 0 for single-end sequencing, or 1 and 2 for pair-end sequencing
- Main configureation file is
./nextflow.config
- For SEG users, be sure to edit sge config about
clusterOptions = "-P toconnor-lab -cwd -V"
to reflect your lab specifc sge qsub option
- For SEG users, be sure to edit sge config about
- Main pipeline script is
./main.nf
- Main output files/folders:
result/read_length
folder: report the raw read length for each samples/runsresult/flagstat_host
andresult/flagstat_parasite
: flagstat of aligned reads (aligned with host genome and parasite genome respectively)result/recalibrated
: analysis ready bam filesresult/coverage
: read converage based on the analysis-ready bam filesresult/flagstat
: bam flatstat based on the analysis-ready bam filesresult/gvcf
: single-sample vcf filesresult/hardfilt
: multiple-sample (joint-call) vcf files with hard filterating annotationsresult/vqsrfilt
: multiple-sample (joint-call) vcf files with vqsr-based filterating annotations. You can decide to use one of these,result/hardfilt
andresult/vqsrfilt
.
This pipeline was originally developed for the posseleff
project.
If you find this pipeline useful, please consider citing our preprint:
Guo, B., Borda, V., Laboulaye, R., Spring, M. D., Wojnarski, M., Vesely, B. A., Silva, J. C., Waters, N. C., O'Connor, T. D., & Takala-Harrison, S. (2023). Strong Positive Selection Biases Identity-By-Descent-Based Inferences of Recent Demography and Population Structure in Plasmodium falciparum. bioRxiv : the preprint server for biology, 2023.07.14.549114. https://doi.org/10.1101/2023.07.14.549114