Running a Whole Genome Pedigree Dataset

Charles Markello edited this page Jan 7, 2020 · 11 revisions

Setup Instructions

Set up the main working directory

cd into a directory with at least 2 TB of allocated disk space:

cd /data/$USER
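Before continuing, it can help to confirm that the directory really has about 2 TB free. The following is a minimal sketch using only `df` and `awk`; `TARGET_DIR` and `REQUIRED_GB` are illustrative names, not part of vg_wdl. Note that `df` reports free space on the filesystem, which may differ from any per-user quota the cluster enforces.

```shell
#!/usr/bin/env bash
# Sketch: report free space (in GB) on the filesystem that holds a directory.
# TARGET_DIR and REQUIRED_GB are illustrative names, not part of vg_wdl.
TARGET_DIR="${TARGET_DIR:-$PWD}"
REQUIRED_GB=2048   # roughly 2 TB

# df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
AVAILABLE_KB=$(df -Pk "${TARGET_DIR}" | awk 'NR==2 {print $4}')
AVAILABLE_GB=$(( AVAILABLE_KB / 1024 / 1024 ))

if [ "${AVAILABLE_GB}" -ge "${REQUIRED_GB}" ]; then
    echo "OK: ${AVAILABLE_GB} GB available in ${TARGET_DIR}"
else
    echo "WARNING: only ${AVAILABLE_GB} GB available in ${TARGET_DIR}; about ${REQUIRED_GB} GB is recommended"
fi
```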

Launch an interactive session on Biowulf and load requisite Biowulf modules:

sinteractive
module load cromwell/40 git python/3.6

Clone the GitHub repo and create a work directory for running the WDL workflow:

VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
mkdir -p ${VG_WDL_DIR} && cd ${VG_WDL_DIR}
git clone https://github.com/vgteam/vg_wdl.git

Download workflow inputs and set up miniwdl virtual environment to run vg_wdl workflows:

WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
${VG_WDL_DIR}/vg_wdl/scripts/setup_vg_wdl.sh -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
exit

Input Read Setup Instructions

Set up the cohort working directory and collect the input reads for the cohort (this should take a few minutes). In this template, change COHORT_NAME, COHORT_NAMES_LIST and COHORT_INPUT_DATA. COHORT_NAME should be the sample name of the proband in a UDP cohort. The COHORT_NAMES_LIST bash array must list the proband, sibling and parental IDs, separated by spaces.

COHORT_INPUT_DATA should contain the full path to the directory containing all raw read data of the cohort. For example, if the raw reads for PROBAND and SIBLING_1 are located in /data/Udpdata/Individuals/PROBAND/R1_fastq.gz and /data/Udpdata/Individuals/SIBLING_1/R1_fastq.gz respectively, then the path for COHORT_INPUT_DATA should be /data/Udpdata/Individuals.

COHORT_NAME="UDP****"
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${COHORT_NAME}_cohort_workdir"
COHORT_INPUT_DATA="/PATH/TO/DIRECTORY/CONTAINING/INPUT/READS"
${VG_WDL_DIR}/vg_wdl/scripts/setup_input_reads.sh -l "${COHORT_NAMES_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -c ${COHORT_INPUT_DATA}
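The `-l` argument above relies on `"${COHORT_NAMES_LIST[*]}"`, which joins the array into a single space-delimited string. The sketch below shows that behaviour, along with a hypothetical helper, `missing_sample_dirs`, that reports samples with no subdirectory under the input path. The helper is not part of vg_wdl, and it assumes the one-directory-per-sample layout from the example paths above.

```shell
#!/usr/bin/env bash
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")

# "${ARRAY[*]}" expands to ONE word, with elements joined by the first
# character of IFS (a space by default).
JOINED="${COHORT_NAMES_LIST[*]}"
echo "${JOINED}"

# Hypothetical helper (not part of vg_wdl): print every sample name that has
# no directory of the same name under the given input directory.
missing_sample_dirs() {
    local input_dir="$1"; shift
    local sample
    for sample in "$@"; do
        [ -d "${input_dir}/${sample}" ] || echo "${sample}"
    done
}
```

For example, `missing_sample_dirs "${COHORT_INPUT_DATA}" "${COHORT_NAMES_LIST[@]}"` prints nothing when every sample directory is present.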

Running the Workflow

Step 1: Maternal-Paternal-Proband VG Mapping

Set up the input variables for the cohort work directory. Only MATERNAL_SAMPLE_NAME, PATERNAL_SAMPLE_NAME and PROBAND_SAMPLE_NAME need to be changed from this template.

MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
PROBAND_SAMPLE_NAME="UDP_CHILD"
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${PROBAND_SAMPLE_NAME}_cohort_workdir"

Set up the workflow bash script

${VG_WDL_DIR}/vg_wdl/scripts/setup_trio_mapping_script.part_1.sh -p ${PROBAND_SAMPLE_NAME} -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}

Run the trio mapping workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=100g --gres=lscratch:200 --time=72:00:00 ${PROBAND_SAMPLE_NAME}_cohort_trio_map.part_1.sh

Step 2: Maternal-Paternal-Proband Genotyping

Set up the input variables

MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
PROBAND_SAMPLE_NAME="UDP_CHILD"
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${PROBAND_SAMPLE_NAME}_cohort_workdir"

Set up the workflow bash script

${VG_WDL_DIR}/vg_wdl/scripts/setup_trio_calling_script.part_2.sh -p ${PROBAND_SAMPLE_NAME} -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}

Run the trio genotyping workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=100g --gres=lscratch:200 --time=24:00:00 ${PROBAND_SAMPLE_NAME}_cohort_trio_call.part_2.sh
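Each step repeats the same variable definitions, so a new shell session has to redefine them. One possible convenience, sketched below, is to keep them in a small file and `source` it at the start of each step. The file name `cohort_env.sh` and the `ENV_FILE` variable are hypothetical and not part of vg_wdl.

```shell
#!/usr/bin/env bash
# Sketch: store the shared variables once and reload them in later sessions.
# cohort_env.sh is a hypothetical file name, not something vg_wdl provides.
ENV_FILE="${ENV_FILE:-./cohort_env.sh}"

# The quoted 'EOF' keeps $USER and ${...} unexpanded in the file, so they are
# resolved when the file is sourced.
cat > "${ENV_FILE}" <<'EOF'
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
PROBAND_SAMPLE_NAME="UDP_CHILD"
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${PROBAND_SAMPLE_NAME}_cohort_workdir"
EOF

# In any later session, reload the definitions with a single line.
source "${ENV_FILE}"
echo "${COHORT_WORKFLOW_DIR}"
```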

Step 3: Parental Graph Construction

The PED_FILE input variable must point to a valid .ped file named either COHORT_ID.ped or PROBAND_SAMPLE_ID.ped, in the tab-delimited PED file format. The .ped file must contain only the mother-father-proband trio. For example, the HG002 trio file looks like the following, where the proband is HG002, the father is HG003 and the mother is HG004:

#Family ID  Sample ID  Father  Mother  Sex[1=M]    Affected[2=A]
HG002   HG002   HG003   HG004   1   2
HG002   HG003   0   0   1   1
HG002   HG004   0   0   2   1
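Because the file must be tab-delimited, writing it with `printf` avoids accidentally saving spaces from a text editor. The sketch below reproduces the HG002 example. `PROBAND_SEX` is an illustrative variable, and the sex and affected-status values are assumptions to adapt to the actual cohort.

```shell
#!/usr/bin/env bash
# Sketch: write a trio PED file with real tab characters between columns.
PROBAND_SAMPLE_NAME="HG002"
PATERNAL_SAMPLE_NAME="HG003"
MATERNAL_SAMPLE_NAME="HG004"
PED_FILE="${PED_FILE:-${PROBAND_SAMPLE_NAME}.ped}"
PROBAND_SEX=1   # assumption for this example: 1 = male, 2 = female

{
    # Header comment line, then proband, father and mother rows.
    printf '#Family ID\tSample ID\tFather\tMother\tSex[1=M]\tAffected[2=A]\n'
    printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
        "${PROBAND_SAMPLE_NAME}" "${PROBAND_SAMPLE_NAME}" \
        "${PATERNAL_SAMPLE_NAME}" "${MATERNAL_SAMPLE_NAME}" "${PROBAND_SEX}" 2
    printf '%s\t%s\t0\t0\t1\t1\n' "${PROBAND_SAMPLE_NAME}" "${PATERNAL_SAMPLE_NAME}"
    printf '%s\t%s\t0\t0\t2\t1\n' "${PROBAND_SAMPLE_NAME}" "${MATERNAL_SAMPLE_NAME}"
} > "${PED_FILE}"
```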

Set up the input variables

MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
PROBAND_SAMPLE_NAME="UDP_CHILD"
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${PROBAND_SAMPLE_NAME}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${PROBAND_SAMPLE_NAME}.ped"
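A quick format check on the PED file can catch problems before the job is submitted. The following is a minimal sketch, assuming the trio layout described above (three sample rows of six tab-delimited fields). `check_trio_ped` is a hypothetical helper, not part of vg_wdl.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of vg_wdl): verify that a PED file exists and
# holds exactly three sample rows of six tab-delimited fields each.
check_trio_ped() {
    local ped_file="$1"
    [ -f "${ped_file}" ] || { echo "PED file not found: ${ped_file}" >&2; return 1; }
    awk -F'\t' '
        /^#/ || NF == 0 { next }    # skip the header comment and blank lines
        NF != 6 { printf "line %d: expected 6 tab-delimited fields, found %d\n", NR, NF; bad = 1 }
        { rows++ }
        END {
            if (rows != 3) { printf "expected 3 sample rows, found %d\n", rows; bad = 1 }
            exit bad
        }
    ' "${ped_file}"
}
```

It can be run as `check_trio_ped "${PED_FILE}" && echo "PED file looks well-formed"`. The check covers layout only and does not confirm that the sample names match the reads.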

Set up the workflow bash script

${VG_WDL_DIR}/vg_wdl/scripts/setup_parent_graph_construct_script.part_3.sh -p ${PROBAND_SAMPLE_NAME} -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}

Run the parental graph construction workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=50g --gres=lscratch:100 --time=72:00:00 ${PROBAND_SAMPLE_NAME}_cohort_parental_graph_construction.part_3.sh

Step 4: Sibling VG Mapping against Parental Graph Reference

Set up the input variables for the cohort work directory. The SIBLING_ID_LIST bash array must list the proband and sibling IDs, separated by spaces, with the proband first. For example, for a pedigree with one proband UDP_PROBAND and two additional siblings UDP_SIB_1 and UDP_SIB_2: SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2").

SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"

${VG_WDL_DIR}/vg_wdl/scripts/setup_sibling_mapping_script.part_4.sh -s "${SIBLING_ID_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}

Run the sibling alignment workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=50g --gres=lscratch:200 --time=72:00:00 ${SIBLING_ID_LIST[0]}_cohort_2nd_iter_sibling_map.part_4.sh
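The proband-first ordering matters because the commands above derive both the work directory and the script name from `${SIBLING_ID_LIST[0]}`. The sketch below shows how those names resolve; `PROBAND_ID`, `ADDITIONAL_SIBLINGS`, `WORKDIR_NAME` and `SCRIPT_NAME` are illustrative variables only.

```shell
#!/usr/bin/env bash
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")

# Index 0 is the proband; the slice from index 1 onward holds the other siblings.
PROBAND_ID="${SIBLING_ID_LIST[0]}"
ADDITIONAL_SIBLINGS=("${SIBLING_ID_LIST[@]:1}")

# Names built the same way as in the commands above.
WORKDIR_NAME="${PROBAND_ID}_cohort_workdir"
SCRIPT_NAME="${PROBAND_ID}_cohort_2nd_iter_sibling_map.part_4.sh"

echo "proband:   ${PROBAND_ID}"
echo "siblings:  ${ADDITIONAL_SIBLINGS[*]}"
echo "work dir:  ${WORKDIR_NAME}"
echo "script:    ${SCRIPT_NAME}"
```

If a sibling were listed first, these names would point to a work directory that the earlier steps never created.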

Step 5: 2nd Iteration Sibling Genotyping and Cohort Joint Genotyping

Set up the input variables for the cohort work directory. The SIBLING_ID_LIST bash array must list the proband and sibling IDs, separated by spaces, with the proband first. For example, for a pedigree with one proband UDP_PROBAND and two additional siblings UDP_SIB_1 and UDP_SIB_2: SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2").

MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"

${VG_WDL_DIR}/vg_wdl/scripts/setup_pedigree_calling_script.part_5.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}

Run the cohort genotyping workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=50g --gres=lscratch:200 --time=24:00:00 ${SIBLING_ID_LIST[0]}_cohort_2nd_iter_pedigree_call.part_5.sh

Step 6: 2nd Iteration Cohort Indel realignment

Set up the input variables for the cohort work directory. The SIBLING_ID_LIST bash array must list the proband and sibling IDs, separated by spaces, with the proband first. For example, for a pedigree with one proband UDP_PROBAND and two additional siblings UDP_SIB_1 and UDP_SIB_2: SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2").

MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"

${VG_WDL_DIR}/vg_wdl/scripts/setup_pedigree_indel_realignment_script.part_6.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}

Run the cohort indel realignment workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=50g --gres=lscratch:200 --time=24:00:00 ${SIBLING_ID_LIST[0]}_cohort_2nd_iter_pedigree_indel_realign.part_6.sh

Copy files to an output directory

MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
OUTPUT_DIR="${SIBLING_ID_LIST[0]}_workflow_outputs"

${VG_WDL_DIR}/vg_wdl/scripts/collect_outputs.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -o ${OUTPUT_DIR}

Delete intermediate workflow directories if they still exist

rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_trio_map.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_trio_call.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_parental_graph_construction.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_2nd_iter_sibling_map.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_2nd_iter_pedigree_call.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_2nd_iter_pedigree_indel_realign.final_outputs
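The same cleanup can be written as a guarded loop that stops if a variable is empty and removes only directories that exist. This is a sketch of one way to do it; `cleanup_intermediates` is a hypothetical helper, not part of vg_wdl. The stage names are taken from the `rm` commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of vg_wdl): remove the intermediate
# *.final_outputs directories for one cohort.
cleanup_intermediates() {
    local workdir="$1" prefix="$2"
    local stage dir
    for stage in trio_map trio_call parental_graph_construction \
                 2nd_iter_sibling_map 2nd_iter_pedigree_call \
                 2nd_iter_pedigree_indel_realign; do
        # ${var:?} aborts with an error if the variable is unset or empty,
        # so the path can never collapse to something like "/_cohort_...".
        dir="${workdir:?workdir must be set}/${prefix:?prefix must be set}_cohort_${stage}.final_outputs"
        if [ -d "${dir}" ]; then
            echo "removing ${dir}"
            rm -rf -- "${dir}"
        fi
    done
}
```

A call such as `cleanup_intermediates "${COHORT_WORKFLOW_DIR}" "${SIBLING_ID_LIST[0]}"` has the same effect as the six commands above.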