Whole genome construction, indexing and GBWT

Glenn Hickey edited this page May 3, 2018 · 13 revisions

The commands below rely on the following environment variables, which specify output locations, AWS user info, etc.

# Set your AWS availability zone (a zone such as us-west-2a, not just a region)
export TOIL_AWS_ZONE="us-west-2a"
# This will be used below for the Toil Jobstore.  "jobstore" can be replaced with any name
export TOIL_JOBSTORE="aws:us-west-2:jobstore"
# All the output will be put in this S3 bucket.  "outstore" can be replaced with any name
export TOIL_OUTSTORE="aws:us-west-2:outstore"
# This must be a valid AWS keypair, with keys appropriately set up on your computer
export KEYPAIR_NAME=my_keypair_name
# This will be the name of the cluster leader node. It can be any name
export LEADER=leader
# This is a Toil version that seems to work with toil-vg
export TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.16.0a1.dev2281-c7d77b028064a739e897f7b1eb158c902b530475
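
Since every later command depends on these variables, it can help to confirm they are all set before starting. The following is a minimal sketch, not part of toil-vg: the function name `check_toil_env` is hypothetical, and it only checks that each variable is non-empty, not that its value is valid.

```shell
# Hypothetical helper (not part of toil-vg): report any required variable
# that is unset or empty. Returns 0 if all are set, 1 otherwise.
check_toil_env() {
    missing=0
    for name in TOIL_AWS_ZONE TOIL_JOBSTORE TOIL_OUTSTORE KEYPAIR_NAME LEADER TOIL_APPLIANCE_SELF; do
        # Indirect expansion via eval keeps this portable to POSIX sh
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return $missing
}
```

Running `check_toil_env && echo "environment OK"` after the exports prints the name of any variable that still needs to be set.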

Create a leader node (with current toil-vg master)

scripts/create-ec2-leader.sh $LEADER $KEYPAIR_NAME

Make a set of graphs (snp1kg, primary) for the whole genome, including xg and gcsa indexes. See the --min_af and --filter_ceph options to additionally create an allele-frequency-thresholded and/or CEPH-pedigree-filtered graph.

./scripts/construct-hs37d5-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE --leader $LEADER --gcsa --xg --primary --node i3.8xlarge:0.95,i3.8xlarge --max_node 5,2

Create a snp1kg graph and indexes, including the GBWT, and use the GBWT for GCSA pruning. (Note: if running this in addition to the command above, it is best to use a different TOIL_OUTSTORE to avoid clobbering the previous output.)

./scripts/construct-hs37d5-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE --leader $LEADER --gcsa --xg --gbwt --node i3.8xlarge:0.95,i3.8xlarge --max_node 6,4
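
One way to keep the GBWT run's output separate from an earlier run is to derive a second output store name from the first. This is only a sketch: the `-gbwt` suffix and the function name `gbwt_outstore` are assumptions made for illustration, and any distinct name would do.

```shell
# Hypothetical helper: derive a separate output store name by appending a suffix.
# The "-gbwt" suffix is an arbitrary choice for illustration.
gbwt_outstore() {
    echo "${1}-gbwt"
}

# Usage sketch (run before the GBWT construction command):
# export TOIL_OUTSTORE="$(gbwt_outstore "$TOIL_OUTSTORE")"
```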

The XG format changes frequently. This is an example of how to regenerate the xg index for the primary graph (modify the grep pattern and the index name to do the other graphs).

scripts/ec2-run.sh -l ${LEADER} "index aws:us-west-2:glennhickey-jobstore1 $TOIL_OUTSTORE --graphs $(for i in $(aws s3 ls s3://BUCKET_OF_GRAPHS | awk '{print $4}' | grep primary); do echo s3://BUCKET_OF_GRAPHS/${i}; done) --xg_index --index_name primary"
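
The nested command substitution above can be hard to read and modify. As a sketch, the graph list can be built in a separate step. The function name `graph_urls` is hypothetical, and it assumes the usual `aws s3 ls` listing layout in which the object name is the fourth whitespace-separated field, as the `awk` call in the command above already does.

```shell
# Hypothetical helper: read `aws s3 ls` output on stdin and print the full
# S3 URL of every object whose name matches the given grep pattern.
graph_urls() {
    bucket="$1"    # bucket name, e.g. BUCKET_OF_GRAPHS
    pattern="$2"   # grep pattern, e.g. primary
    awk '{print $4}' | grep -- "$pattern" | while read -r name; do
        echo "s3://${bucket}/${name}"
    done
}

# Usage sketch (BUCKET_OF_GRAPHS is the same placeholder as above):
# GRAPHS=$(aws s3 ls s3://BUCKET_OF_GRAPHS | graph_urls BUCKET_OF_GRAPHS primary | tr '\n' ' ')
# scripts/ec2-run.sh -l ${LEADER} "index $TOIL_JOBSTORE $TOIL_OUTSTORE --graphs $GRAPHS --xg_index --index_name primary"
```

Changing the pattern (and the --index_name value) selects a different set of graphs without touching the rest of the command.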