Skip to content

Whole genome calling

Glenn Hickey edited this page May 6, 2019 · 8 revisions

Toil-vg call

vg call and vg genotype do not scale beyond graphs of a few megabases. The next generation of callers will hopefully be able to leverage the newer indexes in vg to get around this limitation, but in the meantime we split the input up with vg chunk a priori. This also helps to distribute computation to different processes.

To call a whole genome, pass the gam with --gams and a list of chromosome names with --chroms, along with the xg and sample name. Toil boilerplate as described in the README applies for AWS:

toil-vg call $TOIL_JS PATH-TO-XG SAMPLE-NAME $TOIL_OS --gams WHOLE-GENOME.gam --chroms $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}; done)

Use the --recall option to call SVs (that are already in the graph).

The steps performed by toil-vg call are described below (see https://github.com/vgteam/toil-vg/blob/master/src/toil_vg/vg_call.py for the exact steps), or run toil-vg call with --realTimeLogging enabled on a small example and grep Docker Run from the output to see the command-line tools run.

Most commands take -t to specify the number of threads. I've hardcoded a few options below, and probably forgot others. The config and Python are where to find the more complete options and descriptions.

In summary, vg chunk is used to split the input xg and GAM into a bunch of chunks. Each one has a .vg and .gam. A vcf is made for each chunk with vg call and they are merged together.

Update toil-vg now supports the --genotype_vcf option in order to genotype variants in the VCF. This VCF must have been used to construct the graph with toil-vg construct --alt_path_gam_index. This option is preferred over the --recall option, as the output is much easier to interpret and it is slightly more accurate.

1) Indexing the GAM

First, the GAM must be sorted and indexed with vg gamsort

vg gamsort INPUT-GAM -i INPUT-GAM.sorted.gai > INPUT-GAM.sorted.gam

This can take several hours, but only needs to be done once. If toil-vg call finds a .gai, it skips this step. The GAM can be a whole genome or a single chromosome, the remaining steps are unchanged.

2) Chunking the GAM

Next, the xg and indexed GAM are split up into overlapping 2.5 Megabase chunks with vg chunk.

Note that PATH_LIST is a text file with the name of each chromosome to call, one per line, ex

chr1
chr2
chr3 ...

Update toil-vg now uses vg paths --xg --lengths to get a list of all paths and their lengths from the XG. If no paths are specified on the command line, all the paths are used. If the user specified paths with --chroms, this command is still used to determine their lengths (which are used below).

vg chunk -x INPUT-XG \
-a INPUT-GAM.sorted.gam \
-P PATH_LIST \
-c 50 \
-g \
-s 2500000 \
-o 100000 \
-b call_chunk \
-t THREADS \
-E OUTPUT_BED \
-f

The OUTPUT_BED file will contain the information for each calling job. The first 3 columns are the BED coordinates of the chunk, the 4th column is the path of the GAM file for the chunk (replace .gam in this path to get the path of the .vg file of the chunk).

Update When using --genotype_vcf, the (indexed) alt path GAM (created with toil-vg construct/index --alt_path_gam_index) should be passed to vg chunk using via -a <alt_paths>.gam. Multiple occurrences of -a are now supported by vg chunk.

3) Calling each chunk

A line from the BED file above contains all the information necessary to call the chunk. The default options for everything are specified in vg_config.py, and are different for SNPs and SVs. The first step to calling is filtering the gam. This step isn't as useful as it used to be due to improvements in vg, but still improves calling small variants on some benchmarks

vg index chunk.vg -x chunk.xg
vg filter chunk.gam -r 0.9 -fu -m 1 -q 15 -D 999 -x chunk.xg > filtered.gam

Then augmenting the graph

Update When using --genotype_vcf, the graph must first be updated to include the alt paths. These paths are in a GAM chunk (same as alignment chunk, but with -1 added to the name's prefix): vg augment -i chunk.vg chunk-1.gam > chunk_alts.gam && mv chunk_alts.gam chunk.gam

vg augment chunk.vg filtered.gam -q 10 -Z chunk.trans -S chunk.support > chunk_aug.vg

Then the output can be used to make the VCF

vg call chunk_aug.vg -z chunk.trans -s chunk.support -c CHROMOSOME-NAME -r CHROMOSOME-NAME -l CHROMOSOME-LENGTH -o OFFSET > chunk.vcf

CHROMOSOME-NAME can be inferred from the BED file (first column). CHROMOSOME-LENGTH is the total length of the chromosome. It can be inferred from the BED file too (Update: toil-vg now gets path information from the xg -- see above). toil-vg does this by reading each line of the bed file and storing the minimum and maximum position for each chromosome, and computing size using that (see path_bounds and path_size variables in vg_call.py). OFFSET can be inferred from the BED file (second column).

vg genotype can be used instead of vg call. It is slower and takes more memory, but if the GAM is passed in with -G, vg augment is not necessary. It usually performs better on SNPs and small indels than vg call. Only vg call can be used to SVs.

Update When using --genotype_vcf, the input VCF gets passed to vg call with the -f option. All the other options are the same as the defaults using --recall. The VCF passed in via -f should be clipped as below in Step 5). The vg call output does not need to be clipped again.

4) Indexing each chunk

Now sort the VCF

head -10000 chunk.vcf | grep "^#"; cat $@ | grep -v "^#" | sort -k1,1d -k2,2n | bgzip > sorted.vcf.gz

Then index it

tabix -f -p vcf sorted.vcf.gz

5) Clip each chunk

Because of the overlap, we must clip chunks so they exactly represent the regions.

The clipped offset of the chunk_ith BED chunk would be

clipped_chunk_offset = chunk_i * 2500000 - chunk_i * 100000

That's enough to extract the region of interest from our VCF

bcftools view sorted.vcf.gz -t CHROMOSOME-NAME:START-END > clipped.vcf

where

START=clipped_chunk_offset + 1 (+ 50000 if not the first chunk)
END=clipped_chunk_offset + 2500000 -1 (-50000 if not last chunk)

The offset computation here is using the chunk size and half the overlap. In hindsight, it could be simplified by not creating overlapping chunks and boosting the context expansion (-c) in vg chunk, but this hasn't been tested.

Update Clipping not required at this point when using --genotype_vcf. It needs to be done before vg call instead.

6) Merge the chunks

bcftools concat clipped1.vcf.gz clipped2.vcf.gz ....