-
Notifications
You must be signed in to change notification settings - Fork 4
Why even lift?
Here we will list some examples where aligners can make mistakes aligning against a reference genome, and how "lifting-over" alignments from personal genome coordinates to reference genome coordinates can help fix those mistakes.
We will be using the personal genome of chr21 of NA12878 derived from the variants called by the 1000 Genomes project
Required software: bowtie2
(v2.3.5.1), bcftools
(v1.11)
- Download variant data
hg19 chr21:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr21.fa.gz
gunzip chr21.fa.gz
1KG VCF reference panel for chr21:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
- filter VCF file to include only SNPS and indels (currently, liftover cannot support anything more complicated)
bcftools view -V mnps,other -Oz ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > chr21.vcf.gz;
bcftools index chr21.vcf.gz
- create the personal genome for NA12878
bcftools consensus -f chr21.fa -H 1 -s NA12878 chr21.vcf.gz > chr21.NA12878.fa
- For comparison purposes, create bowtie2 indexes for both the reference genome and the personalized genome
bowtie2-build chr21.fa chr21.fa
bowtie2-build chr21.NA12878.fa chr21.NA12878.fa
Consider this read originating from position 21:10705417 of chr21.NA12878.fa
@simulated.44160
ATTTCTGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATCTCACAGGT
+
HHIHHIHHIGIIIIIGHIFIGEIEGIFHGIHDIFIFIIIDIIBGBII?II@IIIIIGHIDIDIIEHDIIIIIIFI@CEII>IDIII=IIIIIIIIFI<BI
Aligning this read to the hg19 reference using bowtie2 produces the following alignment:
$ bowtie2 -x chr21.fa -U input.fq 2>/dev/null
@HD VN:1.0 SO:unsorted
@SQ SN:21 LN:48129895
@PG ID:bowtie2 PN:bowtie2 VN:2.3.2 CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr21.fa -U input.fq"
simulated.44160 0 21 10705472 40 3M2I95M * 0 0 ...
The CIGAR string from this alignment is visualized like so:
hg19: GTT--TGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATCTCACAGGT
read: ATTTCTGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATCTCACAGGT
Notice that the alignment looks relatively clean, with a 2-base insertion wrt the reference
However, we get a different result when aligning to the personalized reference and lifting-over the alignment to the reference coordinate space:
$ bowtie2 -x chr21.NA12878.fa -U input.fq > out.sam
$ ./liftover -v chr21.vcf.gz -s NA12878 -a out.sam -p lifted
$ cat lifted.sam
@HD VN:1.6 SO:unknown
@SQ SN:21 LN:48129895
simulated.44160 0 21 10705461 255 5M9D95M * 0 0 ...
The lifted-over alignment is visualized like so:
hg19: ATTTCTGAGGGGTTTGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATC
read: ATTTC---------TGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATC
At first glance, this looks like a worse alignment than before! There's a 9 base deletion wrt the reference, and the position of the lifted-over alignment differs from the originally aligned position by 11 bases.
However, let's take a peek at the variants in the VCF file overlapped by this alignment:
$ bcftools view -H -r 21:10705461-10705561 chr21.vcf.gz
...
21 10705465 rs149334075 CTGAGGGGTT C 100 PASS ... GT 1|1
... a 9 base deletion!
While the alignment directly against the hg19 reference is technically a better scoring alignment, it failed to capture the large deletion detected by the alignment against the personal genome. This highlights the importance of variation-aware alignment.