Skip to content

Why even lift?

Nae-Chyun Chen edited this page Feb 7, 2021 · 2 revisions

Why even lift?

Here we will list some examples where aligners can make mistakes aligning against a reference genome, and how "lifting-over" alignments from personal genome coordinates to reference genome coordinates can help fix those mistakes.

Preliminaries

We will be using the personal genome of chr21 of NA12878 derived from the variants called by the 1000 Genomes project

Required software: bowtie2 (v2.3.5.1), bcftools (v1.11)

  1. Download variant data

hg19 chr21:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr21.fa.gz
gunzip chr21.fa.gz

1KG VCF reference panel for chr21:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
  1. filter VCF file to include only SNPS and indels (currently, liftover cannot support anything more complicated)
bcftools view -V mnps,other -Oz ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > chr21.vcf.gz;
bcftools index chr21.vcf.gz
  1. create the personal genome for NA12878
bcftools consensus -f chr21.fa -H 1 -s NA12878 chr21.vcf.gz > chr21.NA12878.fa
  1. For comparison purposes, create bowtie2 indexes for both the reference genome and the personalized genome
bowtie2-build chr21.fa chr21.fa
bowtie2-build chr21.NA12878.fa chr21.NA12878.fa

The best-scoring alignment might not be the best alignment

Consider this read originating from position 21:10705417 of chr21.NA12878.fa

@simulated.44160
ATTTCTGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATCTCACAGGT
+
HHIHHIHHIGIIIIIGHIFIGEIEGIFHGIHDIFIFIIIDIIBGBII?II@IIIIIGHIDIDIIEHDIIIIIIFI@CEII>IDIII=IIIIIIIIFI<BI

Aligning this read to the hg19 reference using bowtie2 produces the following alignment:

$ bowtie2 -x chr21.fa -U input.fq 2>/dev/null
@HD     VN:1.0  SO:unsorted
@SQ     SN:21   LN:48129895
@PG     ID:bowtie2      PN:bowtie2      VN:2.3.2        CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr21.fa -U input.fq"
simulated.44160 0       21      10705472        40      3M2I95M *       0       0       ...

The CIGAR string from this alignment is visualized like so:

hg19: GTT--TGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATCTCACAGGT
read: ATTTCTGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATCTCACAGGT

Notice that the alignment looks relatively clean, with a 2-base insertion wrt the reference

However, we get a different result when aligning to the personalized reference and lifting-over the alignment to the reference coordinate space:

$ bowtie2 -x chr21.NA12878.fa -U input.fq > out.sam
$ ./liftover -v chr21.vcf.gz -s NA12878 -a out.sam -p lifted
$ cat lifted.sam
@HD     VN:1.6  SO:unknown
@SQ     SN:21   LN:48129895
simulated.44160 0       21      10705461        255     5M9D95M *       0       0       ...

The lifted-over alignment is visualized like so:

hg19: ATTTCTGAGGGGTTTGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATC
read: ATTTC---------TGATTCCTATGGTGAAAAAGAAATATCCTCCCATAAAAATTAGACAGAAACATTCTGAGAAACTTCTTTGTGATGTGTGCTTTCATC

At first glance, this looks like a worse alignment than before! There's a 9 base deletion wrt the reference, and the position of the lifted-over alignment differs from the originally aligned position by 11 bases.

However, let's take a peek at the variants in the VCF file overlapped by this alignment:

$ bcftools view -H -r 21:10705461-10705561 chr21.vcf.gz
...
21      10705465        rs149334075     CTGAGGGGTT      C       100     PASS    ...     GT      1|1

... a 9 base deletion!

While the alignment directly against the hg19 reference is technically a better scoring alignment, it failed to capture the large deletion detected by the alignment against the personal genome. This highlights the importance of variation-aware alignment.