Skip to content

FastaSeqFetcher

Dave Lawrence edited this page Mar 21, 2023 · 5 revisions

Background

HGVS uses BioCommons SeqRepo to retrieve genome and transcript sequences.

cdot collects as many transcripts as possible, and some of these are not in SeqRepo, meaning you can't resolve them using the BioCommons HGVS library

We have written a SeqRepo replacement that builds transcript sequences by pasting together exons from the genome. This means that every cdot transcript is now resolvable.

Warning: Transcript sequences may differ from a genome sequence - so do this at your own risk! (see section below)

Install

The fasta files need to be bgzipped then indexed with samtools, so they can have random access. You need samtools and bgzip installed.

To do this as you download them:

# GRCh37
wget --quiet -O - https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz | gzip -d | bgzip >  GCF_000001405.25_GRCh37.p13_genomic.fna.gz
samtools faidx GCF_000001405.25_GRCh37.p13_genomic.fna.gz
# GRCh38
wget --quiet -O - https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz | gzip -d | bgzip > GCF_000001405.39_GRCh38.p13_genomic.fna.gz
samtools faidx GCF_000001405.39_GRCh38.p13_genomic.fna.gz

Warnings / Transcript vs genome differences

Ensembl transcripts always match the genome reference - this only affects RefSeq where genome/transcripts.

The errors that could occur from a genome/transcript mismatch are:

  • Reference base changes
  • Indels are normalized differently as the repeats may be different (leading to a coordinate change)

When choosing what genome reference to use, we take the first one that has a contig we have a mapping for. If you want to ensure a certain build is used, you need to instantiate FastaSeqFetcher() with just one genome, then only convert HGVS using that genome.

It's not clear how often this will fail. The test/benchmark_hgvs.py script uses 500 random ClinVar c.HGVS records, and all converted without any issues.