Skip to content

Use an NCBI Assembly Report to Update Contig Names in FASTA VCF GFF BED IntervalList Delimited Files

Nils Homer edited this page Feb 24, 2022 · 5 revisions

Want to change the contig names (e.g. to UCSC's names) in your FASTA, use fgbio!

  1. Download an NCBI assembly report. For example, the assembly report for GRCh38.p14:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_assembly_report.txt
  1. Download the corresponding FASTA. For example, the FASTA for GRCh38.p14:

Don't forget to also de-compress (gunzip) it!

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_genomic.fna.gz
gunzip GCA_000001405.29_GRCh38.p14_genomic.fna.gz
  1. Create a sequence dictionary with contig aliases from (1) with fgbio CollectAlternateContigNames:

The sequencing dictionary is just a SAM file header! Use the --primary argument to specify the column in the assembly report to use as the primary contig names. Use --alternates to specify the names of other columns to use as alternate contig names (aliases). Use --sequence-roles to define which contig types you want to include.

fgbio CollectAlternateContigNames \
  --input GCA_000001405.29_GRCh38.p14_assembly_report.txt \
  --output GCA_000001405.29_GRCh38.p14_assembly_report.dict \
  --primary UcscName \
  --alternates RefSeqAccession GenBankAccession AssignedMolecule \
  --sequence-roles AssembledMolecule UnlocalizedScaffold UnplacedScaffold AltScaffold

This should create a .dict file:

$ head GCA_000001405.29_GRCh38.p14_assembly_report.dict  
@HD	VN:1.6
@SQ	SN:chr1	LN:248956422	am:1	ga:CM000663.2	sn:1	ra:na	un:chr1	AN:CM000663.2,1	sr:assembled-molecule
@SQ	SN:chr2	LN:242193529	am:2	ga:CM000664.2	sn:2	ra:na	un:chr2	AN:CM000664.2,2	sr:assembled-molecule
@SQ	SN:chr3	LN:198295559	am:3	ga:CM000665.2	sn:3	ra:na	un:chr3	AN:CM000665.2,3	sr:assembled-molecule
@SQ	SN:chr4	LN:190214555	am:4	ga:CM000666.2	sn:4	ra:na	un:chr4	AN:CM000666.2,4	sr:assembled-molecule
@SQ	SN:chr5	LN:181538259	am:5	ga:CM000667.2	sn:5	ra:na	un:chr5	AN:CM000667.2,5	sr:assembled-molecule
@SQ	SN:chr6	LN:170805979	am:6	ga:CM000668.2	sn:6	ra:na	un:chr6	AN:CM000668.2,6	sr:assembled-molecule
@SQ	SN:chr7	LN:159345973	am:7	ga:CM000669.2	sn:7	ra:na	un:chr7	AN:CM000669.2,7	sr:assembled-molecule
@SQ	SN:chr8	LN:145138636	am:8	ga:CM000670.2	sn:8	ra:na	un:chr8	AN:CM000670.2,8	sr:assembled-molecule
@SQ	SN:chr9	LN:138394717	am:9	ga:CM000671.2	sn:9	ra:na	un:chr9	AN:CM000671.2,9	sr:assembled-molecule

Notice the alternate contig names in the AN tag (as per the SAM spec). Lower case tags contain additional information about each contig.

  1. Index the FASTA with samtools faidx:
samtools faidx GCA_000001405.29_GRCh38.p14_genomic.fna
  1. Update the FASTA with fgbio UpdateFastaContigNames:

Below we use --skip-missing to skip outputting contigs in the input FASTA that are not in our input sequence dictionary, which is required because we did not select all the contigs from the assembly report in step (2). The --sort-by-dict option is useful if you want to change the order of the contigs in the sequence dictionary before running this step (some folks like Mitochondria first, who can say).

fgbio UpdateFastaContigNames \
  --input GCA_000001405.29_GRCh38.p14_genomic.fna \
  --dict GCA_000001405.29_GRCh38.p14_assembly_report.dict \
  --output GRCh38.p14_genomic.fna \
  --sort-by-dict true \
  --skip-missing
  1. Rename/copy the sequence dictionary in to place, and index the output FASTA:
cp -v GCA_000001405.29_GRCh38.p14_assembly_report.dict GRCh38.p14_genomic.dict
samtools faidx GRCh38.p14_genomic.fna
  1. Update other file types with the fgbio Update*ContigNames tools:

In particular, the fgbio UpdateDelimitedFileContigNames tool can update any delimited file (e.g. BED file).