genes_json_file

Dave Lawrence edited this page Nov 21, 2022 · 3 revisions

PyReference now uses cdot data files for loading gene/transcript information. cdot provides transcripts for HGVS resolution, and needs to work with all historical and latest versions of GTF/GFF3s from RefSeq and Ensembl. Making a JSON format that can work with both projects reduces effort going forward.

Download prebuilt files

cdot hosts pre-built JSON data.

Below are the latest files for RefSeq and Ensembl. Note: GRCh37 annotation is updated infrequently, so the latest GRCh37 files can be quite old.

RefSeq

Ensembl

Create from GFF/GTF

See cdot wiki

git clone https://github.com/SACGF/cdot
export CDOT_DIR=$(pwd)/cdot/generate_transcript_data

# This generates a gene info JSON file (only need 1 for all generated gene JSON files)
export EMAIL=you@example.com  # Placeholder address: make sure to change this
${CDOT_DIR}/gene_info.sh
CDOT_VERSION=$(${CDOT_DIR}/cdot_json.py --version)
GENE_INFO_JSON=gene-info-${CDOT_VERSION}.json.gz

# Example for refseq GRCh38
FILENAME=GCF_000001405.40_GRCh38.p14_genomic.gff.gz
URL=https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/110/GCF_000001405.40_GRCh38.p14/${FILENAME}
wget ${URL}
${CDOT_DIR}/cdot_json.py gff3_to_json --url=${URL} --genome-build=GRCh38 --gene-info-json ${GENE_INFO_JSON} --output ${FILENAME}.json.gz ${FILENAME}
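
Once the conversion finishes, it can be worth checking that the output is an intact gzip archive containing parseable JSON before pointing PyReference at it. The sketch below is not part of cdot: `check_json_gz` is a hypothetical helper name, and it assumes `gzip` and `python3` are on the PATH. It only checks that the file decompresses and parses, and says nothing about whether the contents match the schema cdot expects.

```shell
# Hypothetical helper (not part of cdot): returns 0 if the file is an intact
# gzip archive whose contents parse as JSON, 1 if it is not valid gzip,
# and 2 if it decompresses but is not valid JSON.
check_json_gz() {
    local path="$1"

    # gzip -t tests archive integrity without writing any output
    if ! gzip -t "$path" 2>/dev/null; then
        echo "not a valid gzip file: $path" >&2
        return 1
    fi

    # python3 -m json.tool parses stdin as JSON and exits non-zero on failure
    if ! gzip -dc "$path" | python3 -m json.tool > /dev/null 2>&1; then
        echo "not valid JSON: $path" >&2
        return 2
    fi

    echo "OK: $path"
}
```

With the variables from the example above still set, this could be run as `check_json_gz "${FILENAME}.json.gz"`. Note that `json.tool` reads the whole document into memory, so the check may take a while on a full genome annotation.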