
Human reference files from 1000 genomes VCFs

Keiran Raine edited this page Nov 9, 2020 · 5 revisions

Original SNP based reference data

The original data for SnpGcCorrection.tsv was generated using the SNP loci from the Affymetrix SNP 6.0 array. This initial set has no chrY SNPs and can be downloaded here. Please see the associated README.txt for more details.

An updated version using the 1000 Genomes phase 3 SNPs (including the Y chromosome) is available here.

1000 genome SNP panel generation

This describes the method used to generate the currently recommended SnpGcCorrection.tsv reference file:

$ export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
$ curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];' \
> SnpPositions_GRCh37_1000g.tsv

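Alternate version, reordered so that the MAF filter runs before the distance check. In both versions `$l_c` and `$l_p` are updated only after a record is printed, so the distance check is always applied against the last retained event: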

$ export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
$ curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
 perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];' \
> SnpPositions_GRCh37_1000g.tsv
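
To see how the two orderings behave, both one-liners can be run on a few hand-made records. This is a minimal sketch: the function names (`toy_records`, `distance_then_maf`, `maf_then_distance`) and the record values are invented for illustration and are not real 1000 Genomes data. The Perl code is copied unchanged from the commands above.

```shell
# Toy tab-separated records in VCF column order:
# CHROM POS ID REF ALT QUAL FILTER INFO (values are made up for illustration).
toy_records() {
  printf '1\t1000\trsA\tA\tG\t.\t.\tTSA=SNV;E_Multiple_observations;MAF=0.20\n'
  printf '1\t1900\trsB\tA\tG\t.\t.\tTSA=SNV;E_Multiple_observations;MAF=0.01\n'
  printf '1\t2200\trsC\tA\tG\t.\t.\tTSA=SNV;E_Multiple_observations;MAF=0.30\n'
  printf '1\t2900\trsD\tA\tG\t.\t.\tTSA=SNV;E_Multiple_observations;MAF=0.40\n'
}

# First ordering: chromosome, distance, then MAF.
distance_then_maf() {
  perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];'
}

# Alternate ordering: chromosome, MAF, then distance.
maf_then_distance() {
  perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];'
}

out_first=$(toy_records | distance_then_maf)
out_alt=$(toy_records | maf_then_distance)

# Show the retained records, then report whether the two orderings agree.
printf '%s\n' "$out_first"
if [ "$out_first" = "$out_alt" ]; then
  echo "identical output"
else
  echo "outputs differ"
fi
```

On these records both orderings keep rsA and rsC. rsB is dropped (by the distance check in the first ordering, by the MAF check in the alternate) without moving the reference position, so rsC is still compared against rsA. rsD is dropped because it lies within 1000 bases of rsC.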

This example filters on:

| Field or info tag | Description | Value | Correlates to |
|---|---|---|---|
| INFO.E_Multiple_observations | SNP has evidence from multiple sources | presence | `zgrep -F 'E_Multiple_observations'` |
| INFO.TSA | Type of sequence alteration | SNV | `grep -F 'TSA=SNV'` |
| CHROM | Chromosome/contig | completely numeric and X/Y | `next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/);` |
| POS | Minimum distance between events | 1000 | `next if($F[0] eq $l_c && $F[1]-1000 < $l_p);` |
| INFO.MAF | Minor Allele Fraction, in 1000 genomes this is the fraction of donors exhibiting the allele | 0.05 | `$F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05);` |

To aid understanding, the table rows follow the order in which the first command applies its filters; the alternate command applies the MAF filter before the distance check.
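
The spacing filter can be spot-checked on the resulting file. The sketch below assumes the three-column, tab-separated layout written by the `printf` in the commands above (ID, chromosome, position). The helper name `count_spacing_violations` is hypothetical and not part of any tool referenced on this page.

```shell
# Hypothetical helper: count rows that lie closer than MIN_DIST bases to the
# previous row on the same chromosome.
# Usage: count_spacing_violations FILE [MIN_DIST]   (MIN_DIST defaults to 1000)
count_spacing_violations() {
  awk -F '\t' -v min_dist="${2:-1000}" '
    # A row violates the spacing if it shares a chromosome with the previous
    # row and is fewer than min_dist bases downstream of it.
    $2 == last_chr && ($3 - last_pos) < min_dist { bad++ }
    { last_chr = $2; last_pos = $3 }
    END { print bad + 0 }
  ' "$1"
}

# Example (uncomment once the file has been generated):
# count_spacing_violations SnpPositions_GRCh37_1000g.tsv
```

A result of 0 is what the distance filter should produce. Anything else would suggest the input was not position-sorted within each chromosome.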

Once complete (this takes a few minutes), see Convert SnpPositions.tsv to SnpGcCorrections.tsv.