Human reference files from 1000 genomes VCFs
Keiran Raine edited this page Nov 9, 2020 · 5 revisions
The original data for SnpGcCorrection.tsv was generated using the SNP loci from the Affymetrix SNP 6.0 array. This initial set has no chrY SNPs and can be downloaded here. Please see the associated README.txt for more details.
An updated version using the 1000 Genomes phase 3 SNPs (including the Y chromosome) is available here.
This page describes the method used to generate the currently recommended SnpGcCorrection.tsv reference file:
$ export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
$ curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];' \
> SnpPositions_GRCh37_1000g.tsv
An alternative, possibly better, form (reordered so that the MAF filter runs first and the distance check is only applied against a retained event):
$ export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
$ curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];' \
> SnpPositions_GRCh37_1000g.tsv
This example filters on:
Field or info tag | Description | Value | Correlates to |
---|---|---|---|
INFO.E_Multiple_observations | SNP has evidence from multiple sources | presence | zgrep -F 'E_Multiple_observations' |
INFO.TSA | Type of sequence alteration | SNV | grep -F 'TSA=SNV' |
CHROM | Chromosome/contig | completely numeric and X/Y | next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); |
POS | Minimum distance between retained events | 1000 bp | next if($F[0] eq $l_c && $F[1]-1000 < $l_p); |
INFO.MAF | Minor Allele Fraction, in 1000 genomes this is the fraction of donors exhibiting the allele | at least 0.05 | $F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05); |
To aid understanding, the rows of the table above follow the order in which the first command applies its filters.
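For readers who find the one-liner hard to follow, the same filters can be spelled out step by step. The block below is a minimal sketch in awk, not the command used to build the published file. The function name `filter_snps` and its two optional arguments are hypothetical. It also makes two assumptions that differ slightly from the commands above: the evidence and TSA tags are looked up in the INFO column only, and records with no `MAF=` entry are dropped.

```shell
# filter_snps [MIN_MAF] [MIN_DIST]
# Hypothetical helper (illustration only): reads uncompressed VCF on stdin and
# writes ID<TAB>CHROM<TAB>POS for every record that passes all filters.
filter_snps() {
  awk -F'\t' -v min_maf="${1:-0.05}" -v min_dist="${2:-1000}" '
    /^#/ { next }                                        # skip VCF header lines
    index($8, "E_Multiple_observations") == 0 { next }   # evidence from multiple sources
    index($8, "TSA=SNV") == 0 { next }                   # single nucleotide variants only
    $1 !~ /^([0-9]+|X|Y)$/ { next }                      # numeric chromosomes plus X and Y
    {
      # pull the MAF value out of the semicolon-separated INFO column
      maf = ""
      n = split($8, info, ";")
      for (i = 1; i <= n; i++)
        if (info[i] ~ /^MAF=/) maf = substr(info[i], 5)
      if (maf == "" || maf + 0 < min_maf) next           # assumption: missing MAF is dropped

      # distance check against the last retained SNP on the same chromosome
      if (($1 "") == chr && $2 - pos < min_dist) next

      printf "%s\t%s\t%d\n", $3, $1, $2
      chr = $1 ""; pos = $2                              # remember the retained event
    }
  '
}

# Possible usage, assuming TG_DATA is exported as shown earlier:
#   curl -sSL "$TG_DATA" | gzip -dc | filter_snps > SnpPositions_GRCh37_1000g.tsv
```

Because the last-seen chromosome and position are updated only when a record is printed, the distance check is always made against a retained event, whichever filter runs first.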
Once complete (this takes a few minutes), see Convert SnpPositions.tsv to SnpGcCorrections.tsv.
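Before moving on to the conversion, it can be worth checking that the positions file obeys the chromosome and spacing rules described above. The following is a rough sketch only. The helper name `check_spacing` is hypothetical, and it assumes the three-column layout (ID, chromosome, position) written by the commands on this page.

```shell
# check_spacing FILE [MIN_DIST]
# Hypothetical helper (illustration only): prints the number of rows that are
# on an unexpected contig or closer than MIN_DIST to the previous row on the
# same chromosome. A result of 0 means the file passes both checks.
check_spacing() {
  awk -F'\t' -v min="${2:-1000}" '
    $2 !~ /^([0-9]+|X|Y)$/ { bad++; next }               # contig other than 1..N, X, Y
    ($2 "") == chr && $3 - pos < min { bad++ }           # too close to the previous row
    { chr = $2 ""; pos = $3 }
    END { print bad + 0 }
  ' "$1"
}

# Possible usage:
#   check_spacing SnpPositions_GRCh37_1000g.tsv
```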