This command-line software enables summary statistics imputation (SSimp) for GWAS summary statistics.
The only input needed from the user are the GWAS summary statistics and a reference panel (e.g. 1000 genomes, needed for LD computation).
Important: along with the SSIMP installation, you will also need to download a file (a database with all positions on different builds) into your ~/reference_panel/
folder.
If this is the first time that you download SSIMP, create a folder typing the following in your terminal:
cd ~
mkdir reference_panels
wget https://drive.switch.ch/index.php/s/uOyjAtdvYjxxwZd/download -O ~/reference_panels/database.of.builds.1kg.uk10k.hrc.2018.01.18.bin
If you had previous versions of SSIMP installed, you NEED to redownload the database.
wget https://drive.switch.ch/index.php/s/uOyjAtdvYjxxwZd/download -O ~/reference_panels/database.of.builds.1kg.uk10k.hrc.2018.01.18.bin
You may get assertion errors when running this, even after following the Installation Instructions. Please also check the Bug Reports section below. Here are a couple of things you can check:
- the size of the build database should be
1'789'839'360
bytes:~/reference_panels/database.of.builds.1kg.uk10k.hrc.2018.01.18.bin
. If not, see Bug Reports below on how to download an updated version - The simplest way to call this is with:
ssimp gwas/small.random.txt 1KG/EUR output.txt
. Otherwise, you can select the AFR population with:ssimp gwas/small.random.txt ~/reference_panels/1000genomes/ALL.chr{CHRM}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz output.txt --sample.names ~/reference_panels/1000genomes/integrated_call_samples_v3.20130502.ALL.panel/sample/super_pop=AFR
. In either case, if you wish to impute only a particular chromosome, you should use--impute.range chr17
. Do not replace theALL.chr{CHRM}
withALL.chr17
, as older versions of ssimp (<=0.4) will fail on this. The best way to restrict to one chromosome is via--impute.range ...
.
We recommend to download the full folder. This will give you access to all toy examples.
(1) Download
wget https://github.com/sinarueeger/ssimp_software/archive/master.zip
unzip master.zip
(2) Access the folder
cd ssimp_software-master
(3) Rename the binary
Depending on what OS you are on:
cp compiled/ssimp-linux-0.4 ssimp
cp compiled/ssimp-osx-0.4 ssimp
Linux - static version
MacOS - dynamic version. You need to install GSL 1.16 (GNU Scientific Library) from here: http://ftp.gnu.org/gnu/gsl/.
Check your gcc version with gcc --version
. It needs to be 5.0.0 or higher.
Mac user? You need to install GSL: brew install gsl
.
(1) Download the zip file and unpack the zip file (or clone the github folder)
wget https://github.com/sinarueeger/ssimp_software/archive/master.zip
unzip master.zip
(2) Access the folder
cd ssimp_software-master
(3) Run the makefile (source compilation)
make
This will create a file bin/ssimp
.
cp bin/ssimp ssimp
(4) Run your first summary statistics imputation on a test file (uses a small toy reference panel)
ssimp --gwas gwas/small.random.txt --ref ref/small.vcf.sample.vcf.gz --out output.txt
(5) To run your own summary statistics imputation, you need to have a reference panel ready (in the folder reference_panels
in your home directory). Check out Download 1000 genomes reference panel below to download.
While the github folder itself is rather small (136 MB), the folder reference_panels
takes up about 17 GB storage on the hard disk when 1KG is installed.
When running, the software uses about 5 GB of memory. Make sure you have enough RAM when running the software in parallel!
It takes roughly ~200 CPU hours to impute to full genome (~20M variants) with 500 individuals from 1000 Genomes reference panel. Runtime depends on the number of variants imputed and the size of the reference panel.
ssimp --gwas gwas/small.random.txt --ref ref/small.vcf.sample.vcf.gz --out output.txt
will impute the Z-statistics, using a selected reference panel (see section above) and generate a file output.txt
. The txt file assigned to --gwas
contains at least the following columns: SNP-id, reference allele, risk allele, Z-statistic, and at least one row. The imputed summary statistics are stored in output.txt
.
Run ssimp
with no arguments to see the usage message.
Check out examples.
We also provide a detailed manual that contains information not present in the usage message.
Important! Reference panels provided in folder ref
are toy reference panels for testing and examples and not made to use for proper usage.
By running ssimp
with a special argument - 1KG/SUPER-POP
- assigned to the reference panel option, it will automatically download 1000 genomes reference panel and use the specified super population for imputation.
Make sure that:
- you have
wget
installed and - no folder
reference_panels
exists.
For example, if we want 1000 genomes to be downloaded and EUR population used for imputation, we type:
ssimp gwas/small.random.txt 1KG/EUR output.txt
It will then download 1000 genomes reference panel and a file that enables fast imputation of single SNPs.
When using ssimp
you can then pass this directory name to 'ssimp', and specify that only
as subset of individuals (here AFR) should be used:
ssimp gwas/small.random.txt ~/reference_panels/1000genomes/ALL.chr{CHRM}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz output.txt --sample.names ~/reference_panels/1000genomes/integrated_call_samples_v3.20130502.ALL.panel/sample/super_pop=AFR
More info on handling reference panel data can be found starting from line 49 in usage message.
stu @all.tests
or check this file.
- Follow instructions in this file.
- Update VERSION file.
Aaron McDaid (implementation, method development)
Sina Rüeger (coordination, method development)
Zoltan Kutalik (supervision, method development)
We have used code (with permission, and under the GPL) from libStatGen and stu.
If you use this software in your scientific research, please consider citing
Rüeger, S., McDaid, A., Kutalik, Z. (2017). Evaluation and application of summary statistic imputation to discover new height-associated loci. PLOS Genetics. https://doi.org/10.1371/journal.pgen.1007371
Rüeger, S., McDaid, A., Kutalik, Z. (2017). Improved imputation of summary statistics for realistic settings. bioRxiv. https://doi.org/10.1101/203927
Copyright 2018. (Aaron McDaid, Sina Rüeger, Zoltán Kutalik, https://github.com/zkutalik/ssimp_software)
Report bugs here: https://github.com/sinarueeger/ssimp_software/issues
If you are having trouble with SSIMP versions >= 0.4, make sure that you delete your database in folder ~/reference_panels/database*
and redownload it using the following command
wget https://drive.switch.ch/index.php/s/uOyjAtdvYjxxwZd/download -O ~/reference_panels/database.of.builds.1kg.uk10k.hrc.2018.01.18.bin