LDServer is a fast implementation of various metrics of linkage disequilibrium (LD) between genetic variants.
Features:
- Fast: calculating LD in a typical window (500kb) and on typical VCFs (10,000's of samples) will return in under 3-4 seconds.
- API: serves statistics over a REST API, suitable for web applications (used by LocusZoom).
- Caching: statistics are cached by Redis; common queries will return in under 1 second.
- Paging: iterate over pages of results, rather than processing immediately all at once.
This project contains multiple components that work together to provide these features:
- LDServer C++ shared library, wrapped by boost python.
- Flask apps for calculating and serving the statistics in various ways:
- A standard
rest
app for serving r2, D', and covariance between genetic markers. This app is typically used by web applications that display LD between variants. - A
playground
app to help developers get started with the various APIs. - A
raremetal
app for serving covariance in addition to score statistics for calculating aggregation tests of rare genetic variants.
- A standard
- LDServer
- Documentation
- Installation
- Updating
- Configuring & running the flask apps
To run LDServer in production, we recommend using docker-compose. We provide a base configuration docker-compose.yml
that specifies the core services.
To customize the config to your environment, either create a docker-compose.override.yml file, or copy the base docker-compose.yml
to a new file to edit, for example docker-compose.prod.yml
.
Using an override file is convenient because you can continue using docker-compose <command>
as you normally would. Using a new compose file, such as docker-compose.prod.yml
requires an additional flag upon each invocation: docker-compose -f docker-compose.prod.yml <command>
.
To begin, first checkout a copy of the repository with git clone:
git clone https://github.com/statgen/LDServer.git
cd LDServer
Create a .env
file to specify docker configuration settings:
LDSERVER_PORT=4546
LDSERVER_CONFIG_SCRIPT=/home/ldserver/startup.sh
LDSERVER_WORKERS=4
RAREMETAL_CONFIG_DATA=var/config.yaml
RAREMETAL_WORKERS=4
RAREMETAL_PORT=4545
OMP_NUM_THREADS=1
OPENBLAS_NUM_THREADS=1
LDSERVER_PORT
is the exposed port that the ldserver app will run on. You will likely want to put apache or nginx in front of the server as an HTTP proxy.LDSERVER_CONFIG_SCRIPT
is the path inside the container to the startup script. Usually you will create this script on your local filesystem and mount it into the container with a volume directive (see the exampledocker-compose.override.yml
file below.)LDSERVER_WORKERS
is the number of asynchronous workers to be started to serve ldserver requests.OMP_NUM_THREADS
andOPENBLAS_NUM_THREADS
control the number of threads that will be used to compute LD. Only affects the ldserver flask app, not raremetal.RAREMETAL_CONFIG_DATA
is the path inside the container to the raremetal app's data config yaml, which specifies all of the datasets to load. Similarly to theLDSERVER_CONFIG_SCRIPT
, this should be mounted into the container in thedocker-compose.override.yml
file.RAREMETAL_WORKERS
seeLDSERVER_WORKERS
RAREMETAL_PORT
seeLDSERVER_PORT
Create a docker-compose.override.yml
file where you can provide your own additional docker configuration. For
example:
version: '3'
services:
ldserver:
build:
args:
UID: 1000
GID: 1001
volumes:
- /opt/ld/ld_panels:/data/ld_panels
- /opt/ld/logs:/data/logs
- /opt/ld/config.py:/home/ldserver/rest/instance/config.py
- /opt/ld/startup.sh:/home/ldserver/startup.sh
restart: "unless-stopped"
command: >
/bin/bash -c "
source $$LDSERVER_CONFIG_SCRIPT &&
gunicorn -b 0.0.0.0 -w $$LDSERVER_WORKERS -k gevent \
--access-logfile /data/logs/gunicorn.access.log \
--error-logfile /data/logs/gunicorn.error.log \
--pythonpath rest 'ldserver:create_app()'"
raremetal:
build:
args:
UID: 1000
GID: 1001
volumes:
- /opt/raremetal:/home/ldserver/var
- /opt/raremetal/config.py:/home/ldserver/rest/instance/config.py
restart: "unless-stopped"
redis:
volumes:
- /opt/ld/redis_cache:/data/redis/redis_cache
- /opt/ld/logs:/data/redis/logs
- /opt/ld/redis.conf:/usr/local/etc/redis/redis.conf
command: "/usr/local/etc/redis/redis.conf"
restart: "unless-stopped"
In the above example, we overrode a few pieces of the various services.
For the ldserver
service:
- We overrode the command that is executed when the container first runs to pass additional arguments to gunicorn, which starts the
ldserver
flask app. In this case we've added new logfile arguments--access-logfile
and--error-logfile
. - We specified a number of directories and files to mount within the container using
volumes
. The syntax isHOST_PATH:CONTAINER_PATH
. For example, the file/opt/ld/startup.sh
on the host machine will appear as/home/ldserver/startup.sh
within the ldserver container when it runs.
For the raremetal
service:
- We're also mounting a directory
/opt/raremetal
and config file/opt/raremetal/config.py
from the host when the container runs. Volumes have the format/path/to/data
:/path/to/mount/in/container
.
For both services:
- We've turned on
restart: "unless-stopped"
. This means the services will be restarted if they fail, or if the docker daemon is restarted. They will not restart if they are manually stopped. - We decided to override the UID and GID of the user inside the docker image. It is important to specify UID/GID for both the ldserver and raremetal images to avoid cache misses during the docker build process. WARNING: do not specify a very large UID or GID, or the build process will take a long time due to a known bug in docker.
You must supply a slightly modified redis.conf
for running under docker. You can grab a copy of the default redis.conf
to modify by going to https://redis.io/topics/config (we are currently using redis 5.0).
It is a good idea to customize redis to control memory usage, and path to where the periodic database saves are done. The docker-compose.override.yml
file above shows to how to mount your own redis.conf
into the redis service container when it runs.
The following settings should be set in your redis.conf
:
bind 0.0.0.0
port 6379
logfile "/data/redis/logs/redis.log"
save 14400 1
dbfilename ld.rdb
dir /data/redis/redis_cache/
maxmemory 4g
maxmemory-policy allkeys-lru
bind 0.0.0.0
is needed to allow redis to connect to other docker containers (such as the ldserver or raremetal services).
Change maxmemory
to something sensible considering the amount of available RAM on your server.
There is a default config file that can be used as a starting place in rest/config/default.py
. It looks like:
SQLALCHEMY_DATABASE_URI = 'sqlite:///sql.db'
SQLALCHEMY_TRACK_MODIFICATIONS = False
PROXY_PASS = None # set if Apache mod_proxy is used e.g. http://my.host.com/prefix/
API_MAX_PAGE_SIZE = 100000
API_MAX_REGION_SIZE = 4000000
API_MAX_COV_REGION_SIZE = 1000000
SEGMENT_SIZE_BP = 1000
CACHE_ENABLED = True
CACHE_REDIS_HOSTNAME = '127.0.0.1'
CACHE_REDIS_PORT = 6379
GZIP_COMPRESSION = True # enable build-in response compression if for any reason it was not possible to enable it through Apache
API_BASE_URL = 'http://127.0.0.1:5000' # specify the base URL address of the LD server. It is used by the Playground.
SENTRY_DSN = None # include your Sentry DSN for error reporting to your own Sentry instance
SENTRY_ENV = None # name of your deployment - usually 'production' or 'staging' or 'travis'
You can copy this file and customize it. It should be placed in a location you have specified in your docker-compose.override.yml
(see above), for example: /opt/ld/config.py
or /opt/raremetal/config.py
.
CACHE_REDIS_HOSTNAME
to redis
.
If you have an Apache proxy forwarding requests, you should specify PROXY_PASS
. For example, let's say your server has Apache setup to do:
- https://myserver.com/ld/ -> http://localhost:4546
- https://myserver.com/raremetal/ -> http://localhost:4545
Then, in your ldserver's config.py, specify PROXYPASS = 'https://myserver.com/ld/'
. And so forth for the raremetal server as well.
Mount a configuration script into the container by changing:
volumes:
- XXX:/home/ldserver/startup.sh
The XXX
is a path on your local filesystem to a script that contains commands for adding datasets to the ldserver. See the Configuring the ldserver app section for commands to use for adding reference genotype files. If desired, you can modify the name of this script, but you must also modify the LDSERVER_CONFIG_SCRIPT
variable to refer to the new path (as it would appear inside the container.)
You will also need to mount your data into the container. As an example, the line /opt/ld/ld_panels:/data/ld_panels
mounts a directory of data /opt/ld/ld_panels
on your local filesystem into the container at /data/ld_panels
.
An example startup.sh
looks like:
#!/bin/bash
# If the database already exists, then we don't want to run the commands below or they will fail.
# This can happen when the container is being restarted, instead of a clean down/up. The volume
# persists, and therefore so does the database. Remember to always do `docker-compose down && docker-compose up -d`.
if [[ ! -f "rest/ldserver/sql.db" ]]; then
echo "Running startup flask add commands..."
flask add-reference 1000G "1000 Genomes Project Phase 3" GRCh37 /data/ld_panels/sav/1000G_GRCh37/ALL.samples.txt /data/ld_panels/sav/1000G_GRCh37/1000G_phase3_GRCh37_ALL_chr*sav
flask add-reference 1000G "1000 Genomes Project Phase 3" GRCh38 /data/ld_panels/sav/1000G_GRCh38/ALL.samples.txt /data/ld_panels/sav/1000G_GRCh38/1000G_phase3_GRCh38_ALL_chr*sav
flask create-population GRCh37 1000G AFR /data/ld_panels/sav/1000G_GRCh37/AFR.samples.txt
flask create-population GRCh37 1000G EUR /data/ld_panels/sav/1000G_GRCh37/EUR.samples.txt
else
echo "Database file already existed (docker container restart?), skipping flask add commands..."
fi
The raremetal app uses a YAML file for configuration. See the raremetal app section "YAML: Adding multiple datasets with one YAML config file" for more information. Map one into the container by modifying:
volumes:
- XXX:/home/ldserver/var
Remember that when specifying paths in the YAML file, you must specify paths as they would appear in the container. In the example above, we have mapped in our data files into the directory /home/ldserver/var
, and therefore the paths in var/test.yaml
should be var/data_file.txt
, etc. For example:
genotypes:
- id: 1
name: "1000G"
description: "1000G chr22 Testing VCF"
filepath: "var/test.vcf.gz"
genome_build: "GRCh37"
Assuming you have created your own docker-compose.override.yml
file for production, first build the docker images using:
bin/docker_build_compose.sh
You don't have to use the above script, and could instead directly run docker-compose build
, but it does set a few extra labels on the resulting docker image that can be useful.
You can also customize your build with your own script, which can be useful for setting various settings, such as parallel build CPUs:
#!/bin/bash
LDSERVER_VERSION=`git describe --tags --abbrev=11 | sed 's/^v//' | sed 's/-g/-/'`
GIT_SHA=`git rev-parse HEAD`
BUILD_DATE=`date -u +'%Y-%m-%dT%H:%M:%SZ'`
docker-compose build --pull \
--build-arg MAKEFLAGS="-j 9" \
--build-arg CMAKE_BUILD_PARALLEL_LEVEL=9 \
--build-arg BUILD_DATE=${BUILD_DATE} \
--build-arg GIT_SHA=${GIT_SHA} \
--build-arg LDSERVER_VERSION=${LDSERVER_VERSION} "$@"
In the example above, we added some extra args (MAKEFLAGS
and CMAKE_BUILD_PARALLEL_LEVEL
) to specify 9 CPUs should be used in parallel when compiling the C++ components.
Now, you can run all services once the build is finished using:
# -d runs services detached in background
docker-compose up -d
If you instead chose to use a copied production yaml file, you would do the following. Remember to include this -f
flag in all docker-compose commands. We will not include it from here onwards.
docker-compose -f docker-compose.prod.yml up -d
To debug the container, the following command is very useful. It will start the raremetal service, using all of your configuration values from the docker-compose file, and drop you into a bash shell as root. From there, you can run the services manually to debug, and install any additional packages that could be useful. Obviously, you should not use this for production, as the container is not meant to be run as root.
docker-compose run -u root ldserver bash
If you've already started services, replace run
with exec
to jump into the already running container.
Before installing, the following software/packages are required:
- CMake 3.14+
- gcc 5.4+
- Python 3.8+
- A BLAS implementation, such as OpenBLAS or Intel MKL
The following software is optional:
- Redis
Most Linux distributions provide these packages already. For example, in Ubuntu:
sudo apt-get update && apt install \
build-essential \
curl \
cmake \
python3 \
python3-dev \
python3-distutils \
python3-setuptools \
python3-pip \
python3-wheel \
zlib1g-dev \
liblzma-dev \
libopenblas-base \
libopenblas-dev \
liblapack-dev \
libarpack2 \
libarpack2-dev \
redis \
locales \
git \
pkg-config
Now follow these steps to complete installation:
-
Clone repository.
git clone https://github.com/statgen/LDServer cd LDServer
-
Compile and install.
virtualenv -p python3 venv && source venv/bin/activate && pip install cget && cget install core
-
Optional: Test the LDServer C++ shared library. The '127.0.0.1' and '8888' arguments are the hostname and port for Redis server which will be started during unit tests. If you wish to run Redis server on another port, then change these values correspondingly.
cd cget/test/ && ./testAll 127.0.0.1 8888 && cd ../../
-
Install required python packages for the flask apps.
# Activate the environment source venv/bin/activate # Install required packages pip install -r rest/requirements.txt
-
Optional: Run the flask app test suite.
cd rest python -m pytest
-
Configure REST API by copying the config file
rest/config/default.py
torest/instance/config.py
and then modify values as needed. Modifyingdefault.py
is not recommended as it will be overwritten when pulling updates from github.
Required packages (gcc, cmake, etc.) can be installed using homebrew.
We recommend however using pyenv to install the required python 3.8 interpreter and development headers/lib. The reason is that homebrew does not allow installing a particular version of python, and its python is also not meant for users (it is meant for homebrew to compile its own packages against.)
On MacOS, however, pyenv only compiles libpython as a static lib. Unfortunately, to avoid problems during linking on Mac, libpython must be compiled and linked in as a shared lib. The following is an example of installing python 3.8 with the necessary shared lib:
env PYTHON_CONFIGURE_OPTS="--enable-shared" SDKROOT=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk MACOSX_DEPLOYMENT_TARGET=10.15 pyenv install 3.8.9
Update your files using git pull
, then recompile the core server by doing cget install --update core
.
This app serves r2, D', and covariance of genetic variants over a REST API.
- Execute
flask add-reference
command:For example, the below command adds a new reference panel namedcd rest export FLASK_APP=ldserver flask add-reference <name> <description> <genome build> <samples file> <genotype files>
1000G_GRCh37
. Thesamples.txt
file stores list of sample names inALL.chr*.bcf
files.The genotypes can be stored in VCF, BCF, and SAV formats. For better runtime performance and more compact storage, we highly recommend using SAV format.flask add-reference 1000G "1000 Genomes Project Phase 3" GRCh37 samples.txt ALL.chr*.bcf
-
To define populations in the reference, execute
flask create-population
command:cd rest export FLASK_APP=ldserver flask create-population <genome build> <reference name> <population name> <samples file>
For example, the below command defines
AFR
population in the1000G_GRCh37
reference. Thesamples.txt
file stores list of sample names that will be included intoAFR
.flask create-population GRCh37 1000G AFR samples.txt
-
Run the following command to list all references that are currently loaded into the server:
flask show-references
-
Run the following command to list all genotype files loaded for a specified reference:
flask show-genotypes <genome build> <reference name>
Example:
flask show-genotypes GRCh37 1000G
-
Run the following command to list all sample names from a specified population in a specified reference:
flask show-population <genome build> <reference name> <population name>
Example:
flask show-population GRCh37 1000G EUR
-
Start Redis cache (provide configuration parameters depending on your needs):
cd cget/bin/ ./redis-server
-
Start LD server API:
cd rest export FLASK_APP=ldserver flask run
Or with
gunicorn
:gunicorn -b 127.0.0.1:[port] -w [n workers] -k gevent --pythonpath rest "ldserver:create_app()"
In production you are advised to run with
gunicorn
, as it provides a number of asynchronous workers to handle requests. Usingflask run
as above starts only a single synchronous worker, which can be blocked by long running requests.
This app provides a simple developer sandbox for learning the REST API.
cd rest
export FLASK_APP=playground
flask run [ --port ... ]
Or with gunicorn
:
gunicorn -b 127.0.0.1:[port] -w [n workers] -k gevent --pythonpath rest "playground:create_app()"
This app serves a more specialized REST API for calculating aggregation tests of rare genetic variants. It provides covariance between variants, in addition to score statistics (summary statistics of the association between genetic variants and phenotypes.)
These statistics can be used by the raremetal.js package to perform the aggregation tests. That package also provides documentation on the methodology.
Note: at times, we may refer to this also as the raremetal server.
The server has 2 required types of data:
- Genotype files (VCF, BCF, or Savvy format)
- Phenotype files (TAB-delimited, or PED/DAT)
There is one optional type of data, which are mask files that group together variants and assign them to a particular gene or region. This data is optional if you wish to send your definitions of masks directly to the server via the covariance API request.
Genotype datasets may be in VCF, BCF, or Savvy formats.
VCF files must be bgzipped and tabixed. In order to create a tabix-index for your VCF, it must be bgzipped first. Gzip is not sufficient. The bgzip
program is included with tabix
typically.
These two programs can be installed by downloading htslib and compiling. They can also be installed via Ubuntu package: sudo apt-get install tabix
, though this package may not be up to date at times depending on your Ubuntu version.
Phenotype datasets are files containing a number of phenotypes (such as BMI, fasting glucose, heart rate, etc.) collected on a set of samples. Each row is a sample/individual, and each column is a phenotype.
There are two common formats for phenotype files - PED is probably the most common, but it is also common to use simple tab-delimited files. We support both formats currently.
PED format is described both on our site and by PLINK.
For a tab-delimited file, the format is simply one header row denoting the column names, followed by a row for each individual. The first column is assumed to be the sample IDs. For example:
fid iid patid matid sex rand_binary rand_qt
HG00096 HG00096 0 0 female NA 0.283212
HG00097 HG00097 0 0 male 1.0 0.248001
HG00099 HG00099 0 0 male 0.0 0.691624
HG00100 HG00100 0 0 male 0.0 0.284674
HG00101 HG00101 0 0 male 1.0 0.494104
Categorical variables such as sex
can be given labels as above. rand_binary
and rand_qt
are randomly generated phenotypes for this example.
Missing values should be encoded as NA
.
A mask file maps genetic variants to "groups", which are typically genes (though they could also be arbitrary genomic regions.) Typically mask files are created using variant filters, such as "allele frequency < 0.01" or "protein truncating variants AND loss-of-function variants".
The mask file is tab-delimited, and has one "group" per row. For example:
ZBED4 22 50276999 50283712 22:50276999_C/T 22:50277035_C/T 22:50277046_A/C 22:50277051_C/T 22:50277087_T/C 22:50277093_G/A
ALG12 22 50293950 50312062 22:50293950_A/G 22:50293955_G/T 22:50294146_C/T 22:50294162_C/G 22:50294175_G/A 22:50294301_G/A
Each row begins with 4 columns:
- The group name (in this case they are genes)
- Chromosome
- Start position of the group
- End position of the group
The remaining values on each row are the variants that are assigned to the group. They should also be tab-delimited.
The mask file must then be bgzipped, and tabix indexed. An easy way to do this is:
bgzip my_mask_file.tab
tabix -s 2 -b 3 -e 4 my_mask_file.tab.gz
Some studies are not able or willing to share their genotypes and/or phenotype data. In such cases, they provide summary statistics instead, which can be used to run various association analyses or meta-analysis.
The LDServer can directly serve these summary statistics (score statistics, covariance matrices) rather than computing them on the fly.
Common programs for computing summary statistics as a frame of reference:
Both programs produce two files, one containing score statistics for single variants, the other file containing the covariance of the score statistics. You can add these files to a Summary statistic record under the score_path
and cov_path
keys.
Before running the following commands, you should set the appropriate flask app by doing:
export FLASK_APP="rest/raremetal"
To add datasets to the server, we recommend using the YAML config approach. This allows you to specify all of your datasets in a single file, and to maintain them across server restarts.
You can also add datasets via CLI commands for quick testing, but we do not recommend this for production usage.
Datasets may be specified in a YAML file, which provides additional features over the CLI interface:
- Set your own dataset IDs. This allows you to maintain consistency on IDs when reloading the database.
- Specify columns that are not meant to be used for analysis. These columns will not show up in metadata queries.
- Specify longer text descriptions for each column.
An example YAML file is located in data/test.yaml
, and looks like:
genotypes:
- id: 1
name: "1000G"
description: "1000G chr22 Testing VCF"
filepath: "data/chr22.test.vcf.gz"
genome_build: "GRCh37"
phenotypes:
- id: 1
name: "1000G random phenotypes"
description: "An example set of randomly generated phenotypes for 1000G"
genotypes: 1
filepath: "data/chr22.test.tab"
delim: "\t"
columns:
iid:
column_type: "TEXT"
sample_column: true
sex:
column_type: "CATEGORICAL"
for_analysis: false
rand_binary:
column_type: "CATEGORICAL"
description: "A random binary phenotype"
rand_qt:
column_type: "FLOAT"
description: "A random quantitative phenotype"
- id: 2
name: "1000G random phenotypes II"
genotypes: 1
filepath: "data/chr22.more_phenotypes.test.ped"
description: "Adding a second set of phenotypes for 1000G"
delim: "\t"
columns:
ANOTHER_RAND_QT:
column_type: "FLOAT"
description: "Another random quantitative phenotype"
masks:
- id: 1
name: "AF < 0.01"
description: "Variants with allele frequency < 1%"
filepath: "data/mask.epacts.chr22.gencode-exons-AF01.tab.gz"
genome_build: "GRCh37"
genotypes: 1
group_type: "GENE"
identifier_type: "ENSEMBL"
- id: 2
name: "AF < 0.05"
description: "Variants with allele frequency < 5%"
filepath: "data/mask.epacts.chr22.gencode-exons-AF05.tab.gz"
genome_build: "GRCh37"
genotypes: 1
group_type: "GENE"
identifier_type: "ENSEMBL"
- id: 3
name: "All exomic variants"
description: "All variants falling within exomic regions in TOPMed"
summary_stats: 1
genome_build: "GRCh37"
group_type: "GENE"
identifier_type: "ENSEMBL"
summary_stats:
- id: 1
name: "TOPMed 70K Exomes - T2D"
description: "TOPMed 70K exomes score statistics and covariance matrices for T2D"
genome_build: "GRCh37"
score_path: "topmed.metascore.txt.gz"
cov_path: "topmed.metacov.txt.gz"
The file has 4 possible "blocks": genotypes
, phenotypes
, masks
, and summary-stats
.
Each record under genotypes
looks like:
- id: <genotype dataset ID you would like to use>
name: <short description of genotype dataset, typically a study name>
description: <long form description of dataset>
filepath: <path to VCF, BCF, or Savvy file>
genome_build: <genome build, e.g. GRCh37 or GRCh38>
Each record under phenotypes
looks like:
- id: <phenotype dataset ID you would like to use>
name: <short description of phenotypes>
description: <long description of phenotypes>
genotypes: <genotype dataset ID, i.e. which genotypes do these samples line up with?>
filepath: "data/chr22.test.tab"
delim: "\t"
columns:
<column name as it exists in header or DAT file for PED files>:
column_type: <type, can be "TEXT" or "CATEGORICAL" or "FLOAT"
sample_column: <true or false, is this column the sample ID column?>
for_analysis: <true or false, should this phenotype be used in analysis?>
description: <long form description of the phenotype>
Under columns
, the following keys are optional:
sample_column
: (if no columns specified as the sample ID column, we assume it is the first column in the file.)for_analysis
: all columns are assumed to be used for analysis unless specified otherwise.column_type
: if not specified, the type will be guessed by examining the phenotype file.
Also, specifying columns is entirely optional itself. Column names and types will be deduced if not provided.
For PED files, you do not need to specify anything about the first 5 columns (family ID, individual ID, paternal ID, maternal ID, sex) - they are part of the format and automatically handled. If you specify information for the remaining phenotypes in the file, note that they must match the phenotype as specified in the DAT file.
Each record under masks
looks like:
- id: <mask ID you would like to use>
name: <short description of mask>
description: <long description of mask>
filepath: "data/mask.epacts.chr22.gencode-exons-AF01.tab.gz"
genome_build: <genome build, e.g. GRCh37>
genotypes: <genotype dataset IDs containing the variants specified in this mask>
summary_stats: <summary stat dataset IDs>
group_type: <group type, can be "GENE" or "REGION">
identifier_type: <identifier type, currently only "ENSEMBL" supported>
A mask can be linked to any number of genotype datasets, or summary statistic datasets. For example:
- id: 1
name: "All possible protein truncating variants in the genome"
genotypes: [1, 2, 3]
summary_stats: [1, 4, 7]
...
Each record under summary_stats
looks like:
summary_stats:
- id: 1
name: "TOPMED-70K-EXOMES"
description: "TOPMed pre-computed covariance for 70K exomes"
genome_build: "GRCh37"
score_path: "topmed.metascore.txt.gz"
cov_path: "topmed.metacov.txt.gz"
This section is for pre-calculated meta-analysis summary statistics, generated by a program such as RAREMETALWORKER or rvtests.
For example, in rvtests: https://github.com/zhanxw/rvtests#meta-analysis-models
The score statistic (score_path
) and covariance matrix (cov_path
) files must both be tabix indexed. RAREMETALWORKER or rvtest should do this automatically for you (there will be a .tbi
tabix index file created for each score/cov file.)
If your files are split by chromosome, you can specify both paths as a glob with the *
character. For example:
score_path: "topmed.chr*.metascore.txt.gz"
cov_path: "topmed.chr*.metascore.txt.gz"
MetaSTAAR parquet format is also supported. Specify format: "METASTAAR"
in the record (see below for example.)
MetaSTAAR files are often in a hierarchy of directories, split up by genome chunks rather than by chromosome. You
can use the glob pattern **
to traverse across directories, allowing the server to index all parquet files.
Note that MetaSTAAR parquet format is experimental.
- id: 4
name: "MetaSTAAR TOPMed T2D"
description: "MetaSTAAR score/cov files for TOPMed freeze 9 T2D samples"
genome_build: "GRCh37"
score_path: "data/metastaar_t2d/**/segment*.metastaar.sumstat.parquet"
cov_path: "data/metastaar_t2d/**/segment*.metastaar.cov.parquet"
format: "METASTAAR"
The YAML file may have genotype, phenotype, mask, and summary-stats blocks specified in any order, however there may only be 1 block for each type.
Additionally, genotypes will always be processed first, followed by phenotypes, and finally mask files. This is because the latter two are dependent on having at least 1 genotype dataset available.
To add the datasets specified by the YAML file:
flask add-yaml <path/to/yamlfile>
To add the dataset to the server, use the add-genotypes
command:
flask add-genotypes <short label> <long description> <genome build> <VCF/BCF/Savvy file>
The parameters:
<short label>
is a short description of the dataset, often a study abbreviation.<long description>
a longer description that may include the genotyping platform or sequencing.<genome build>
is the genome build of the positions in the file.<VCF/BCF/Savvy file>
the file containing genotypes for variants over a number of samples. We recommend placing these files (or symbolic links) to these files in thedata/
directory under the application root. You may also provide a glob of files, in the event your genotypes are separated into files by chromosome.
Optional parameters:
--samples <file>
provide a file with a list of samples to use, if you do not wish to use all of the samples in the genotype file. One sample per line.
As an example:
gid=`flask add-genotypes "1000G" "1000G Test VCF" "GRCh37" data/chr*.test.vcf.gz`
With the command above, you can capture the genotype dataset ID that was assigned in the database, and use it in later commands.
To add a phenotype dataset:
flask add-phenotypes <short label> <long label> <path to ped or tab file> <genotype dataset ID>
The parameters:
<short label>
is a short description of the dataset, often a study abbreviation.<long description>
a longer description that may describe the set of phenotypes.<path to ped or tab file>
path to either the PED or tab file. If a PED file is supplied, it is assumed there is an accompanying DAT file. If supplying a tab-delimited file, it must end with a.tab
.<genotype dataset ID>
ID of the genotype dataset to connect this phenotype dataset with.
As an example:
flask add-phenotypes Test "Test 1" "data/chr22.test.tab" $gid
Where $gid
was set above when adding the genotype dataset for these phenotypes.
Each phenotype dataset added should correspond to a genotype dataset (and hence supplying the genotype dataset ID.) This allows the server to know which phenotypes are available (and valid) for each dataset.
To add your mask file to the server:
flask add-masks <name> <description> <path to mask file> <genome build> <genotype dataset ID> <group type> <identifier type>
The parameters:
<name>
the name of the mask which will be used in queries. This should ideally be short, for example "AF<0.01 & PTV & LoF".<description>
a longer description of what went into creating the mask. For example: "Allele frequency < 1% and all protein truncating & loss-of-function variants".<path to mask file>
path to the mask file. A tabix-index is assumed to reside next to the file. For example, the mask file ismask.tab.gz
and this is the file you would provide, there should also bemask.tab.gz.tbi
as well.<genome build>
genome build the positions are anchored to<genotype dataset ID>
provide the genotype dataset ID this mask corresponds to. Masks are typically generated for a specific set of variants provided in a genotype file.<group type>
can be "GENE", or "REGION"<identifier type>
what is the identifier for each group? Currently only "ENSEMBL" is supported.
For example:
flask add-masks "AF < 0.01" "Variants with alternate allele freq < 0.01" "data/mask.epacts.chr22.gencode-exons-AF01.tab.gz" "GRCh37" $gid "GENE" "ENSEMBL"
Where $gid
was the genotype dataset ID generated after using the add-genotypes
command earlier.
You can see the list of phenotypes, genotypes, and masks that have been given to the server using:
flask show-genotypes
flask show-phenotypes
flask show-masks
For quickly starting a server:
export FLASK_APP=rest/raremetal
flask run
You can enable debug mode by also including export FLASK_DEBUG=1
.
For production, use gunicorn
:
gunicorn -b 127.0.0.1:[port] -w [n workers] -k gevent --pythonpath rest "raremetal:create_app()"
The full API specification can be found in a separate document.
raremetal.js and locuszoom.js have built-in support for these APIs. You may only need to parse the /aggregation/metadata
endpoint below to provide UI for users selecting which genotypes/phenotypes/masks they would like to use. After that, locuszoom can leverage raremetal.js to make the appropriate calls and parse the response accordingly.
As a brief summary: there are two primary endpoints of interest:
- GET
/aggregation/metadata
- this endpoint shows available datasets - POST
/aggregation/covariance
- computes covariance and score statistics within a region
The metadata endpoint returns JSON that looks like:
{
"data": [
{
"description": "1000G chr22 Testing VCF",
"genomeBuild": "GRCh37",
"genotypeDataset": 1,
"masks": [
{
"id": 1,
"name": "AF < 0.01",
"description": "Variants with allele frequency < 1%",
"groupType": "GENE",
"identifierType": "ENSEMBL",
},
{
"id": 2,
"name": "AF < 0.05",
"description": "Variants with allele frequency < 5%",
"groupType": "GENE",
"identifierType": "ENSEMBL",
}
],
"name": "1000G",
"phenotypeDatasets": [
{
"description": "An example set of randomly generated phenotypes for 1000G",
"name": "1000G random phenotypes",
"phenotypeDataset": 1,
"phenotypes": [
{
"description": "A random binary phenotype",
"name": "rand_binary"
},
{
"description": "A random quantitative phenotype",
"name": "rand_qt"
}
]
},
{
"description": "Adding a second set of phenotypes for 1000G",
"name": "1000G random phenotypes II",
"phenotypeDataset": 2,
"phenotypes": [
{
"description": "Another random quantitative phenotype",
"name": "ANOTHER_RAND_QT"
}
]
}
]
}
]
}
Each genotype dataset (think VCF) has a number of masks and phenotype datasets associated with it.
The covariance endpoint is a POST request, the query is a JSON object of the following form:
{
"chrom": "22",
"start": 50276998,
"stop": 50357719,
"genotypeDataset": 1,
"phenotypeDataset": 1,
"phenotype": "rand_qt",
"samples": "ALL",
"genomeBuild": "GRCh37",
"masks": [1]
}
This tells the covariance endpoint to compute within the region 22:50276998-50357719, using the genotype dataset with ID 1, phenotype dataset with ID 1, and mask with ID 1. The phenotype will be rand_qt
(in the test/example data, the phenotypes are randomly generated, since we can't release actual data.)
Typically you will always set samples
to ALL
. The server supports arbitrary subsets of samples, such as sub-populations, that can be specified with the flask create-sample-subset
command.
Rather than using a server-side mask and supplying an ID (as in the above example), alternatively you can send your own mask definitions created in the browser/client to the server.
The response will look like:
{
"data": {
"genotypeDataset": 1,
"description": "1000G chr22 Testing VCF",
"phenotypeDataset": 1,
"phenotype": "rand_qt",
"sigmaSquared": 0.081,
"nSamples": 2504,
"variants": [
{
"variant": "2:21228642_G/A",
"altFreq": 0.033,
"pvalue": 0.000431,
"score": 0.1
}
],
"groups": [
{
"groupType": "gene",
"group": "ENSG000001",
"mask": 1,
"variants": ["2:21228642_G/A"],
"covariance": [0.3],
}
]
}
}
The first few entries are simply repeating back the request parameters, namely genotypeDataset
, phenotypeDataset
, phenotype
.
The remaining parameters:
sigmaSquared
- this is the variance of the phenotype, in this examplerand_qt
is the phenotype.nSamples
- number of samples that went into the analysisvariants
- an array of variant objects, each containing a score statistic, p-value, and alt allele frequencygroups
- an array of group objects. Each group object is the result of calculating covariance for each combination of (group, mask). Within each group, covariance is returned as the linearized upper triangle of the covariance matrix.