Date: | 22 Dec 2016 |
---|---|
Authors: | Gao He
Vangelis Daivd M Antonio B |
- The plan is to create a QC pipeline based on Plink and other tools that can be automated using Ruffus and CGAT tools.
- Based on Gao's and other work the genotype data looks very good in general.
- CGAT, ruffus, drmaa are in place and generally working with a few bumps.
- We have most of the scripts we need although they require a lot of cleaning up.
- Gao, Vangelis and I have discussed and agreed the steps to follow. These basically mirror de UK Biobank set up.
- There is a pipeline in progress in the src folder with a dummy report in the report folder (based on CGAT's quickstart pipeline).
- After the QC steps we will discuss the need for an imputation pipeline and what to do with the different platforms. i.e. Merge them, run as validation, as meta-analysis, etc. This may be project/question dependent though.
- For imputation it would be ideal to do it ourselves. This needs person programming time and accessing the latest reference panel (we should try to get the HRC, Vangelis is looking into this).
- Most of the scripts (from Gao and I and third party) are ready but need cleaning up, making sure they are callable and put into a pipeline.
- I've marked what needs to be done below with "TO DO" and a short explanantion, mostly either creating the stand-alone script for that step or checking the existing one and transferring the command into the ruffus pipeline or cleaning it up/turning into a script.
- Check travis setup, is pep8 running? flake8 gives errors and log at travis
- Update external dependencies file
- Keep track of installations for Docker, David has an initial dockerfile
.. todo::
| TO DO: Location of basic phenotype data | TO DO: location of current genotype data | TO DO: location of WTCHG Core Facility reports
Note
There are three platforms: Exome, CoreExome and an Affy chip
.. todo:: TO DO: location of each
Note
Plates 1-27 are batch 1, 28-53 are batch 2.
Gao has done cleaning up, formatting, sanity checks, etc. already. | See emails
Subject: RE: airwave QC discussion From: "Gao, He" <[email protected]> Date: 19/09/2016, 17:17 To: "Berlanga, Antonio J" <[email protected]> CC: "Evangelou, Evangelos" <[email protected]>
- updated the alleles (from A/B to TOP/BOT annotation), strand and genome build.
- updated new plink files
- calculated the sample and SNP missingness and HWE p-values PER BATCH. Sample missingness is very low in general. SNP missingness looks ok for all batches except for n23 and n46. (I think when calculating the SNP call rates, we need to exclude the non-autosomal SNPs.)
- fixed the four batches and the plink files have been generated. In case you need to run any loop there should be no problem now.
- calculated the number of samples and some full batches do not have 288 samples in the intensity files (and therefore also in the plink files).
.. todo:: TO DO: more emails missing, update/transfer info here for clarity and follow-up
Files to check to pull out commands for most of the steps below: | plink_preprocessing.txt | QC_plink_individuals.sh | markers_QC_Airwave.sh
These scripts were run as QC of markers first, then individual samples. Steps in this pipeline follow the UK Biobank protocol (which goes back and forth between markers and individuals as it first creates a homogeneous group [based on ethnicity] from where to pull out high quality genetic markers which are not confounded by population stratification.
Note
Keep scripts, modules, pipelines separate.
.. todo:: TO DO: scan/ppt pipeline workflow plus notes
- Pre-QC steps, GenomeStudio to plink, hg19 liftover, flip strand:
- GenomeStudio to plink: by zcall script:
- Script: /groupvol/med-bio/epiUKB/Airwave/coreExome_zcall/zcall_v3.4/convertReportToTPED.py
- Job submission script: /groupvol/med-bio/epiUKB/Airwave/coreExome_genotype/plinkFiles/1_convertReportToTPED.pbs
- Result files: /groupvol/med-bio/epiUKB/Airwave/coreExome_genotype/plinkFiles
- Convert from AB allele to illumina TOP/BOT annotation: by plink, using Wrayner's annotation files
- Strand files: /groupvol/med-bio/epiUKB/Airwave/strandFiles
(from http://www.well.ox.ac.uk/~wrayner/strand/) + Command: plink --noweb --bfile --update-alleles humancoreexome-12v1-1_a.update_alleles.txt --make-bed --out
- Update genome build: hg19/build 37 liftover: by plink, using Wrayner's annotation files, also handles strand
- This includes updating a few attributes (chromosome, position, strand flipping etc)
- Script: http://www.well.ox.ac.uk/~wrayner/strand/update_build.sh
Allele frequency report with proportions:
- TO DO write commands into ruffus pipeline, e.g. (see also sh scripts above):
- plink2 --bifle xxx --freq
- cat plink.frq | tr -s ' ' 't' | cut -f 4 | grep A | wc -l # First column is a tab, so fourth is A1
Select homogeneous set of samples to use as set for marker QC (PCA based, with automatic selection using e.g. 'aberrant' R package. This is to avoid artefacts from population structure. Excluded samples are later re-introduced.):
- http://bioinformatics.oxfordjournals.org/content/28/1/134.full.pdf+html
- Use summary statistics, and/or: missingness, ancestry, probe intensity, gender separately:
- TO DO write commands into ruffus pipeline:
- Merge plates first
- TO DO write commands into ruffus pipeline (see scripts above although PCA tool needs changing to FlashPCA probably as older tools won't run on large number of samples):
- Run PCA against 1000G (or Hapmap) as in UKB appendix 1 (requires using plink MAF >5%, HWE 10^-6, etc for Hapmap or 1000G, then projecting onto these)
- TO DO write script to wrap aberrant and make it callable from CLI within pipeline:
- aberrant with lambda set to 20 for ancestry PC1 and PC2 as summary stats
Per batch marker QC (plink commands; drop failing SNPs from all plates):
- TO DO write script for this, needs loop calling batch 1 vs all other batches, then batch 2 vs all other batches, etc. with parameters (eg p-values and all the criteria below) can be set by user:
- Exclude monomorphic SNPs
- Genotype call rate (<98%)
- Genotype frequency consistency across batches (Fisher's exact test p-value <10^-12)
- Allele frequency consistency versus reference panel (eg Hapmap, Fisher's exact test p-value <10^-12)
- Hardy Weinberg equilibrium (p-value <10^-12)
- Plate/batch PCA (visual outlier detection check)
- Plate/batch merge
Visual test of genotype calls in cluster plots (bin by MAF, pick random subset)
TO DO write scripts for this: Gao has plotted these before and I think has scripts. Obviously can't check thousands of SNPs visually svo either use a random pick (e.g. grab 20 or whatever is plottable) or better grab top 10 highest quality SNPs, bottom 10, 10 failed SNPs, 10 at MAF > 10%, 10 at 1-5%, 10 <1%, etc. The aim is to have some visual sanity check of the raw data for some of the markers.
Pooled sample QC (all samples; based on high quality set of markers from above; plink commands):
TO DO these are plink commands that can be put directly into the ruffus pipeline with a PARAMS config option so user can set different cut-offs (these PARAMS and config file are standard for CGAT pipelines):
- Run with autosomal SNPs only
- Heterozygosity (standard deviation > +/- 3) and genotype failure rates per individual (>5%)
- Relatedness between individuals (IBD cut-off >0.185)
- Gender mis-identification check
VCF check sanity (strand, problematic SNPs, etc.)
TO DO look up tools and insert command into Ruffus, these already exist, plink2 has commands for this.
The output of the genotpe QC pipeline should be input for:
- Imputation + post-imputation QC (discuss this at a later stage)
- If several platforms, re-run whole script with each platform as separate batch
- Final set of QC'd and imputed individuals and markers can then be processed depending on specific question, eg:
- Filter out MAF <1%
- Ancestry PCA (e.g. vs Hapmap populations), filter out on PC2 values <0.072
- Filter out mitochondrial and sex chromosomes
- VCF check sanity (strand, problematic SNPs, etc.)
General protocols and references: | http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UKBiobank_genotyping_QC_documentation-web.pdf | http://www.nature.com/nprot/journal/v5/n9/pdf/nprot.2010.116.pdf | http://www.nature.com/nprot/journal/v10/n9/pdf/nprot.2015.077.pdf | http://www.nature.com/ng/journal/vaop/ncurrent/pdf/ng.3656.pdf
Also see: | Quality control and conduct of genome-wide association meta-analyses | http://www.nature.com/nprot/journal/v9/n5/full/nprot.2014.071.html
.. todo:: move this to the next pipeline