The following table compares the features of several genotype file formats:

| Feature | PLINK binary | GEN | BGEN v1.1 | BGEN v1.2 / v1.3 | VCF | BCF |
| --- | :-: | :-: | :-: | :-: | :-: | :-: |
| Supports unphased genotype calls | ✓ | ✓* | ✓* | ✓ | ✓ | ✓ |
| Supports unphased genotype probabilities | | ✓ | ✓ | ✓ | ✓ | ✓ |
| Supports NULL/outlier probability, e.g. NULL class from CHIAMO / GenoSNP | | ✓ | ✓ | | ✓ | ✓ |
| Supports non-diploid samples | | † | † | ✓ | ✓‡ | ✓‡ |
| Supports phased data? | | | | ✓ | ✓‡ | ✓‡ |
| Supports multi-allelic variants | | | | ✓ | ✓ | ✓ |
| Efficient representation? | ✓ | | ✓ | ✓ | | ✓ |

*Hard-called genotypes are converted to probabilities in GEN and BGEN v1.1.
†By convention, males on the X chromosome are stored as homozygote females in GEN and BGEN v1.1.
‡At the time of writing, the storage of genotype likelihoods and probabilities for non-diploid samples and/or phased data in VCF/BCF is not fully specified.
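
To make the two GEN / BGEN v1.1 conventions in the footnotes concrete, here is a minimal sketch of how a hard call could be turned into a probability triple. The function names are hypothetical, and the all-zero encoding for a missing call is an assumption based on common GEN usage, not something stated in this thread.

```python
from typing import Optional, Tuple

Probs = Tuple[float, float, float]  # (P(AA), P(AB), P(BB))


def hard_call_to_probs(b_count: Optional[int]) -> Probs:
    """Convert a diploid hard call (number of B alleles: 0, 1 or 2)
    into a probability triple with all mass on the called genotype.

    Assumption: a missing call (None) is written as three zeros.
    """
    if b_count is None:
        return (0.0, 0.0, 0.0)
    if b_count not in (0, 1, 2):
        raise ValueError(f"expected 0, 1 or 2 B alleles, got {b_count!r}")
    probs = [0.0, 0.0, 0.0]
    probs[b_count] = 1.0
    return (probs[0], probs[1], probs[2])


def male_x_call_to_probs(b_count: Optional[int]) -> Probs:
    """Convert a haploid call (0 or 1 B alleles) for a male on the
    X chromosome by storing it as the matching homozygote (AA or BB)."""
    if b_count is None:
        return (0.0, 0.0, 0.0)
    if b_count not in (0, 1):
        raise ValueError(f"expected 0 or 1 B alleles, got {b_count!r}")
    # One B allele in a haploid sample is stored as the BB homozygote.
    return hard_call_to_probs(2 * b_count)
```

Note that the heterozygote slot is never used for a male X call under this convention, which is why the table marks non-diploid support with † rather than ✓ for these formats.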
How quickly a file format can be streamed for parallel processing also matters. Binary formats typically do no better than compressed textual data here, so I see a binary format as a premature optimization ;).
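
As a rough illustration of the streaming point, the sketch below reads gzip-compressed text lazily and groups lines into batches that could each be handed to a worker. The function name `stream_batches` and the one-variant-per-line layout are assumptions for the example only; no real GEMMA format is implied.

```python
import gzip
from itertools import islice
from typing import BinaryIO, Iterator, List, Union


def stream_batches(
    source: Union[str, BinaryIO], batch_size: int = 1000
) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` text lines from a gzip stream.

    `source` may be a path or an open binary file object. Only one
    batch is held in memory at a time, so the file is never fully loaded.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    with gzip.open(source, "rt") as text:
        while True:
            batch = list(islice(text, batch_size))
            if not batch:
                return
            yield batch
```

Each batch is independent of the others, so a caller can dispatch batches to a pool of workers while decompression continues on the reading side.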
I suspect that for GEMMA we will end up with our own R/qtl2-based format and convert from one of the above.
Computing probabilities is something we like to control. Also, having GEMMA support multiple formats is not a great idea for maintenance reasons; one type is enough. Conversion will be rapid, so we can pipe it in.
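
The "convert and pipe it in" idea could look roughly like the sketch below: an external converter writes one simple text format to its stdout, and the consumer parses lines as they arrive. The tab-separated `marker<TAB>value...` layout, the function name `read_converted` and the stand-in converter command are all hypothetical; no such converter is named in this thread.

```python
import subprocess
import sys
from typing import Iterator, List, Tuple

# Stand-in for a real converter: a child Python process that prints
# two tab-separated records. Purely illustrative.
DEMO_CMD = [
    sys.executable,
    "-c",
    "print('rs1\\t0.0\\t1.0\\t0.0'); print('rs2\\t1.0\\t0.0\\t0.0')",
]


def read_converted(cmd: List[str]) -> Iterator[Tuple[str, List[float]]]:
    """Run `cmd` and yield (marker, values) for each line of its stdout.

    Lines are parsed as they arrive, so conversion and analysis overlap
    and no intermediate file is written.
    """
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            fields = line.rstrip("\n").split("\t")
            if not fields[0]:
                continue  # skip blank lines
            yield fields[0], [float(x) for x in fields[1:]]
    # Leaving the `with` block waits for the child, so returncode is set.
    if proc.returncode != 0:
        raise RuntimeError(f"converter exited with status {proc.returncode}")
```

With a single input format on the consuming side, supporting a new source format only means writing another converter, which keeps the core code path small.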
In GEMMA, BGEN support was added in a PR. However, there are no tests to validate that code, and I need such tests before I can port it to faster_lmm_d. I also need to test with BGEN files of 500k samples. I believe this would be a great exercise to test GPU support.
PS: This thread tracks the implementation of BGEN file support.