3. treeWAS Function & Arguments
Running treeWAS requires only a single function call. The function needs two inputs: snps, a matrix containing binary genetic data, and phen, a vector containing the phenotype of each individual in your dataset. You can also use tree to provide a phylogenetic tree if you have already built one. See Data for more details on inputs, and read Arguments below to tailor your analysis and outputs. treeWAS should finish running within a couple of minutes, depending on the size of the dataset.
out <- treeWAS(snps = snps,
               phen = phen,
               tree = tree,
               seed = 1)
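To make the input requirements concrete, the sketch below builds a small toy dataset in the expected shape: a labelled binary matrix and a phenotype vector named with the same labels. The data are random and purely illustrative (the sizes and the labels such as ind_1 and locus_1 are assumptions), and the treeWAS call is left commented out so that the snippet runs without the package installed.

```r
# Toy inputs for illustration only (random data, no real signal).
set.seed(1)
n.ind  <- 20   # number of individuals (assumed)
n.loci <- 50   # number of genetic loci (assumed)

# Binary SNP matrix: individuals in rows, loci in columns, both labelled.
snps <- matrix(sample(c(0, 1), n.ind * n.loci, replace = TRUE),
               nrow = n.ind, ncol = n.loci,
               dimnames = list(paste0("ind_", seq_len(n.ind)),
                               paste0("locus_", seq_len(n.loci))))

# Binary phenotype: one value per individual, named with the same labels.
phen <- sample(c(0, 1), n.ind, replace = TRUE)
names(phen) <- rownames(snps)

## Don't run this unless treeWAS is installed:
# out <- treeWAS(snps = snps, phen = phen, tree = "BIONJ", seed = 1)
```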
The treeWAS function takes the following arguments:
## Don't run this:
out <- treeWAS(snps,
               phen,
               tree = c("BIONJ", "NJ", "parsimony", "BIONJ*", "NJ*"),
               phen.type = NULL,
               n.subs = NULL,
               n.snps.sim = ncol(snps)*10,
               chunk.size = ncol(snps),
               mem.lim = FALSE,
               test = c("terminal", "simultaneous", "subsequent"),
               correct.prop = FALSE,
               snps.reconstruction = "parsimony",
               snps.sim.reconstruction = "parsimony",
               phen.reconstruction = "parsimony",
               na.rm = TRUE,
               p.value = 0.01,
               p.value.correct = c("bonf", "fdr", FALSE),
               p.value.by = c("count", "density"),
               dist.dna.model = "JC69",
               plot.tree = TRUE,
               plot.manhattan = TRUE,
               plot.null.dist = TRUE,
               plot.dist = FALSE,
               snps.assoc = NULL,
               filename.plot = NULL,
               seed = NULL)
snps: A matrix containing binary genetic data, with individuals in the rows and genetic loci in the columns, and with both rows and columns labelled.
phen: A vector containing the phenotypic state of each individual, whose length is equal to the number of rows in snps and which is named with the same set of labels. The phenotype can be either binary (character or numeric) or continuous (numeric).
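The labelling requirements on snps and phen can be checked before running an analysis. The helper below, check_inputs, is a hypothetical name introduced here for illustration and is not part of treeWAS; it only verifies the conditions stated above and, as an extra convenience, reorders phen to follow the row order of snps.

```r
# Hypothetical pre-flight check (not a treeWAS function).
check_inputs <- function(snps, phen) {
  stopifnot(is.matrix(snps),
            !is.null(rownames(snps)), !is.null(colnames(snps)),
            !is.null(names(phen)),
            length(phen) == nrow(snps),
            setequal(names(phen), rownames(snps)))
  # Return phen reordered so its names match the row order of snps.
  phen[rownames(snps)]
}

snps <- matrix(c(0, 1, 1, 0, 1, 0), nrow = 3,
               dimnames = list(c("A", "B", "C"), c("L1", "L2")))
phen <- c(C = 1, A = 0, B = 1)   # same labels, different order

phen <- check_inputs(snps, phen)
names(phen)
```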
tree: A phylo object containing the phylogenetic tree; or a character string specifying the method of phylogenetic reconstruction, one of "NJ", "BIONJ" (the default), or "parsimony"; or, if NAs are present in the distance matrix, one of "NJ*" or "BIONJ*".
phen.type: An optional character string specifying whether the phenotypic variable should be treated as "categorical", "discrete" or "continuous". If phen.type is NULL (the default), ancestral state reconstructions performed via ML will treat any binary phenotype as discrete and any non-binary phenotype as continuous. If phen.type is "categorical", ML reconstructions and association tests will treat values as nominal (not ordered) levels and not as meaningful numbers. Categorical phenotypes must have >= 3 unique values (<= 5 recommended). If phen.type is "continuous", ML reconstructions will treat values as meaningful numbers and may infer intermediate values.
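The default rule described for phen.type = NULL can be written out as a small sketch. The function infer_phen_type is a hypothetical helper that mirrors the wording above (binary means discrete, anything else means continuous); it is not taken from the treeWAS source, which may handle edge cases such as missing values differently.

```r
# Hypothetical helper mirroring the documented default for phen.type = NULL.
infer_phen_type <- function(phen) {
  n.levels <- length(unique(phen[!is.na(phen)]))  # ignore NAs when counting
  if (n.levels == 2) "discrete" else "continuous"
}

infer_phen_type(c(0, 1, 1, 0))          # binary numeric
infer_phen_type(c("R", "S", "S", "R"))  # binary character
infer_phen_type(c(0.2, 1.7, 3.1, 2.4))  # non-binary numeric
```

Note that a phenotype with three to five nominal levels would be labelled continuous by this rule, which is why "categorical" has to be requested explicitly.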
n.subs: A numeric vector containing the homoplasy distribution (if known, see details), or NULL (the default).
n.snps.sim: An integer specifying the number of loci to be simulated for estimating the null distribution (by default 10*ncol(snps)). Note that 10x is the recommended minimum: where possible (i.e., for datasets that are not very large), simulating more loci (e.g., 100*ncol(snps)) may further improve results.
chunk.size: An integer indicating the number of snps loci to be analysed at one time. This provides a solution for machines with insufficient memory to analyse the dataset at hand. Note that smaller values of chunk.size will increase the computational time required (e.g., for chunk.size = ncol(snps)/2, treeWAS will take twice as long to complete).
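The effect of chunk.size can be pictured as splitting the column indices of snps into consecutive blocks. The sketch below uses made-up numbers (1000 loci, chunks of 400) and shows only the arithmetic; how treeWAS partitions loci internally is an assumption here.

```r
n.loci     <- 1000   # stands in for ncol(snps) (assumed value)
chunk.size <- 400    # assumed value

# Number of chunks needed to cover all loci.
n.chunks <- ceiling(n.loci / chunk.size)

# Assign each column index to a chunk, then list the indices per chunk.
chunk.id <- ceiling(seq_len(n.loci) / chunk.size)
chunks   <- split(seq_len(n.loci), chunk.id)

n.chunks
lengths(chunks)
```

The last chunk is smaller whenever ncol(snps) is not an exact multiple of chunk.size.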
mem.lim: Either a number or a logical value to establish a memory limit (in GB) that will be used to automatically update the chunk.size argument if there is not enough available memory to run treeWAS in one chunk. If FALSE (the default), no limit is estimated and chunk.size is not changed. If TRUE, the amount of memory currently available is estimated with memfree() and chunk.size is scaled back to account for the amount of memory estimated to be needed by treeWAS for this dataset. If a single numeric value, this is taken to be the amount of memory (in GB) available/designated for use by treeWAS and chunk.size is updated to reflect this.
test: A character string or vector containing one or more of the following available tests of association: "terminal", "simultaneous", "subsequent", "cor", "fisher". By default, the first three tests are run (see details).
correct.prop: A logical indicating whether the "terminal" and "subsequent" tests will be corrected for phenotypic class imbalance. Recommended if the proportion of individuals varies significantly across the levels of the phenotype (if binary) or if the phenotype is skewed (if continuous). If correct.prop is FALSE (the default), the original version of each test is run. If TRUE, an alternate association metric based on the phi correlation coefficient is calculated across the terminal and all (internal and terminal) nodes, respectively.
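For readers unfamiliar with the phi correlation coefficient mentioned under correct.prop, the sketch below computes the textbook version for two binary vectors from their 2x2 contingency counts. The function phi_coef and the example vectors are illustrative assumptions; this is the standard definition, not necessarily the exact metric treeWAS computes.

```r
# Textbook phi coefficient for two binary (0/1) vectors of equal length.
phi_coef <- function(x, y) {
  # Cell counts of the 2x2 table (as doubles to avoid integer overflow).
  n11 <- as.numeric(sum(x == 1 & y == 1))
  n10 <- as.numeric(sum(x == 1 & y == 0))
  n01 <- as.numeric(sum(x == 0 & y == 1))
  n00 <- as.numeric(sum(x == 0 & y == 0))
  (n11 * n00 - n10 * n01) /
    sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
}

snp  <- c(1, 1, 1, 0, 0, 0, 0, 0)   # toy genotype at one locus
phen <- c(1, 1, 0, 0, 0, 0, 0, 1)   # toy binary phenotype

phi_coef(snp, phen)   # 7/15, about 0.467
cor(snp, phen)        # Pearson correlation gives the same value for 0/1 data
```

Because the denominator includes the marginal totals of both variables, the score is normalised for how unevenly the phenotype classes are represented.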
snps.reconstruction: Either a character string specifying "parsimony" (the default) or "ML" (maximum likelihood) for the ancestral state reconstruction of the genetic dataset, or a matrix containing this reconstruction if it has been performed elsewhere and you provide the tree.
snps.sim.reconstruction: A character string specifying "parsimony" (the default) or "ML" (maximum likelihood) for the ancestral state reconstruction of the simulated null genetic dataset.
phen.reconstruction: Either a character string specifying "parsimony" (the default) or "ML" (maximum likelihood) for the ancestral state reconstruction of the phenotypic variable, or a vector containing this reconstruction if it has been performed elsewhere.
na.rm: A logical indicating whether to remove snps columns if they contain more than 75% NAs (by default, TRUE).
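The 75% rule for na.rm can be illustrated with a short base R filter. This is a sketch of the stated criterion on a made-up matrix, assuming that columns with exactly 75% NAs are kept ("more than 75%" is read strictly); it is not the package's own code.

```r
# Toy matrix with 4 individuals and 3 loci (filled column by column).
snps <- matrix(c(0,  1,  NA, 1,    # L1: 25% NA
                 NA, NA, NA, NA,   # L2: 100% NA
                 1,  NA, NA, NA),  # L3: exactly 75% NA
               nrow = 4,
               dimnames = list(paste0("ind_", 1:4), c("L1", "L2", "L3")))

# Proportion of missing values per locus.
na.prop <- colMeans(is.na(snps))

# Keep loci whose NA proportion does not exceed 75%.
snps.kept <- snps[, na.prop <= 0.75, drop = FALSE]
colnames(snps.kept)
```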
p.value: A number specifying the base p-value to be used as the threshold of significance (by default, 0.01).
p.value.correct: A character string, either "bonf" (the default) or "fdr", specifying whether correction for multiple testing should be performed by Bonferroni correction (recommended) or the False Discovery Rate.
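The difference between the two corrections can be seen with base R's p.adjust on a handful of made-up p-values. This is a generic illustration of Bonferroni versus FDR (Benjamini-Hochberg) adjustment; treeWAS derives its significance thresholds from a simulated null distribution, so the way it applies the correction internally may differ from this sketch.

```r
p.value <- 0.01                        # base significance level
p.raw   <- c(1e-7, 1e-5, 0.003, 0.2)   # toy p-values for four loci

# Bonferroni: equivalent to comparing raw p-values with p.value / n.tests.
p.bonf   <- p.adjust(p.raw, method = "bonferroni")
sig.bonf <- p.bonf < p.value

# FDR (Benjamini-Hochberg): less conservative than Bonferroni.
p.fdr   <- p.adjust(p.raw, method = "fdr")
sig.fdr <- p.fdr < p.value

sig.bonf   # TRUE TRUE FALSE FALSE
sig.fdr    # TRUE TRUE TRUE  FALSE
```

In this toy example the third locus passes under FDR but not under Bonferroni, which is the usual trade-off between the two: Bonferroni admits fewer false positives at the cost of power.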
p.value.by: A character string specifying how the upper tail of the p-value distribution is to be identified. Either "count" (the default, recommended) for a simple count-based approach or "density" for a kernel-density based approximation.
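One way to picture the two p.value.by options is as two ways of locating an upper-tail cut-off in a set of null scores. The sketch below uses simulated stand-in scores and base R's quantile and density functions; the variable names and the specific estimators are assumptions for illustration and are not taken from the treeWAS implementation.

```r
set.seed(1)
null.scores <- abs(rnorm(10000))   # stand-in for simulated null association scores
p.value     <- 0.01

# Count-based: the score exceeded by a fraction p.value of the null scores.
thresh.count <- quantile(null.scores, probs = 1 - p.value, names = FALSE)

# Density-based: smooth the null scores with a kernel density estimate,
# then find where its cumulative mass first reaches 1 - p.value.
d   <- density(null.scores)
cdf <- cumsum(d$y) / sum(d$y)
thresh.density <- d$x[which(cdf >= 1 - p.value)[1]]

thresh.count
thresh.density
```

The two cut-offs are usually close; the count-based one depends only on the ranks of the null scores, whereas the density-based one also depends on the smoothing bandwidth.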
dist.dna.model: A character string specifying the evolutionary model used to calculate the genetic distance between individual genomes when reconstructing the phylogenetic tree; only used if tree is a character string (see ?dist.dna).
plot.tree: A logical indicating whether to generate a plot of the phylogenetic tree (TRUE, the default) or not (FALSE).
plot.manhattan: A logical indicating whether to generate a Manhattan plot for each association score (TRUE, the default) or not (FALSE).
plot.null.dist: A logical indicating whether to plot the null distribution of association score statistics (TRUE, the default) or not (FALSE).
plot.dist: A logical indicating whether to plot the true distribution of association score statistics (TRUE) or not (FALSE, the default).
snps.assoc: An optional character string or vector specifying known associated loci to be marked in results plots (e.g., from previous studies or if data is simulated); else NULL.
filename.plot: An optional character string denoting the file location for saving any plots produced (e.g., "C:/Home/treeWAS_plots.pdf"); else NULL.
seed: An optional integer to control the pseudo-randomisation process and allow for identical repeat runs of the function; else NULL.
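The role of seed can be illustrated with base R's own random number generator: fixing the seed before a random draw makes the draw repeatable. The sketch assumes that the seed argument has the same kind of effect on the random steps inside treeWAS, as the description above suggests; the commented call at the end only shows where the argument goes.

```r
# Same seed, same draw.
set.seed(1)
a <- sample(1:100, 5)
set.seed(1)
b <- sample(1:100, 5)
identical(a, b)   # TRUE

# Different seed: the draw will generally differ.
set.seed(2)
c2 <- sample(1:100, 5)

## Don't run this unless treeWAS is installed and snps/phen are defined:
# out1 <- treeWAS(snps = snps, phen = phen, seed = 1)
# out2 <- treeWAS(snps = snps, phen = phen, seed = 1)  # expected to repeat out1
```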