fitDiscrete {geiger}R Documentation

Model fitting for discrete comparative data

Description

fitting macroevolutionary models to phylogenetic trees

Usage

fitDiscrete(phy, dat,
    model = c("ER","SYM","ARD","meristic"),
    transform = c("none", "EB", "lambda", "kappa", "delta", "white"),
    bounds = list(), control = list(method = c("subplex", "L-BFGS-B"),
    niter = 100, FAIL = 1e+200, hessian = FALSE, CI = 0.95), ncores=NULL,
    ...)
## S3 method for class 'gfit'
as.Qmatrix(x, ...)

Arguments

phy

a phylogenetic tree of class phylo

dat

data vector for a single trait, with names matching tips in phy

model

an Mkn model to fit to comparative data (see Details)

transform

an evolutionary model used to transform the tree (see Details)

bounds

range to constrain parameter estimates (see Details)

control

settings used for optimization of the model likelihood

ncores

Number of cores. If NULL then number of cores is detected

x

Object of class "gfit" for S3 method as.Qmatrix

...

if model="meristic", ... can dictate whether the matrix is asymmetric (symmetric=FALSE)

Details

This function fits various likelihood models for discrete character evolution. The function returns parameter estimates and the likelihood for univariate datasets. All of the models are continuous-time Markov models of trait evolution (see Yang 2006 for a good general discussion of this type of model).

The model likelihood is maximized using methods available in optim as well as subplex. Optimization methods to be used within optim can be specified through the control object.

A number of random starting points are used in optimization and are given through the niter element within the control object (e.g., control$niter). Finding the maximum likelihood fit is sometimes tricky, especially as the number of parameters in the model increases. Even in the example below, a slightly suboptimal fit is occasionally returned with the default settings fitting the general (ARD) model. There is no rule of thumb for the number of iterations that will be appropriate for a given dataset and model, but one use the variance in fitted likelihoods across iterations as an indication of the difficulty of the likelihood space (see details of the res object in Value). Twenty optimization iterations per parameter seems to be a decent starting point for fitting these models.

The FAIL value within the control object should be a large value that will be considerably far from -lnL of the maximum model likelihood. In most cases, the default setting for control$FAIL will be appropriate. The Hessian may be used to compute confidence intervals (CI) for the parameter estimates if the hessian element in control is TRUE.

The function can handle traits with any number of character states, under a range of models. The character model is specified by the model argument:

The transform argument allows one to test models where rates vary across the tree. Bounds for the relevant parameters of the tree transform may be given through the bounds argument. Several bounds can be given at a time. Default bounds under the different models are given below. Options for transform are as follows:

Value

fitDiscrete returns a list with the following four elements:

\bold{lik}

is the function used to compute the model likelihood. The returned function (lik) takes arguments that are necessary for the given model. For instance, if estimating an untransformed ER model, there would be a single argument (the transition rate) necessary for the lik function. The tree and data are stored internally within the lik function, which permits those elements to be efficiently reused when computing the likelihood under different parameter values. By default, the function evaluates the likelihood of the model by weighting root states in accordance with their conditional probability given the data (this is the "obs" option; see FitzJohn et al. 2009). This default behavior can be changed in the call to lik with lik(pars, root="flat"), for instance, which would weight each state equally at the root. The other useful option is "given", where the user must also supply a vector (root.p) of probabilities for each possible state. To make likelihoods roughly comparable between geiger and ape, one should use the option lik(pars, root="given", root.p=rep(1,k)), where k is the number of character states. See Examples for a demonstration

\bold{bnd}

is a matrix of the used bounds for the relevant parameters estimated in the model. Warnings will be issued if any parameter estimates occur at the supplied (or default) parameter bounds

\bold{res}

is a matrix of results from optimization. Rownames of the res matrix are the optimization methods (see optim and subplex). The columns in the res matrix are the estimated parameter values, the estimated model likelihood, and an indication of optimization convergence. Values of convergence not equal to zero are not to be trusted

\bold{opt}

is a list of the primary results: estimates of the parameters, the maximum-likelihood estimate (lnL) of the model, the optimization method used to compute the MLE, the number of model parameters (k, including one parameter for the root state), the AIC (aic), sample-size corrected AIC (aicc). The number of observations for AIC computation is taken to be the number of trait values observed. If the Hessian is used, confidence intervals on the parameter estimates (CI) and the Hessian matrix (hessian) are also returned

Note

To speed the likelihood search, one may set an environment variable to make use of parallel processing, used by mclapply. To set the environment variable, use options(mc.cores=INTEGER), where INTEGER is the number of available cores. Alternatively, the mc.cores variable may be preset upon the initiation of an R session (see Startup for details).

Author(s)

LJ Harmon, RE Glor, RG FitzJohn, and JM Eastman

References

Yang Z. 2006. Computational Molecular Evolution. Oxford University Press: Oxford. FitzJohn RG, WP Maddison, and SP Otto. 2009. Estimating trait-dependent speciation and extinction rates from incompletely resolved molecular phylogenies. Systematic Biology 58:595-611.

Examples

## Not run: 
## match data and tree
tmp=get(data(geospiza))
td=treedata(tmp$phy, tmp$dat)
geo=list(phy=td$phy, dat=td$data)
gb=round(geo$dat[,5]) ## create discrete data
names(gb)=rownames(geo$dat)

tmp=fitDiscrete(geo$phy, gb, model="ER", control=list(niter=5), ncores=2) #-7.119792
## using the returned likelihood function
lik=tmp$lik
lik(0.3336772, root="obs") #-7.119792
lik(0.3336772, root="flat") #-8.125354
lik(0.3336772, root="given", root.p=rep(1/3,3)) #-8.125354
lik(0.3336772, root="given", root.p=c(0, 1, 0)) #-7.074039
lik(c(0.3640363), root="given", root.p=rep(1,3)) #-7.020569 & comparable to ape:::ace solution

## End(Not run)

# general model (ARD)
## match data and tree
tmp=get(data(geospiza))
td=treedata(tmp$phy, tmp$dat)
geo=list(phy=td$phy, dat=td$data)
gb=round(geo$dat[,5]) ## create discrete data
names(gb)=rownames(geo$dat)
fitDiscrete(geo$phy, gb, model="ARD", ncores=1) #-6.064573

# user-specified rate classes
mm=rbind(c(NA, 0, 0), c(1, NA, 2), c(0, 2, NA))
fitDiscrete(geo$phy, gb, model=mm, ncores=1) #-7.037944

# symmetric-rates model
fitDiscrete(geo$phy, gb, model="SYM", ncores=1)#-6.822943

[Package geiger version 2.0.11 Index]