R: Model fitting for discrete comparative data

fitDiscrete {geiger}

R Documentation

Model fitting for discrete comparative data

Description

fitting macroevolutionary models to phylogenetic trees

Usage

fitDiscrete(phy, dat,
    model = c("ER","SYM","ARD","meristic"),
    transform = c("none", "EB", "lambda", "kappa", "delta", "white"),
    bounds = list(), control = list(method = c("subplex", "L-BFGS-B"),
    niter = 100, FAIL = 1e+200, hessian = FALSE, CI = 0.95), ncores=NULL,
    ...)
## S3 method for class 'gfit'
as.Qmatrix(x, ...)

Arguments

`phy`	a phylogenetic tree of class phylo
`dat`	data vector for a single trait, with names matching tips in `phy`
`model`	an Mkn model to fit to comparative data (see Details)
`transform`	an evolutionary model used to transform the tree (see Details)
`bounds`	range to constrain parameter estimates (see Details)
`control`	settings used for optimization of the model likelihood
`ncores`	Number of cores. If `NULL` then number of cores is detected
`x`	Object of class `"gfit"` for S3 method `as.Qmatrix`
`...`	if `model="meristic"`, `...` can dictate whether the matrix is asymmetric (`symmetric=FALSE`)

Details

This function fits various likelihood models for discrete character evolution. The function returns parameter estimates and the likelihood for univariate datasets. All of the models are continuous-time Markov models of trait evolution (see Yang 2006 for a good general discussion of this type of model).

The model likelihood is maximized using methods available in optim as well as subplex. Optimization methods to be used within optim can be specified through the control object.

A number of random starting points are used in optimization and are given through the niter element within the control object (e.g., control$niter). Finding the maximum likelihood fit is sometimes tricky, especially as the number of parameters in the model increases. Even in the example below, a slightly suboptimal fit is occasionally returned with the default settings fitting the general (ARD) model. There is no rule of thumb for the number of iterations that will be appropriate for a given dataset and model, but one use the variance in fitted likelihoods across iterations as an indication of the difficulty of the likelihood space (see details of the res object in Value). Twenty optimization iterations per parameter seems to be a decent starting point for fitting these models.

The FAIL value within the control object should be a large value that will be considerably far from -lnL of the maximum model likelihood. In most cases, the default setting for control$FAIL will be appropriate. The Hessian may be used to compute confidence intervals (CI) for the parameter estimates if the hessian element in control is TRUE.

The function can handle traits with any number of character states, under a range of models. The character model is specified by the model argument:

ER is an equal-rates model of where a single parameter governs all transition rates
SYM is a symmetric model where forward and reverse transitions share the same parameter
ARD is an all-rates-different model where each rate is a unique parameter
meristic is a model wherein transitions occur in a stepwise fashion (e.g., 1 to 2 to 3 to 2) without skipping intermediate steps; this requires a sensible coding of the character states as consecutive integers are assumed to be neighboring states
matrix is a user supplied model (given as a dummy matrix representing transition classes between states); elements that are zero signify rates that are also zero (see Examples)

The transform argument allows one to test models where rates vary across the tree. Bounds for the relevant parameters of the tree transform may be given through the bounds argument. Several bounds can be given at a time. Default bounds under the different models are given below. Options for transform are as follows:

none is a model of rate constancy through time
EB is the Early-burst model (Harmon et al. 2010) and also called the ACDC model (accelerating-decelerating; Blomberg et al. 2003). Set by the a rate parameter, EB fits a model where the rate of evolution increases or decreases exponentially through time, under the model r[t] = r[0] * exp(a * t), where r[0] is the initial rate, a is the rate change parameter, and t is time. Default bounds are a = c(min = -10, max = 10)
lambda is one of the Pagel (1999) models that fits the extent to which the phylogeny predicts covariance among trait values for species. The model effectively transforms the tree: values of lambda near 0 cause the phylogeny to become more star-like, and a lambda value of 1 recovers the none model. Default bounds are lambda = c(min = 0, max = 1)
kappa is a punctuational (speciational) model of trait evolution (Pagel 1999), where character divergence is related to the number of speciation events between two species. Note that if there are speciation events in the given phylogeny (due to extinction or incomplete sampling), interpretation under the kappa model may be difficult. Considered as a tree transformation, the model raises all branch lengths to an estimated power (kappa). Default bounds are kappa = c(min = 0, max = 1)
delta is a time-dependent model of trait evolution (Pagel 1999). The delta model is similar to ACDC insofar as the delta model fits the relative contributions of early versus late evolution in the tree to the covariance of species trait values. Where delta is greater than 1, recent evolution has been relatively fast; if delta is less than 1, recent evolution has been comparatively slow. Intrepreted as a tree transformation, the model raises all node depths to an estimated power (delta). Default bounds are delta = c(min = 0, max = 3)
white is a white-noise (non-phylogenetic) model, which converts the tree into a star phylogeny

Value

fitDiscrete returns a list with the following four elements:

`\bold{lik}`	is the function used to compute the model likelihood. The returned function (`lik`) takes arguments that are necessary for the given model. For instance, if estimating an untransformed `ER` model, there would be a single argument (the transition rate) necessary for the `lik` function. The tree and data are stored internally within the `lik` function, which permits those elements to be efficiently reused when computing the likelihood under different parameter values. By default, the function evaluates the likelihood of the model by weighting root states in accordance with their conditional probability given the data (this is the `"obs"` option; see FitzJohn et al. 2009). This default behavior can be changed in the call to `lik` with `lik(pars, root="flat")`, for instance, which would weight each state equally at the root. The other useful option is `"given"`, where the user must also supply a vector (`root.p`) of probabilities for each possible state. To make likelihoods roughly comparable between geiger and ape, one should use the option `lik(pars, root="given", root.p=rep(1,k))`, where `k` is the number of character states. See Examples for a demonstration
`\bold{bnd}`	is a matrix of the used bounds for the relevant parameters estimated in the model. Warnings will be issued if any parameter estimates occur at the supplied (or default) parameter bounds
`\bold{res}`	is a matrix of results from optimization. Rownames of the `res` matrix are the optimization methods (see `optim` and `subplex`). The columns in the `res` matrix are the estimated parameter values, the estimated model likelihood, and an indication of optimization convergence. Values of convergence not equal to zero are not to be trusted
`\bold{opt}`	is a list of the primary results: estimates of the parameters, the maximum-likelihood estimate (`lnL`) of the model, the optimization method used to compute the MLE, the number of model parameters (`k`, including one parameter for the root state), the AIC (`aic`), sample-size corrected AIC (`aicc`). The number of observations for AIC computation is taken to be the number of trait values observed. If the Hessian is used, confidence intervals on the parameter estimates (`CI`) and the Hessian matrix (`hessian`) are also returned

Note

To speed the likelihood search, one may set an environment variable to make use of parallel processing, used by mclapply. To set the environment variable, use options(mc.cores=INTEGER), where INTEGER is the number of available cores. Alternatively, the mc.cores variable may be preset upon the initiation of an R session (see Startup for details).

Author(s)

LJ Harmon, RE Glor, RG FitzJohn, and JM Eastman

References

Yang Z. 2006. Computational Molecular Evolution. Oxford University Press: Oxford. FitzJohn RG, WP Maddison, and SP Otto. 2009. Estimating trait-dependent speciation and extinction rates from incompletely resolved molecular phylogenies. Systematic Biology 58:595-611.

Examples

## Not run: 
## match data and tree
tmp=get(data(geospiza))
td=treedata(tmp$phy, tmp$dat)
geo=list(phy=td$phy, dat=td$data)
gb=round(geo$dat[,5]) ## create discrete data
names(gb)=rownames(geo$dat)

tmp=fitDiscrete(geo$phy, gb, model="ER", control=list(niter=5), ncores=2) #-7.119792
## using the returned likelihood function
lik=tmp$lik
lik(0.3336772, root="obs") #-7.119792
lik(0.3336772, root="flat") #-8.125354
lik(0.3336772, root="given", root.p=rep(1/3,3)) #-8.125354
lik(0.3336772, root="given", root.p=c(0, 1, 0)) #-7.074039
lik(c(0.3640363), root="given", root.p=rep(1,3)) #-7.020569 & comparable to ape:::ace solution

## End(Not run)

# general model (ARD)
## match data and tree
tmp=get(data(geospiza))
td=treedata(tmp$phy, tmp$dat)
geo=list(phy=td$phy, dat=td$data)
gb=round(geo$dat[,5]) ## create discrete data
names(gb)=rownames(geo$dat)
fitDiscrete(geo$phy, gb, model="ARD", ncores=1) #-6.064573

# user-specified rate classes
mm=rbind(c(NA, 0, 0), c(1, NA, 2), c(0, 2, NA))
fitDiscrete(geo$phy, gb, model=mm, ncores=1) #-7.037944

# symmetric-rates model
fitDiscrete(geo$phy, gb, model="SYM", ncores=1)#-6.822943

[Package geiger version 2.0.11 Index]