fit_hbd_pdr_on_grid {castor}R Documentation

Fit pulled diversification rates of birth-death models on a time grid.

Description

Given an ultrametric timetree, estimate the pulled diversification rate of homogenous birth-death (HBD) models that best explains the tree via maximum likelihood. Every HBD model is defined by some speciation and extinction rates (λ and μ) over time, as well as the sampling fraction ρ (fraction of extant species sampled). “Homogenous” refers to the assumption that, at any given moment in time, all lineages exhibit the same speciation/extinction rates. For any given HBD model there exists an infinite number of alternative HBD models that predict the same deterministic lineages-through-time curve and yield the same likelihood for any given reconstructed timetree; these “congruent” models cannot be distinguished from one another solely based on the tree.

Each congruence class is uniquely described by the “pulled diversification rate” (PDR; Louca et al 2018), defined as PDR=λ-μ+λ^{-1}dλ/dτ (where τ is time before present) as well as the product ρλ_o (where λ_o is the present-day speciation rate). That is, two HBD models are congruent if and only if they have the same PDR and the same product ρλ_o. This function is designed to estimate the generating congruence class for the tree, by fitting the PDR on a grid of discrete times as well as the product ρλ_o.

Usage

fit_hbd_pdr_on_grid(  tree, 
                      oldest_age            = NULL,
                      age0                  = 0,
                      age_grid              = NULL,
                      min_PDR               = -Inf,
                      max_PDR               = +Inf,
                      min_rholambda0        = 1e-10,
                      max_rholambda0        = +Inf,
                      guess_PDR             = NULL,
                      guess_rholambda0      = NULL,
                      fixed_PDR             = NULL,
                      fixed_rholambda0      = NULL,
                      splines_degree        = 1,
                      condition             = "auto",
                      relative_dt           = 1e-3,
                      Ntrials               = 1,
                      Nbootstraps           = 0,
                      Ntrials_per_bootstrap = NULL,
                      Nthreads              = 1,
                      max_model_runtime     = NULL,
                      fit_control           = list(),
                      verbose               = FALSE,
                      verbose_prefix        = "")

Arguments

tree

An ultrametric timetree of class "phylo", representing the time-calibrated phylogeny of a set of extant species.

oldest_age

Strictly positive numeric, specifying the oldest time before present (“age”) to consider when calculating the likelihood. If this is equal to or greater than the root age, then oldest_age is taken as the stem age, and the classical formula by Morlon et al. (2011) is used. If oldest_age is less than the root age, the tree is split into multiple subtrees at that age by treating every edge crossing that age as the stem of a subtree, and each subtree is considered an independent realization of the HBD model stemming at that age. This can be useful for avoiding points in the tree close to the root, where estimation uncertainty is generally higher. If oldest_age==NULL, it is automatically set to the root age.

age0

Non-negative numeric, specifying the youngest age (time before present) to consider for fitting, and with respect to which rholambda0 is defined. If age0>0, then rholambda0 refers to the product of the sampling fraction at age age0 and the speciation rate at age age0. See below for more details.

age_grid

Numeric vector, listing ages in ascending order at which the PDR is allowed to vary independently. This grid must cover at least the age range from age0 to oldest_age. If NULL or of length <=1 (regardless of value), then the PDR is assumed to be time-independent.

min_PDR

Numeric vector of length Ngrid (=max(1,length(age_grid))), or a single numeric, specifying lower bounds for the fitted PDR at each point in the age grid. If a single numeric, the same lower bound applies at all ages. Use -Inf to omit lower bounds.

max_PDR

Numeric vector of length Ngrid, or a single numeric, specifying upper bounds for the fitted PDR at each point in the age grid. If a single numeric, the same upper bound applies at all ages. Use +Inf to omit upper bounds.

min_rholambda0

Strictly positive numeric, specifying the lower bound for the fitted ρλ_o (sampling fraction times present-day extinction rate).

max_rholambda0

Strictly positive numeric, specifying the upper bound for the fitted ρλ_o. Set to +Inf to omit this upper bound.

guess_PDR

Initial guess for the PDR at each age-grid point. Either NULL (an initial guess will be computed automatically), or a single numeric (guessing the same PDR at all ages) or a numeric vector of size Ngrid specifying a separate guess at each age-grid point. To omit an initial guess for some but not all age-grid points, set their guess values to NA. Guess values are ignored for non-fitted (i.e., fixed) parameters.

guess_rholambda0

Numeric, specifying an initial guess for the product ρλ_o. If NULL, a guess will be computed automatically.

fixed_PDR

Optional fixed (i.e. non-fitted) PDR values on one or more age-grid points. Either NULL (PDR is not fixed anywhere), or a single numeric (PDR fixed to the same value at all grid points) or a numeric vector of size Ngrid (PDR fixed at one or more age-grid points, use NA for non-fixed values).

fixed_rholambda0

Numeric, optionally specifying a fixed value for the product ρλ_o. If NULL or NA, the product ρλ_o is estimated.

splines_degree

Integer between 0 and 3 (inclusive), specifying the polynomial degree of the PDR between age-grid points. If 0, then the PDR is considered piecewise constant, if 1 then the PDR is considered piecewise linear, if 2 or 3 then the PDR is considered to be a spline of degree 2 or 3, respectively. The splines_degree influences the analytical properties of the curve, e.g. splines_degree==1 guarantees a continuous curve, splines_degree==2 guarantees a continuous curve and continuous derivative, and so on. A degree of 0 is generally not recommended.

condition

Character, either "crown", "stem" or "auto", specifying on what to condition the likelihood. If "crown", the likelihood is conditioned on the survival of the two daughter lineages branching off at the root. If "stem", the likelihood is conditioned on the survival of the stem lineage. Note that "crown" really only makes sense when oldest_age is equal to the root age, while "stem" is recommended if oldest_age differs from the root age. If "auto", the condition is chosen according to the recommendations mentioned earlier.

relative_dt

Strictly positive numeric (unitless), specifying the maximum relative time step allowed for integration over time, when calculating the likelihood. Smaller values increase integration accuracy but increase computation time. Typical values are 0.0001-0.001. The default is usually sufficient.

Ntrials

Integer, specifying the number of independent fitting trials to perform, each starting from a random choice of model parameters. Increasing Ntrials reduces the risk of reaching a non-global local maximum in the fitting objective.

Nbootstraps

Integer, specifying an optional number of bootstrap samplings to perform, for estimating standard errors and confidence intervals of maximum-likelihood fitted parameters. If 0, no bootstrapping is performed. Typical values are 10-100. At each bootstrap sampling, a random timetree is generated under the birth-death model according to the fitted PDR and ρλ_o, the parameters are estimated anew based on the generated tree, and subsequently compared to the original fitted parameters. Each bootstrap sampling will use roughly the same information and similar computational resources as the original maximum-likelihood fit (e.g., same number of trials, same optimization parameters, same initial guess, etc).

Ntrials_per_bootstrap

Integer, specifying the number of fitting trials to perform for each bootstrap sampling. If NULL, this is set equal to max(1,Ntrials). Decreasing Ntrials_per_bootstrap will reduce computation time, at the expense of potentially inflating the estimated confidence intervals; in some cases (e.g., for very large trees) this may be useful if fitting takes a long time and confidence intervals are very narrow anyway. Only relevant if Nbootstraps>0.

Nthreads

Integer, specifying the number of parallel threads to use for performing multiple fitting trials simultaneously. This should generally not exceed the number of available CPUs on your machine. Parallel computing is not available on the Windows platform.

max_model_runtime

Optional numeric, specifying the maximum number of seconds to allow for each evaluation of the likelihood function. Use this to abort fitting trials leading to parameter regions where the likelihood takes a long time to evaluate (these are often unlikely parameter regions).

fit_control

Named list containing options for the nlminb optimization routine, such as iter.max, eval.max or rel.tol. For a complete list of options and default values see the documentation of nlminb in the stats package.

verbose

Logical, specifying whether to print progress reports and warnings to the screen. Note that errors always cause a return of the function (see return values success and error).

verbose_prefix

Character, specifying the line prefix for printing progress reports to the screen.

Details

If age0>0, the input tree is essentially trimmed at age0 (omitting anything younger than age0), and the PDR and rholambda0 are fitted to this new (shorter) tree, with time shifted appropriately. The fitted rholambda0 is thus the product of the sampling fraction at age0 and the speciation rate at age0. Note that the sampling fraction at age0 is simply the fraction of lineages extant at age0 that are represented in the timetree.

It is generally advised to provide as much information to the function fit_hbd_pdr_on_grid as possible, including reasonable lower and upper bounds (min_PDR, max_PDR, min_rholambda0 and max_rholambda0) and a reasonable parameter guess (guess_PDR and guess_rholambda0). It is also important that the age_grid is sufficiently fine to capture the expected major variations of the PDR over time, but keep in mind the serious risk of overfitting when age_grid is too fine and/or the tree is too small.

Value

A list with the following elements:

success

Logical, indicating whether model fitting succeeded. If FALSE, the returned list will include an additional “error” element (character) providing a description of the error; in that case all other return variables may be undefined.

objective_value

The maximized fitting objective. Currently, only maximum-likelihood estimation is implemented, and hence this will always be the maximized log-likelihood.

objective_name

The name of the objective that was maximized during fitting. Currently, only maximum-likelihood estimation is implemented, and hence this will always be “loglikelihood”.

loglikelihood

The log-likelihood of the fitted model for the given timetree.

fitted_PDR

Numeric vector of size Ngrid, listing fitted or fixed pulled diversification rates (PDR) at each age-grid point. Between grid points the fitted PDR should be interpreted as a piecewise polynomial function (natural spline) of degree splines_degree; to evaluate this function at arbitrary ages use the castor routine evaluate_spline.

fitted_rholambda0

Numeric, specifying the fitted or fixed product ρλ(0).

guess_PDR

Numeric vector of size Ngrid, specifying the initial guess for the PDR at each age-grid point.

guess_rholambda0

Numeric, specifying the initial guess for ρλ(0).

age_grid

The age-grid on which the PDR is defined. This will be the same as the provided age_grid, unless the latter was NULL or of length <=1.

NFP

Integer, number of fitted (i.e., non-fixed) parameters. If none of the PDRs or ρλ0 were fixed, this will be equal to Ngrid+1.

AIC

The Akaike Information Criterion for the fitted model, defined as 2k-2\log(L), where k is the number of fitted parameters and L is the maximized likelihood.

BIC

The Bayesian information criterion for the fitted model, defined as \log(n)k-2\log(L), where k is the number of fitted parameters, n is the number of data points (number of branching times), and L is the maximized likelihood.

converged

Logical, specifying whether the maximum likelihood was reached after convergence of the optimization algorithm. Note that in some cases the maximum likelihood may have been achieved by an optimization path that did not yet converge (in which case it's advisable to increase iter.max and/or eval.max).

Niterations

Integer, specifying the number of iterations performed during the optimization path that yielded the maximum likelihood.

Nevaluations

Integer, specifying the number of likelihood evaluations performed during the optimization path that yielded the maximum likelihood.

bootstrap_estimates

If Nbootstraps>0, this will be a named list containing the elements PDR (numeric matrix of size Nbootstraps x Ngrid, listing the fitted PDR at each grid point and for each bootstrap) and rholambda0 (a numeric vector of size Nbootstraps, listing the fitted ρλ_o for each bootstrap).

standard_errors

If Nbootstraps>0, this will be a named list containing the elements PDR (numeric vector of size Ngrid, listing bootstrap-estimated standard errors for the fitted PDRs) and rholambda0 (a single numeric, bootstrap-estimated standard error for the fitted ρλ_o).

medians

If Nbootstraps>0, this will be a named list containing the elements PDR (numeric vector of size Ngrid, listing median fitted PDRs across bootstraps) and rholambda0 (a single numeric, median fitted ρλ_o across bootstraps).

CI50lower

If Nbootstraps>0, this will be a named list containing the elements PDR (numeric vector of size Ngrid, listing bootstrap-estimated lower bounds of the 50-percent confidence intervals for the fitted PDRs) and rholambda0 (a single numeric, bootstrap-estimated lower bound of the 50-percent confidence intervals for the fitted ρλ_o).

CI50upper

Similar to CI50lower, listing upper bounds of 50-percentile confidence intervals.

CI95lower

Similar to CI50lower, listing lower bounds of 95-percentile confidence intervals.

CI95upper

Similar to CI95lower, listing upper bounds of 95-percentile confidence intervals.

Author(s)

Stilianos Louca

References

S. Louca et al. (2018). Bacterial diversification through geological time. Nature Ecology & Evolution. 2:1458-1467.

See Also

simulate_deterministic_hbd

loglikelihood_hbd

fit_hbd_model_parametric

fit_hbd_model_on_grid

fit_hbd_pdr_parametric

Examples

## Not run: 
# Generate a random tree with exponentially varying lambda & mu
Ntips     = 10000
rho       = 0.5 # sampling fraction
time_grid = seq(from=0, to=100, by=0.01)
lambdas   = 2*exp(0.1*time_grid)
mus       = 1.5*exp(0.09*time_grid)
sim       = generate_random_tree( parameters  = list(rarefaction=rho), 
                                  max_tips    = Ntips/rho,
                                  coalescent  = TRUE,
                                  added_rates_times     = time_grid,
                                  added_birth_rates_pc  = lambdas,
                                  added_death_rates_pc  = mus)
tree = sim$tree
root_age = castor::get_tree_span(tree)$max_distance
cat(sprintf("Tree has %d tips, spans %g Myr\n",length(tree$tip.label),root_age))

# calculate true PDR
lambda_slopes = diff(lambdas)/diff(time_grid);
lambda_slopes = c(lambda_slopes[1],lambda_slopes)
PDRs = lambdas - mus - (lambda_slopes/lambdas)


# Fit PDR on grid
Ngrid     = 10
age_grid  = seq(from=0,to=root_age,length.out=Ngrid)
fit = fit_hbd_pdr_on_grid(tree, 
          age_grid    = age_grid,
          min_PDR     = -100,
          max_PDR     = +100,
          condition   = "crown",
          Ntrials     = 10,	# perform 10 fitting trials
          Nthreads    = 2,	# use two CPUs
          max_model_runtime = 1) 	# limit model evaluation to 1 second
if(!fit$success){
  cat(sprintf("ERROR: Fitting failed: %s\n",fit$error))
}else{
  cat(sprintf("Fitting succeeded:\nLoglikelihood=%g\n",fit$loglikelihood))
  
  # plot fitted & true PDR
  plot( x     = fit$age_grid,
        y     = fit$fitted_PDR,
        main  = 'Fitted & true PDR',
        xlab  = 'age',
        ylab  = 'PDR',
        type  = 'b',
        col   = 'red',
        xlim  = c(root_age,0)) 
  lines(x     = sim$final_time-time_grid,
        y     = PDRs,
        type  = 'l',
        col   = 'blue');
        
  # get fitted PDR as a function of age
  PDR_fun = approxfun(x=fit$age_grid, y=fit$fitted_PDR)
}

## End(Not run)

[Package castor version 1.6.7 Index]