R: Fit pulled diversification rates of birth-death models on a...

fit_hbd_pdr_on_best_grid_size {castor}

R Documentation

Fit pulled diversification rates of birth-death models on a time grid with optimal size.

Description

Given an ultrametric timetree, estimate the pulled diversification rate of homogenous birth-death (HBD) models that best explains the tree via maximum likelihood, automatically determining the optimal time-grid size based on the data. Every HBD model is defined by some speciation and extinction rates (\lambda and \mu) over time, as well as the sampling fraction \rho (fraction of extant species sampled). “Homogenous” refers to the assumption that, at any given moment in time, all lineages exhibit the same speciation/extinction rates. For any given HBD model there exists an infinite number of alternative HBD models that predict the same deterministic lineages-through-time curve and yield the same likelihood for any given reconstructed timetree; these “congruent” models cannot be distinguished from one another solely based on the tree.

Each congruence class is uniquely described by the “pulled diversification rate” (PDR; Louca et al 2018), defined as PDR=\lambda-\mu+\lambda^{-1}d\lambda/d\tau (where \tau is time before present) as well as the product \rho\lambda_o (where \lambda_o is the present-day speciation rate). That is, two HBD models are congruent if and only if they have the same PDR and the same product \rho\lambda_o. This function is designed to estimate the generating congruence class for the tree, by fitting the PDR on a grid of discrete times as well as the product \rho\lambda_o. Internally, the function uses fit_hbd_pdr_on_grid to perform the fitting. The "best" grid size is determined based on some optimality criterion, such as AIC.

Usage

fit_hbd_pdr_on_best_grid_size(tree,
                              oldest_age            = NULL,
                              age0                  = 0,
                              grid_sizes            = c(1,10),
                              uniform_grid          = FALSE,
                              criterion             = "AIC",
                              exhaustive            = TRUE,
                              min_PDR               = -Inf,
                              max_PDR               = +Inf,
                              min_rholambda0        = 1e-10,
                              max_rholambda0        = +Inf,
                              guess_PDR             = NULL,
                              guess_rholambda0      = NULL,
                              fixed_PDR             = NULL,
                              fixed_rholambda0      = NULL,
                              splines_degree        = 1,
                              condition             = "auto",
                              relative_dt           = 1e-3,
                              Ntrials               = 1,
                              Nbootstraps           = 0,
                              Ntrials_per_bootstrap = NULL,
                              Nthreads              = 1,
                              max_model_runtime	    = NULL,
                              fit_control           = list(),
                              verbose               = FALSE,
                              verbose_prefix        = "")

Arguments

`tree`	A rooted ultrametric timetree of class "phylo", representing the time-calibrated phylogeny of a set of extant sampled species.
`oldest_age`	Strictly positive numeric, specifying the oldest time before present (“age”) to consider when calculating the likelihood. If this is equal to or greater than the root age, then `oldest_age` is taken as the stem age. If `oldest_age` is less than the root age, the tree is split into multiple subtrees at that age by treating every edge crossing that age as the stem of a subtree, and each subtree is considered an independent realization of the HBD model stemming at that age. This can be useful for avoiding points in the tree close to the root, where estimation uncertainty is generally higher. If `oldest_age==NULL`, it is automatically set to the root age.
`age0`	Non-negative numeric, specifying the youngest age (time before present) to consider for fitting. If `age0>0`, the tree essentially is trimmed at `age0`, omitting anything younger than `age0`, and the PDR and `\rho\lambda_o` are fitted to the trimmed tree while shifting time appropriately.
`grid_sizes`	Numeric vector, listing alternative grid sizes to consider.
`uniform_grid`	Logical, specifying whether to use uniform time grids (equal time intervals) or non-uniform time grids (more grid points towards the present, where more data are available).
`criterion`	Character, specifying which criterion to use for selecting the best grid. Options are "AIC" and "BIC".
`exhaustive`	Logical, whether to try all grid sizes before choosing the best one. If `FALSE`, the grid size is gradually increased until the selection criterio (e.g., AIC) starts becoming worse, at which point the search is halted. This avoids fitting models with excessive grid sizes when an optimum already seems to have been found at a smaller grid size.
`min_PDR`	Numeric vector of length Ngrid (=`max(1,length(age_grid))`), or a single numeric, specifying lower bounds for the fitted PDR at each point in the age grid. If a single numeric, the same lower bound applies at all ages. Note that in general the PDR may be negative as well as positive.
`max_PDR`	Numeric vector of length Ngrid, or a single numeric, specifying upper bounds for the fitted PDR at each point in the age grid. If a single numeric, the same upper bound applies at all ages. Use `+Inf` to omit upper bounds.
`min_rholambda0`	Strictly positive numeric, specifying the lower bound for the fitted `\rho\lambda_o` (sampling fraction times present-day extinction rate).
`max_rholambda0`	Strictly positive numeric, specifying the upper bound for the fitted `\rho\lambda_o`. Set to `+Inf` to omit this upper bound.
`guess_PDR`	Initial guess for the PDR at each age-grid point. Either `NULL` (an initial guess will be computed automatically), or a single numeric (guessing a constant PDR at all ages), or a function handle (for generating guesses at each grid point; this function may also return `NA` at some time points for which a guess shall be computed automatically).
`guess_rholambda0`	Numeric, specifying an initial guess for the product `\rho\lambda_o`. If `NULL`, a guess will be computed automatically.
`fixed_PDR`	Optional fixed (i.e. non-fitted) PDR values. Either `NULL` (none of the PDR values are fixed) or a function handle specifying the PDR for any arbitrary age (PDR will be fixed at any age for which this function returns a finite number). The function `fixed_PDR()` need not return finite values for all times, in fact doing so would mean that the PDR is not fitted anywhere.
`fixed_rholambda0`	Numeric, optionally specifying a fixed value for the product `\rho\lambda_o`. If `NULL` or `NA`, the product `\rho\lambda_o` is estimated.
`splines_degree`	Integer between 0 and 3 (inclusive), specifying the polynomial degree of the PDR between age-grid points. If 0, then the PDR is considered to be piecewise constant, if 1 then the PDR is considered piecewise linear, if 2 or 3 then the PDR is considered to be a spline of degree 2 or 3, respectively. The `splines_degree` influences the analytical properties of the curve, e.g. `splines_degree==1` guarantees a continuous curve, `splines_degree==2` guarantees a continuous curve and continuous derivative, and so on. A degree of 0 is generally not recommended.
`condition`	Character, either "crown", "stem", "auto", "stemN" or "crownN" (where N is an integer >=2), specifying on what to condition the likelihood. If "crown", the likelihood is conditioned on the survival of the two daughter lineages branching off at the root at that time. If "stem", the likelihood is conditioned on the survival of the stem lineage, with the process having started at `oldest_age`. Note that "crown" and "crownN"" really only make sense when `oldest_age` is equal to the root age, while "stem" is recommended if `oldest_age` differs from the root age. If "stem2", the condition is that the process yielded at least two sampled tips, and similarly for "stem3" etc. If "crown3", the condition is that a splitting occurred at the root age, both child clades survived, and in total yielded at least 3 sampled tips (and similarly for "crown4" etc). If "auto", the condition is chosen according to the recommendations mentioned earlier.
`relative_dt`	Strictly positive numeric (unitless), specifying the maximum relative time step allowed for integration over time, when calculating the likelihood. Smaller values increase integration accuracy but increase computation time. Typical values are 0.0001-0.001. The default is usually sufficient.
`Ntrials`	Integer, specifying the number of independent fitting trials to perform, each starting from a random choice of model parameters. Increasing `Ntrials` reduces the risk of reaching a non-global local maximum in the fitting objective.
`Nbootstraps`	Integer, specifying an optional number of bootstrap samplings to perform, for estimating standard errors and confidence intervals of maximum-likelihood fitted parameters. If 0, no bootstrapping is performed. Typical values are 10-100. At each bootstrap sampling, a random timetree is generated under the birth-death model according to the fitted PDR and `\rho\lambda_o`, the parameters are estimated anew based on the generated tree, and subsequently compared to the original fitted parameters. Each bootstrap sampling will use roughly the same information and similar computational resources as the original maximum-likelihood fit (e.g., same number of trials, same optimization parameters, same initial guess, etc). Bootstrapping is only performed for the best grid size.
`Ntrials_per_bootstrap`	Integer, specifying the number of fitting trials to perform for each bootstrap sampling. If `NULL`, this is set equal to `max(1,Ntrials)`. Decreasing `Ntrials_per_bootstrap` will reduce computation time, at the expense of potentially inflating the estimated confidence intervals; in some cases (e.g., for very large trees) this may be useful if fitting takes a long time and confidence intervals are very narrow anyway. Only relevant if `Nbootstraps>0`.
`Nthreads`	Integer, specifying the number of parallel threads to use for performing multiple fitting trials simultaneously. This should generally not exceed the number of available CPUs on your machine. Parallel computing is not available on the Windows platform.
`max_model_runtime`	Optional numeric, specifying the maximum number of seconds to allow for each evaluation of the likelihood function. Use this to abort fitting trials leading to parameter regions where the likelihood takes a long time to evaluate (these are often unlikely parameter regions).
`fit_control`	Named list containing options for the `nlminb` optimization routine, such as `iter.max`, `eval.max` or `rel.tol`. For a complete list of options and default values see the documentation of `nlminb` in the `stats` package.
`verbose`	Logical, specifying whether to print progress reports and warnings to the screen. Note that errors always cause a return of the function (see return values `success` and `error`).
`verbose_prefix`	Character, specifying the line prefix for printing progress reports to the screen.

Details

It is generally advised to provide as much information to the function fit_hbd_pdr_on_best_grid_size as possible, including reasonable lower and upper bounds (min_PDR, max_PDR, min_rholambda0 and max_rholambda0) and a reasonable parameter guess (guess_PDR and guess_rholambda0).

Value

A list with the following elements:

`success`	Logical, indicating whether the function executed successfully. If `FALSE`, the returned list will include an additional “error” element (character) providing a description of the error; in that case all other return variables may be undefined.
`best_fit`	A named list containing the fitting results for the best grid size. This list has the same structure as the one returned by `fit_hbd_pdr_on_grid`.
`grid_sizes`	Numeric vector, listing the grid sizes as provided during the function call.
`AICs`	Numeric vector of the same length as `grid_sizes`, listing the AIC for each considered grid size. Note that some entries may be `NA`, if the corresponding grid sizes were not considered (if `exhaustive=FALSE`).
`BICs`	Numeric vector of the same length as `grid_sizes`, listing the BIC for each considered grid size. Note that some entries may be `NA`, if the corresponding grid sizes were not considered (if `exhaustive=FALSE`).

Author(s)

Stilianos Louca

References

S. Louca et al. (2018). Bacterial diversification through geological time. Nature Ecology & Evolution. 2:1458-1467.

S. Louca and M. W. Pennell (2020). Extant timetrees are consistent with a myriad of diversification histories. Nature. 580:502-505.

Examples

## Not run: 
# Generate a random tree with exponentially varying lambda & mu
Ntips     = 10000
rho       = 0.5 # sampling fraction
time_grid = seq(from=0, to=100, by=0.01)
lambdas   = 2*exp(0.1*time_grid)
mus       = 1.5*exp(0.09*time_grid)
sim       = generate_random_tree( parameters  = list(rarefaction=rho), 
                                  max_tips    = Ntips/rho,
                                  coalescent  = TRUE,
                                  added_rates_times     = time_grid,
                                  added_birth_rates_pc  = lambdas,
                                  added_death_rates_pc  = mus)
tree = sim$tree
root_age = castor::get_tree_span(tree)$max_distance
cat(sprintf("Tree has %d tips, spans %g Myr\n",length(tree$tip.label),root_age))

# Fit PDR on grid, with the grid size chosen automatically between 1 and 5
fit = fit_hbd_pdr_on_best_grid_size(tree, 
                                    max_PDR           = 100,
                                    grid_sizes        = c(1:5),
                                    exhaustive        = FALSE,
                                    uniform_grid      = FALSE,
                                    Ntrials           = 10,
                                    Nthreads          = 4,
                                    verbose           = TRUE,
                                    max_model_runtime = 1)
if(!fit$success){
  cat(sprintf("ERROR: Fitting failed: %s\n",fit$error))
}else{
  best_fit = fit$best_fit
  cat(sprintf("Fitting succeeded:\nBest grid size=%d\n",length(best_fit$age_grid)))
  # plot fitted PDR
  plot( x     = best_fit$age_grid,
        y     = best_fit$fitted_PDR,
        main  = 'Fitted PDR',
        xlab  = 'age',
        ylab  = 'PDR',
        type  = 'b',
        xlim  = c(root_age,0))         
  # get fitted PDR as a function of age
  PDR_fun = approxfun(x=best_fit$age_grid, y=best_fit$fitted_PDR)
}

## End(Not run)

[Package castor version 1.8.2 Index]