fit_hbd_model_parametric {castor} | R Documentation |

Given an ultrametric timetree, fit a homogenous birth-death (HBD) model in which speciation and extinction rates (*λ* and *μ*) are given as parameterized functions of time before present. “Homogenous” refers to the assumption that, at any given moment in time, all lineages exhibit the same speciation/extinction rates (in the literature this is sometimes referred to simply as “birth-death model”). Every HBD model is defined based on the values that *λ* and *μ* take over time as well as the sampling fraction *ρ* (fraction of extant species sampled); in turn, *λ*, *μ* and *ρ* can be parameterized by a finite set of parameters. This function estimates these parameters by maximizing the likelihood (Morlon et al. 2011) of the timetree under the resulting HBD model.

fit_hbd_model_parametric( tree, param_values, param_guess = NULL, param_min = -Inf, param_max = +Inf, param_scale = NULL, oldest_age = NULL, age0 = 0, lambda, mu = 0, rho0 = 1, age_grid = NULL, condition = "auto", relative_dt = 1e-3, Ntrials = 1, max_start_attempts = 1, Nthreads = 1, max_model_runtime = NULL, Nbootstraps = 0, Ntrials_per_bootstrap = NULL, fit_algorithm = "nlminb", fit_control = list(), focal_param_values = NULL, verbose = FALSE, diagnostics = FALSE, verbose_prefix = "")

`tree` |
An ultrametric timetree of class "phylo", representing the time-calibrated reconstructed phylogeny of a set of extant species. |

`param_values` |
Numeric vector, specifying fixed values for a some or all model parameters. For fitted (i.e., non-fixed) parameters, use |

`param_guess` |
Numeric vector of size NP, specifying a first guess for the value of each model parameter. For fixed parameters, guess values are ignored. Can be |

`param_min` |
Optional numeric vector of size NP, specifying lower bounds for model parameters. If of size 1, the same lower bound is applied to all parameters. Use |

`param_max` |
Optional numeric vector of size NP, specifying upper bounds for model parameters. If of size 1, the same upper bound is applied to all parameters. Use |

`param_scale` |
Optional numeric vector of size NP, specifying typical scales for model parameters. If of size 1, the same scale is assumed for all parameters. If |

`oldest_age` |
Strictly positive numeric, specifying the oldest time before present (“age”) to consider when calculating the likelihood. If this is equal to or greater than the root age, then |

`age0` |
Non-negative numeric, specifying the youngest age (time before present) to consider for fitting, and with respect to which |

`lambda` |
Function specifying the speciation rate at any given age (time before present) and for any given parameter values. This function must take exactly two arguments, the 1st one being a numeric vector (one or more ages) and the 2nd one being a numeric vector of size NP (parameter values), and return a numeric vector of the same size as the 1st argument with strictly positive entries. Can also be a single number (i.e., lambda is fixed). |

`mu` |
Function specifying the extinction rate at any given age and for any given parameter values. This function must take exactly two arguments, the 1st one being a numeric vector (one or more ages) and the 2nd one being a numeric vector of size NP (parameter values), and return a numeric vector of the same size as the 1st argument with non-negative entries. Can also be a single number (i.e., mu is fixed). |

`rho0` |
Function specifying the sampling fraction (fraction of extant species sampled at |

`age_grid` |
Numeric vector, specifying ages at which the |

`condition` |
Character, either "crown", "stem", "none" or "auto", specifying on what to condition the likelihood. If "crown", the likelihood is conditioned on the survival of the two daughter lineages branching off at the root. If "stem", the likelihood is conditioned on the survival of the stem lineage. Note that "crown" really only makes sense when |

`relative_dt` |
Strictly positive numeric (unitless), specifying the maximum relative time step allowed for integration over time, when calculating the likelihood. Smaller values increase integration accuracy but increase computation time. Typical values are 0.0001-0.001. The default is usually sufficient. |

`Ntrials` |
Integer, specifying the number of independent fitting trials to perform, each starting from a random choice of model parameters. Increasing |

`max_start_attempts` |
Integer, specifying the number of times to attempt finding a valid start point (per trial) before giving up on that trial. Randomly choosen extreme start parameters may occasionally result in Inf/undefined likelihoods, so this option allows the algorithm to keep looking for valid starting points. |

`Nthreads` |
Integer, specifying the number of parallel threads to use for performing multiple fitting trials simultaneously. This should generally not exceed the number of available CPUs on your machine. Parallel computing is not available on the Windows platform. |

`max_model_runtime` |
Optional numeric, specifying the maximum number of seconds to allow for each evaluation of the likelihood function. Use this to abort fitting trials leading to parameter regions where the likelihood takes a long time to evaluate (these are often unlikely parameter regions). |

`Nbootstraps` |
Integer, specifying the number of parametric bootstraps to perform for estimating standard errors and confidence intervals of estimated model parameters. Set to 0 for no bootstrapping. |

`Ntrials_per_bootstrap` |
Integer, specifying the number of fitting trials to perform for each bootstrap sampling. If |

`fit_algorithm` |
Character, specifying which optimization algorithm to use. Either "nlminb" or "subplex" are allowed. |

`fit_control` |
Named list containing options for the |

`focal_param_values` |
Optional numeric matrix having NP columns and an arbitrary number of rows, listing combinations of parameter values of particular interest and for which the log-likelihoods should be returned. This may be used for diagnostic purposes, e.g., to examine the shape of the likelihood function. |

`verbose` |
Logical, specifying whether to print progress reports and warnings to the screen. Note that errors always cause a return of the function (see return values |

`diagnostics` |
Logical, specifying whether to print detailed information (such as model likelihoods) at every iteration of the fitting routine. For debugging purposes mainly. |

`verbose_prefix` |
Character, specifying the line prefix for printing progress reports to the screen. |

This function is designed to estimate a finite set of scalar parameters (*p_1,..,p_n\in\R*) that determine the speciation rate *λ*, the extinction rate *μ* and the sampling fraction *ρ*, by maximizing the likelihood of observing a given timetree under the HBD model. For example, the investigator may assume that both *λ* and *μ* vary exponentially over time, i.e. they can be described by *λ(t)=λ_o\cdot e^{-α t}* and *μ(t)=μ_o\cdot e^{-β t}* (where *λ_o*, *μ_o* are unknown present-day rates and *α*, *β* are unknown factors, and *t* is time before present), and that the sampling fraction *ρ* is known. In this case the model has 4 free parameters, *p_1=λ_o*, *p_2=μ_o*, *p_3=α* and *p_4=β*, each of which may be fitted to the tree.

It is generally advised to provide as much information to the function `fit_hbd_model_parametric`

as possible, including reasonable lower and upper bounds (`param_min`

and `param_max`

), a reasonable parameter guess (`param_guess`

) and reasonable parameter scales `param_scale`

. If some model parameters can vary over multiple orders of magnitude, it is advised to transform them so that they vary across fewer orders of magnitude (e.g., via log-transformation). It is also important that the `age_grid`

is sufficiently fine to capture the variation of `lambda`

and `mu`

over time, since the likelihood is calculated under the assumption that both vary linearly between grid points.

A list with the following elements:

`success` |
Logical, indicating whether model fitting succeeded. If |

`objective_value` |
The maximized fitting objective. Currently, only maximum-likelihood estimation is implemented, and hence this will always be the maximized log-likelihood. |

`objective_name` |
The name of the objective that was maximized during fitting. Currently, only maximum-likelihood estimation is implemented, and hence this will always be “loglikelihood”. |

`param_fitted` |
Numeric vector of size NP (number of model parameters), listing all fitted or fixed model parameters in their standard order (see details above). If |

`param_guess` |
Numeric vector of size NP, listing guessed or fixed values for all model parameters in their standard order. If |

`loglikelihood` |
The log-likelihood of the fitted model for the given timetree. |

`NFP` |
Integer, number of fitted (i.e., non-fixed) model parameters. |

`AIC` |
The Akaike Information Criterion for the fitted model, defined as |

`BIC` |
The Bayesian information criterion for the fitted model, defined as |

`condition` |
Character, specifying what conditioning was root for the likelihood (e.g. "crown" or "stem"). |

`converged` |
Logical, specifying whether the maximum likelihood was reached after convergence of the optimization algorithm. Note that in some cases the maximum likelihood may have been achieved by an optimization path that did not yet converge (in which case it's advisable to increase |

`Niterations` |
Integer, specifying the number of iterations performed during the optimization path that yielded the maximum likelihood. |

`Nevaluations` |
Integer, specifying the number of likelihood evaluations performed during the optimization path that yielded the maximum likelihood. |

`trial_start_objectives` |
Numeric vector of size |

`trial_objective_values` |
Numeric vector of size |

`trial_Nstart_attempts` |
Integer vector of size |

`trial_Niterations` |
Integer vector of size |

`trial_Nevaluations` |
Integer vector of size |

`standard_errors` |
Numeric vector of size NP, estimated standard error of the parameters, based on parametric bootstrapping. Only returned if |

`medians` |
Numeric vector of size NP, median the estimated parameters across parametric bootstraps. Only returned if |

`CI50lower` |
Numeric vector of size NP, lower bound of the 50% confidence interval (25-75% percentile) for the parameters, based on parametric bootstrapping. Only returned if |

`CI50upper` |
Numeric vector of size NP, upper bound of the 50% confidence interval for the parameters, based on parametric bootstrapping. Only returned if |

`CI95lower` |
Numeric vector of size NP, lower bound of the 95% confidence interval (2.5-97.5% percentile) for the parameters, based on parametric bootstrapping. Only returned if |

`CI95upper` |
Numeric vector of size NP, upper bound of the 95% confidence interval for the parameters, based on parametric bootstrapping. Only returned if |

`consistency` |
Numeric between 0 and 1, estimated consistency of the data with the fitted model (Lindholm et al. 2019). See the documentation of |

Stilianos Louca

H. Morlon, T. L. Parsons, J. B. Plotkin (2011). Reconciling molecular phylogenies with the fossil record. Proceedings of the National Academy of Sciences. 108:16327-16332.

S. Louca et al. (2018). Bacterial diversification through geological time. Nature Ecology & Evolution. 2:1458-1467.

A. Lindholm, D. Zachariah, P. Stoica, T. B. Schoen (2019). Data consistency approach to model validation. IEEE Access. 7:59788-59796.

## Not run: # Generate a random tree with exponentially varying lambda & mu Ntips = 10000 rho = 0.5 # sampling fraction time_grid = seq(from=0, to=100, by=0.01) lambdas = 2*exp(0.1*time_grid) mus = 1.5*exp(0.09*time_grid) tree = generate_random_tree( parameters = list(rarefaction=rho), max_tips = Ntips/rho, coalescent = TRUE, added_rates_times = time_grid, added_birth_rates_pc = lambdas, added_death_rates_pc = mus)$tree root_age = castor::get_tree_span(tree)$max_distance cat(sprintf("Tree has %d tips, spans %g Myr\n",length(tree$tip.label),root_age)) # Define a parametric HBD model, with exponentially varying lambda & mu # Assume that the sampling fraction is known # The model thus has 4 parameters: lambda0, mu0, alpha, beta lambda_function = function(ages,params){ return(params['lambda0']*exp(-params['alpha']*ages)); } mu_function = function(ages,params){ return(params['mu0']*exp(-params['beta']*ages)); } rho_function = function(params){ return(rho) # rho does not depend on any of the parameters } # Define an age grid on which lambda_function & mu_function shall be evaluated # Should be sufficiently fine to capture the variation in lambda & mu age_grid = seq(from=0,to=100,by=0.01) # Perform fitting # Lets suppose extinction rates are already known cat(sprintf("Fitting model to tree..\n")) fit = fit_hbd_model_parametric( tree, param_values = c(lambda0=NA, mu0=3, alpha=NA, beta=-0.09), param_guess = c(1,1,0,0), param_min = c(0,0,-1,-1), param_max = c(10,10,1,1), param_scale = 1, # all params are in the order of 1 lambda = lambda_function, mu = mu_function, rho0 = rho_function, age_grid = age_grid, Ntrials = 10, # perform 10 fitting trials Nthreads = 2, # use 2 CPUs max_model_runtime = 1, # limit model evaluation to 1 second fit_control = list(rel.tol=1e-6)) if(!fit$success){ cat(sprintf("ERROR: Fitting failed: %s\n",fit$error)) }else{ cat(sprintf("Fitting succeeded:\nLoglikelihood=%g\n",fit$loglikelihood)) print(fit) } ## End(Not run)

[Package *castor* version 1.6.7 Index]