R: Check if a birth-death model adequately explains a timetree.

model_adequacy_hbd {castor}

R Documentation

Check if a birth-death model adequately explains a timetree.

Description

Given a rooted ultrametric timetree and a homogenous birth-death model, check if the model adequately explains various aspects of the tree, such as the branch length and node age distributions and other test statistics. The function uses bootstrapping to simulate multiple hypothetical trees according to the model and then compares the distribution of those trees to the original tree. This function may be used to quantify the "goodness of fit" of a birth-death model to a timetree.

Usage

model_adequacy_hbd( tree,
                    models,
                    splines_degree  = 1,
                    extrapolate     = FALSE,
                    Nbootstraps     = 1000,
                    Nthreads        = 1)

Arguments

`tree`	A rooted ultrametric timetree of class "phylo".
`models`	Either a single HBD model or a list of HBD models, specifying the pool of models from which to randomly draw bootstraps. Every model should itself be a named list with some or all of the following elements: `ages`: Numeric vector, specifying discrete ages (times before present) in ascending order, on which the pulled speciation rate will be specified. Age increases from tips to root; the youngest tip in the input tree has age 0. The age grid must cover the present-day (age 0) and the root `PSR`: Numeric vector of size NG, listing the pulled speciation rate (PSR) of the HBD model at the corresponding `ages`. Between grid points, the PSR is assumed to either be constant (if `splines_degree=0`), or linearly (if `splines_degree=1`) or quadratically (if `splines_degree=2`) or cubically (if `splines_degree=3`). To calculate the PSR of an HBD model based on the speciation and extinction rate, see `simulate_deterministic_hbd`.
`splines_degree`	Integer, one of 0, 1, 2 or 3, specifying the polynomial degree of the PSR between age-grid points. For example, `splines_degree=0` means piecewise constant, `splines_degree=1` means piecewise linear and so on.
`extrapolate`	Logical, specifying whether to extrapolate the model variables `\lambda`, `\mu`, `\psi` and `\kappa` (as constants) beyond the provided age grid all the way to `stem_age` and `end_age` if needed.
`Nbootstraps`	Integer, the number of bootstraps (simulations) to perform for calculating statistical significances. A larger number will increase the accuracy of estimated statistical significances.
`Nthreads`	Integer, number of parallel threads to use for bootstrapping. Note that on Windows machines this option is ignored.

Details

In addition to model selection, the adequacy of any chosen model should also be assessed in absolute terms, i.e. not just relative to other competing models (after all, all considered models might be bad). This function essentially determines how probable it is for hypothetical trees generated by a candidate model to resemble the tree at hand, in terms of various test statistics (such as the historically popular "gamma" statistic, or the Colless tree imbalance). In particular, the function uses a Kolmogorov-Smirnov test to check whether the probability distributions of edge lengths and node ages in the tree resemble those expected under the model. All statistical significances are calculated using bootstrapping, i.e. by simulating trees from the provided model with the same number of tips and the same root age as the original tree.

Note that even if an HBD model appears to adequately explain a given timetree, this does not mean that the model even approximately resembles the true diversification history (i.e., the true speciation and extinction rates) that generated the tree (Louca and Pennell 2020). Hence, it is generally more appropriate to say that a given model "congruence class" (or PSR) rather than a specific model (speciation rate, extinction rate, and sampling fraction) explains the tree.

This function requires that the HBD model (or more precisely, its congruence class) be defined in terms of the PSR. If your model is defined in terms of speciation/extinction rates and a sampling fraction, or if your model's congruence class is defined in terms of the pulled diversification rate (PDR) and the product \rho\lambda_o, then you can use simulate_deterministic_hbd to first calculate the corresponding PSR.

Value

A named list with the following elements:

`success`	Logical, indicating whether the model evaluation was successful. If `FALSE`, then an additional return variable, `error`, will contain a description of the error; in that case all other return variables may be undefined. Note that `success` does not say whether the model explains the tree, but rather whether the computation was performed without errors.
`Nbootstraps`	Integer, the number of bootstraps used.
`tree_gamma`	Numeric, gamma statistic (Pybus and Harvey 2000) of the original tree.
`bootstrap_mean_gamma`	Numeric, mean gamma statistic across all bootstrap trees.
`bootstrap_std_gamma`	Numeric, standard deviation of the gamma statistic across all bootstrap trees.
`Pgamma`	Numeric, two-sided statistical significance of the tree's gamma statistic under the provided null model, i.e. the probability that `abs(bootstrap_mean_gamma-tree_gamma)` would be as large as observed.
`RESgamma`	Numeric, relative effect size of the tree's gamma statistic compared to the provided null model, i.e. `(tree_gamma-bootstrap_mean_gamma)/abs(bootstrap_mean_gamma)`.
`SESgamma`	Numeric, standardized effect size of the tree's gamma statistic compared to the provided null model, i.e. `(tree_gamma-bootstrap_mean_gamma)/bootstrap_std_gamma`.
`tree_Colless`	Numeric, Colless imbalance statistic (Shao and Sokal, 1990) of the original tree.
`bootstrap_mean_Colless`	Numeric, mean Colless statistic across all bootstrap trees.
`bootstrap_std_Colless`	Numeric, standard deviation of the Colless statistic across all bootstrap trees.
`PColless`	Numeric, two-sided statistical significance of the tree's Colless statistic under the provided null model, i.e. the probability that `abs(bootstrap_mean_Colless-tree_Colless)` would be as large as observed.
`RESColless`	Numeric, relative effect size of the tree's Colless statistic compared to the provided null model, i.e. `(tree_Colless-bootstrap_mean_Colless)/abs(bootstrap_mean_Colless)`.
`SESColless`	Numeric, standardized effect size of the tree's Colless statistic compared to the provided null model, i.e. `(tree_Colless-bootstrap_mean_Colless)/bootstrap_std_Colless`.
`tree_Sackin`	Numeric, Sackin statistic (Sackin, 1972) of the original tree.
`bootstrap_mean_Sackin`	Numeric, mean Sackin statistic across all bootstrap trees.
`bootstrap_std_Sackin`	Numeric, standard deviation of the Sackin statistic across all bootstrap trees.
`PSackin`	Numeric, two-sided statistical significance of the tree's Sackin statistic under the provided null model, i.e. the probability that `abs(bootstrap_mean_Sackin-tree_Sackin)` would be as large as observed.
`RESSackin`	Numeric, relative effect size of the tree's Sackin statistic compared to the provided null model, i.e. `(tree_Sackin-bootstrap_mean_Sackin)/abs(bootstrap_mean_Sackin)`.
`SESSackin`	Numeric, standardized effect size of the tree's Sackin statistic compared to the provided null model, i.e. `(tree_Sackin-bootstrap_mean_Sackin)/bootstrap_std_Sackin`.
`tree_Blum`	Numeric, Blum statistic (Blum and Francois, 2006) of the original tree.
`bootstrap_mean_Blum`	Numeric, mean Blum statistic across all bootstrap trees.
`bootstrap_std_Blum`	Numeric, standard deviation of the Blum statistic across all bootstrap trees.
`PBlum`	Numeric, two-sided statistical significance of the tree's Blum statistic under the provided null model, i.e. the probability that `abs(bootstrap_mean_Blum-tree_Blum)` would be as large as observed.
`RESBlum`	Numeric, relative effect size of the tree's Blum statistic compared to the provided null model, i.e. `(tree_Blum-bootstrap_mean_Blum)/abs(bootstrap_mean_Blum)`.
`SESBlum`	Numeric, standardized effect size of the tree's Blum statistic compared to the provided null model, i.e. `(tree_Blum-bootstrap_mean_Blum)/bootstrap_std_Blum`.
`tree_edgeKS`	Numeric, Kolmogorov-Smirnov (KS) statistic of the original tree's edge lengths, i.e. the estimated maximum difference between the tree's and the model's (estimated) cumulative distribution function of edge lengths.
`bootstrap_mean_edgeKS`	Numeric, mean KS statistic of the bootstrap trees' edge lengths.
`bootstrap_std_edgeKS`	Numeric, standard deviation of the KS statistic of the bootstrap trees' edge lengths.
`PedgeKS`	Numeric, the one-sided statistical significance of the tree's edge-length KS statistic, i.e. the probability that the KS statistic of any tree generated by the model would be larger than the original tree's KS statistic. A low value means that the probability distribution of edge lengths in the original tree differs strongly from that expected based on the model.
`RESedgeKS`	Numeric, relative effect size of the tree's edge-length KS statistic compared to the provided null model, i.e. `(tree_edgeKS-bootstrap_mean_edgeKS)/abs(bootstrap_mean_edgeKS)`.
`SESedgeKS`	Numeric, standardized effect size of the tree's edge-length KS statistic compared to the provided null model, i.e. `(tree_edgeKS-bootstrap_mean_edgeKS)/bootstrap_std_edgeKS`.
`tree_nodeKS`	Numeric, Kolmogorov-Smirnov (KS) statistic of the original tree's node ages (divergence times), i.e. the estimated maximum difference between the tree's and the model's (estimated) cumulative distribution function of node ages.
`bootstrap_mean_nodeKS`	Numeric, mean KS statistic of the bootstrap trees' node ages.
`bootstrap_std_nodeKS`	Numeric, standard deviation of the KS statistic of the bootstrap trees' node ages.
`PnodeKS`	Numeric, the one-sided statistical significance of the tree's node-age KS statistic, i.e. the probability that the KS statistic of any tree generated by the model would be larger than the original tree's KS statistic. A low value means that the probability distribution of node ages in the original tree differs strongly from that expected based on the model.
`RESnodeKS`	Numeric, relative effect size of the tree's node-age KS statistic compared to the provided null model, i.e. `(tree_nodeKS-bootstrap_mean_nodeKS)/abs(bootstrap_mean_nodeKS)`.
`SESnodeKS`	Numeric, standardized effect size of the tree's node-age KS statistic compared to the provided null model, i.e. `(tree_nodeKS-bootstrap_mean_nodeKS)/bootstrap_std_nodeKS`.
`tree_sizeKS`	Numeric, Kolmogorov-Smirnov (KS) statistic of the original tree's node sizes (number of descending tips per node), i.e. the estimated maximum difference between the tree's and the model's (estimated) cumulative distribution function of node sizes.
`bootstrap_mean_sizeKS`	Numeric, mean KS statistic of the bootstrap trees' node sizes.
`bootstrap_std_sizeKS`	Numeric, standard deviation of the KS statistic of the bootstrap trees' node sizes.
`PsizeKS`	Numeric, the one-sided statistical significance of the tree's node-size KS statistic, i.e. the probability that the KS statistic of any tree generated by the model would be larger than the original tree's KS statistic. A low value means that the probability distribution of node sizes in the original tree differs strongly from that expected based on the model.
`RESsizeKS`	Numeric, relative effect size of the tree's node-size KS statistic compared to the provided null model, i.e. `(tree_sizeKS-bootstrap_mean_sizeKS)/abs(bootstrap_mean_sizeKS)`.
`SESsizeKS`	Numeric, standardized effect size of the tree's node-size KS statistic compared to the provided null model, i.e. `(tree_sizeKS-bootstrap_mean_sizeKS)/bootstrap_std_sizeKS`.
`statistical_tests`	Data frame, listing the above statistical test results in a more compact format (one test statistic per row).
`LTT_ages`	Numeric vector, listing ages (time before present) on which the tree's LTT will be defined.
`tree_LTT`	Numeric vector of the same length as `LTT_ages`, listing the number of lineages in the tree at every age in `LTT_ages`.
`bootstrap_LTT_CI`	Named list containing the elements `means`, `medians`, `CI50lower`, `CI50upper`, `CI95lower` and `CI95upper`. Each of these elements is a numeric vector of length equal to `LTT_ages`, listing the mean or a specific percentile of LTTs of bootstrap trees at every age in `LTT_ages`. For example, `bootstrap_LTT_CI$CI95lower[10]` and `bootstrap_LTT_CI$CI95upper[10]` define the lower and upper bound, respectively, of the 95% confidence interval of LTTs generated by the model at age `LTT_ages[10]`.
`fraction_LTT_in_CI95`	Numeric, fraction of the tree's LTT contained within the equal-tailed 95%-confidence interval of the distribution of LTT values predicted by the model. For example, a value of 0.5 means that at half of the time points between the present-day and the root, the tree's LTT is contained with the 95%-CI of predicted LTTs.

Author(s)

Stilianos Louca

References

S. Louca and M. W. Pennell (2020). Extant timetrees are consistent with a myriad of diversification histories. Nature. 580:502-505.

O. G. Pybus and P. H. Harvey (2000). Testing macro-evolutionary models using incomplete molecular phylogenies. Proceedings of the Royal Society of London. Series B: Biological Sciences. 267:2267-2272.

M. J. Sackin (1972). "Good" and "Bad" Phenograms. Systematic Biology. 21:225-226.

K.T. Shao, R. R. Sokal (1990). Tree Balance. Systematic Biology. 39:266-276.

M. G. B. Blum and O. Francois (2006). Which random processes describe the Tree of Life? A large-scale study of phylogenetic tree imbalance. Systematic Biology. 55:685-691.

Examples

# generate a tree
tree = castor::generate_tree_hbd_reverse(Ntips  = 50,
                                         lambda = 1,
                                         mu     = 0.5,
                                         rho    = 1)$trees[[1]]
root_age = castor::get_tree_span(tree)$max_distance

# define & simulate a somewhat different BD model
model = simulate_deterministic_hbd(LTT0         = 50,
                                   oldest_age   = root_age,
                                   lambda       = 1.5,
                                   mu           = 0.5,
                                   rho0         = 1)

# compare the tree to the model
adequacy = model_adequacy_hbd(tree, 
                              models        = model,
                              Nbootstraps   = 100,
                              Nthreads      = 2)
if(!adequacy$success){
    cat(sprintf("Adequacy test failed: %s\n",adequacy$error))
}else{
    print(adequacy$statistical_tests)
}

[Package castor version 1.8.2 Index]