phyreg {phyreg} | R Documentation |
Performs a phylogenetic regression
Description
If this documentation looks daunting, look first at the examples below!
Takes a control formula, some test terms, a dataframe, a phylogeny, and optionally a set of node heights, and performs a test of H0 (control terms only) vs HA (control plus test terms). The user can specify a value of rho (a parameter that determines whether branch lengths are bigger near the root or near the tips), or ask that it be fitted by maximum likelihood. The primary outputs are a p-value and F-ratio, and parameter estimates. Much more information is available, much of it needing to be interpreted very cautiously.
Usage
phyreg(control, test, data, subset, phydata, taxmatrix, heightsdata,
rho = -1, lorho = 0.3, hirho = 0.6, errrho = 0.02, minrho = 1e-04,
tolerance = 1e-06, oppf = 5, opdf = 0, parmx = 0, parmxz = 0,
opfunccall = 0, addDF = 0, linputs = FALSE, sinputs = FALSE, means = FALSE,
lmshortx = FALSE, lmshortxz = FALSE, lmlongx = FALSE, lmlongxz = FALSE,
hinput = FALSE, paper = FALSE, dfwarning = TRUE, oprho = FALSE, reset = FALSE)
Arguments
control |
The null hypothesis model, which should be a formula with a response variable. If there are no variables to control for, specify |
test |
The test terms. To control for |
data |
A mandatory dataframe containing the variables in the model formulae |
subset |
a vector of the same length as the data, containing 0/1 for exclusion or inclusion of a species. If unspecified, the internal default is to include all species. To return to being unspecified, say |
phydata |
a vector with the phylogeny in the internal package format (containing for each non-root node the ID number of its parent). Either |
taxmatrix |
a dataframe with a set of taxonomic vectors to specify the working phylogeny. There should be no species column, and the others should start with low levels (e.g. genus) and end with the highest (perhaps order). If |
heightsdata |
If not specified, the height of nodes will be calculated using the default "Figure 2" method of Grafen (1989). If a height is specified for each node, then the vector will be one element longer than |
rho |
If zero or negative, |
lorho |
The starting lower bound for the search for rho |
hirho |
The starting upper bound for the search for rho |
errrho |
The search for rho will stop when it is known to lie in an interval of length |
minrho |
The search for rho will stop when the whole current search interval lies below |
tolerance |
This tolerance decides whether there is enough variability among the daughters of a higher node to justify putting it into the short regression. This is because the residuals in the long regression on the control variables are used to calculate a contrast; if they are very small, then the contrast is effectively being determined by machine rounding errors, and if they are zero, then all contrasts should also be zero and so the higher node will be irrelevant in a regression through the origin. |
oppf |
If non-zero, the p-value and F-statistic are printed out at each call of |
opdf |
If non-zero, a breakdown of degrees of freedom is printed out at each call of |
parmx |
If non-zero, the parameters in the long regression with the control formula are printed out at each call of |
parmxz |
If non-zero, the parameters in the long regression with the control+test formula are printed out at each call of |
opfunccall |
If non-zero, the function call is printed out |
addDF |
Advised to leave well alone. Included for backwards compatibility with the original GLIM version. It was originally but misguidedly intended to make a correction to the total degrees of freedom for the test in recognition of the fitting of rho. However, the fitting of rho is now regarded as affecting the numerator and denominator SS equally, and so not to require the correction. |
linputs |
If |
sinputs |
If |
means |
If |
lmshortx |
If |
lmshortxz |
If |
lmlongx |
If |
lmlongxz |
If |
hinput |
If |
paper |
If |
dfwarning |
If |
oprho |
If nonzero, details of rho will be printed with |
reset |
If |
Details
One key point for using the program is that each node in the phylogeny has an ID number, with which it is identified. For a species, the number is given by its position in the data frame with the data itself, which is assumed to have species in the same order as in the taxonomic variables data frame, if that is specified, and the same order as in the newick-style phylogeny if you use phyfromnewick
. If phyfromphylo
is used, the code numbers of the species are carried over from the "phylo" object to the internally formatted "phy" vector. These species ID numbers are fixed no matter whether that species or another is excluded by the subset
parameter of phyreg
, or for missing data. Higher nodes have ID numbers too. In every phylogeny, each node's parent has a higher ID number than it does, except the root, which has no parent. The numbers for higher nodes do vary when species are omitted for any reason.
The main feature of phylogenetic data that the Phylogenetic Regression deals with differently from other methods is unrecognised phylogeny. It operates on the principle that each higher node should provide one datapoint to a valid test, and not more, and does so by choosing just one linear contrast across the daughters of each higher node. The contrast coefficients come from the residuals of the long regression on the control model. For a full mathematical justification of this choice, see the appendix to Grafen (1989), and see the paper itself for conceptual explanations. With binary phylogenies, of course, there is no unrecognised phylogeny. A parameter of the branch lengths is fitted automatically, unless the user wishes to impose a value, which allows the strength of phylogenetic association to be make weaker or stronger. Simulation studies in Grafen (1989) show that the method has good properties, and also that ignoring unrecognised phylogeny can create serious problems.
On some occasions, degrees of freedom are lost "for phylogenetic reasons". A whole node may be lost to the final test if the residuals of its daughter nodes are all zero in the long regression. This can happen for various reasons, most often when (i) the response is in fact binary, and so there is no variation in it below a node, or (ii) a categorical variable has so many values restricted to one part of the tree that a subset of its parameter values can adjust to render all the residuals zero in that part of the tree. That is called a node being lost in the denominator. The other possibility is that once the contrasts have been taken across each higher node, the design matrix for the model has lower rank than it did before, which is called losing a degree of freedom in the numerator (it is transferred to the denominator). You can choose to be warned when degrees of freedom are lost for phylogenetic reasons (use dfwarning
=1), or to see a whole breakdown of degrees of freedom including any lost (use opdf=1
). If nodes are lost in the denominator, their ID numbers are stored in the output $shornode
, though note this also contains an initial 0
for programming convenience. These phylogenetic degree of freedom issues need to be handled properly to provide a valid test – see Grafen (1989) for details.
The data for long and short regressions, and the lm.wfit()
output are provided on request, but must be used with great caution. The whole point of the phylogenetic regression is that neither type of regression provides results that can be taken at face value, except the long regression under binary phylogenies. Use at your own risk! See Grafen (1989) for details.
Some simulated data is included to facilitate the examples below – see myd0
Value
H0model |
The control model |
HAmodel |
The control+test model |
spu |
A vector with a 0/1 for each species to indicate whether it was used (1) or not (0), depending on the argument |
nomspuse |
The number of species omitted because of missing values (not counting those already omitted because of the argument |
longrss |
The residual sum of squares in the long regression with the control formula |
missingnodes |
A vector of the ID numbers of nodes omitted for phyogenetic reasons (see Details above). An extra zero is included for programming convenience. |
shornode |
The ID numbers of the higher nodes in the supplied phylogeny that are used in the short regression, so |
details |
contains various numbers needed for output, namely |
means |
The phylogenetically weighted means of the response variable and the design matrix for the control+test model |
parmx |
Parameter estimates from the long regression with the control model |
parmxz |
Parameter estimates from the long regression with the control+test model |
funccall |
The function call with which you invoked |
fullphy |
The full phylogeny for all species. This will just be |
usedphy |
The phylogeny for the species included in the analysis. There is an entry for every species, and each included species has its original ID, and every omitted species has zero. There may well be fewer higher nodes in |
originalIDs |
This tells you, for each node in |
rho |
A list with |
sinputs |
(optional) the inputs to the short regression namely |
linputs |
(optional) the inputs to the long regression namely |
lmlongx |
(optional) The |
lmlongxz |
(optional) The |
lmshortx |
(optional) The |
lmshortxz |
(optional) The |
Author(s)
Alan Grafen, with portions copied as follows.
(1) read.newick
(used internally only) copied from Liam Rewell (see phyfromnewick
for a more detailed acknowledgment)
(2) the definition of ginv
has been copied from MASS (package comes with current R downloads; book is W.N. Venables and B.D. Ripley (2002) Modern Applied Statistics in S. (Fourth Edition), Springer – http://www.stats.ox.ac.uk/pub/MASS4). This avoids having to require
the whole package, though it may mean I have amend it if R and MASS make a simultaneous change in the low-level routines they use. ginv
was copied and pasted from R 3.0.2 on 64-bit MacOS on 24th January 2014. It finds the generalised inverse of a matrix.
(3) code for dealing with model formulae (internal functions merge.formulae.ag
and merge.formulae.test.ag
) was adapted from code of Steven Carlisle Walker obtained from
https://stevencarlislewalker.wordpress.com/2012/08/06/merging-combining-adding-together-two-formula-objects-in-r/ in January 2014.
References
The source paper is Grafen, A. 1989. The phylogenetic regression. Philosophical Transactions of the Royal Society B, 326, 119-157. Some further information is at http://users.ox.ac.uk/~grafen/phylo/index.html, including previous implementations in GLIM and SAS.
See Also
The package helpfile is at phyreg-package
. Functions are inf
, phyfromnewick
, phyfromphylo
, factory_default
, new_default
. For simulated data see myd0
.
Examples
## First get some data
data(myd0,myd1, myd2, myd3, extax)
## Then do our first analysis
phyreg(y~X1,"A",myd0,taxmatrix=extax)
## and test instead for "B", noticing that only the changed parameter need be given
phyreg(test="B")
## and we do more complicated analysis involving an interaction with an existing term
phyreg(y~X1+X2,"A*X1")
## Now we choose to see the output relating to rho and to how the degrees of freedom are determined,
## and we also wish to see the means for each variable, and the parameters from the long
## regression on control+test variables
phyreg(oprho=6, opdf=1, means=1,parmxz=1)
## To illustrate inf, we store the results of an analysis in m1
m1<-phyreg(y~A,"X1+X2")
## Note we still get the extra output from the previous call, because those parameters
## too are remembered within a session. But we can see it again, whether or not we
## saw it the first time, with inf. inf reminds you if you forget quite how to use it
inf()
inf(m1)
inf(m1,oppf=3)
inf(m1,oppf=7, oprho=5)
inf(m1, oppf=5, "means", "parmx")
inf(m1,"sinputs","lmshortx")
## The final call asks for things m1 doesn't have because it wasn't stored at the time. Now
## we turn to changing the default parameters with new_default and factory_default. The help
## pages for those functions explain that only minor parameters (those affecting output and
## storage, and those affecting rho), and not the models or the data-bearing parameters, have
## default values.
new_default()
## This call takes the most recent parameters to phyreg and makes them the default, which
## in the first instance changes nothing. But if we later call reset=TRUE as an argument
## of phyreg, it is the values at the time of calling new_default() that will be returned
## to. In this instance, we would automatically get output on rho and degrees
## of freedom, and the means and xz parameters printed).
##
## To see this in operation, we would change those parameters...
phyreg(oprho=0,opdf=0,means=0,parmxz=0)
## ... and then in the next call use reset=1. That will restore the defaults, which we changed
## with new_default. To change the defaults back to the values that shipped with the
## package,
factory_default()
## does the job.
##
## Finally, the phylogeny has so far been determined by a data frame of taxonomic variables in the
## argument taxmatrix. If we have the phylogeny available in newick style, we can convert to
## the internal format, and then use that instead. Fortunately, one is provided. Note it is
## good form to unset the other method of specifying a phylogeny (which is being remembered by
## the package) with taxmatrix=NULL
data(newickstr)
z<-phyfromnewick(text=newickstr)
phyreg(phydata=z$phy,taxmatrix=NULL)
## ... and if branch lengths were supplied, and we trust them, we can
phyreg(y~X1, "A", phydata=z$phy, heightsdata=z$hts)
## Similarly with a phylogeny in phylo format. We obtain one of those by using the "phylo"
## value from phyfromnewick, and use an intermediate variable"phyloobject" to show how
## to use one if you have one
phyloobject<-z$orphylo
phyconverted<-phyfromphylo(phyloobject)
phyreg(phydata=phyconverted$phy)
phyreg(phydata=phyconverted$phy, heightsdata=phyconverted$hts)
## The results should all be the same because it's the same phylogeny represented in three
## different ways, with the same heights.
##
## Enjoy!