fit.graph {poolfstat} | R Documentation |
Estimate parameters of an admixture graph
Description
Estimate parameters of an admixture graph
Usage
fit.graph(
graph.params,
Q.lambda = 0,
eps.admix.prop = 1e-06,
edge.fact = 1000,
admix.fact = 100,
compute.ci = F,
drift.scaling = F,
outfileprefix = NULL,
verbose = TRUE
)
Arguments
graph.params |
An object of class graph.params containing graph information and relevant Fstats estimates (see the function generate.graph.params) |
Q.lambda |
A scalar (usually small) to add to the diagonal elements of the error covariance matrix of fstats estimates (may improve numerical stability of its decomposition for large number of populations) |
eps.admix.prop |
A scalar defining admixture proportion domain (eps.admix.prop vary between eps.admix.prop and 1-eps.admix.prop) |
edge.fact |
The multiplying factor of edges length in graph representation |
admix.fact |
The multiplying factor of admixture proportion in graph representation |
compute.ci |
Derive 95% Confidence Intervals for the parameters of the admixture graph (edge lengths and admixture rates) |
drift.scaling |
If TRUE scale edge lengths in drift units (require estimates of leave heterozygosities) |
outfileprefix |
The prefix of the dot file that will represent the graph (with extension ".dot"). If NULL, no graph file generated |
verbose |
If TRUE extra information is printed on the terminal |
Details
Let f
represent the n-length vector of basis target (i.e., observed) F2 and F3 statistics and g(e;a)=X(a)*e
the vector of their expected values given the vector of graph edges lengths e
and the incidence matrix X(a)
that depends on the structure of the graph and the admixture rates a
(if there is no admixture in the graph, X(a)
only contains 0 or 1). The function attempts to find the e
and a
graph parameter values that minimize a cost (score of the model) defined as S(e;a)=(f-g(e;a))'Q^{-1}(f-g(e;a))
. Assuming f~N(g(e;a),Q)
(i.e., the observed f-statistics vector is multivariate normal distributed around an expected g vector specified by the admixture graph and a covariance structure empirically estimated), S=-2log(L) - K
where L
is the likelihood of the fitted graph and K=n*log(2*pi)+log(|Q|)
. Also, for model comparison purpose, a standard BIC
is then derived from S
as BIC= S + p*log(n) - K
(p being the number of graph parameters, i.e., edge lengths and admixture rates).
As mentioned by Patterson et al. (2012), the score S(e;a)
is quadratic in edge lengths e
given a
. The function uses the Lawson-Hanson non-negative linear least squares algorithm implemented in the nnls function (package nnls) to estimate e
(subject to the constraint of positive edge lengths) by finding the vector e
that minimize S(e;a)=(f-X(a)*e)'Q^{-1}(f-X(a)*e)=||G*f-G*X(a)*e||
(where G
results from the Cholesky decomposition of Q^{-1}
, i.e., Q^{-1}=G'G
). Note that the *Q.lambda* argument may be used to add a small constant (e.g., 1e-4
) to the diagonal elements of Q
to avoid numerical problems (see Patterson et al., 2012). Yet *Q.lambda* is always disregarded when computing the final score S
and BIC
. Minimization of S(e;a)
is thus reduced to the identification of the admixture rates (a
vector) which is performed using the L-BFGS-B method (i.e., Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints) implemented in the optim function (stats package). The *eps.admix.prop* argument allows specifying the lower and upper bound of the admixture rates to *eps.admix.prop* and *1-eps.admix.prop* respectively.
Scaling of the edges lengths in drift units (i.e., in units of t/2N
where t
is time in generations and N
is the effective population size) is performed as described in Lipson et al. (MBE, 2013) by dividing the estimated edges lengths by half the estimated heterozygosity of their parental nodes (using the property hp=hc+2e(C,P)
where hp
and hc
are the heterozygosities of a child C and its parent P node and e(C,P)
is the estimated length of the branch relating C and P.
Finally, if compute.ci=TRUE, a (rough) 95\%
confidence intervals is computed using a bisection method (with a 1e-4
precision) for each parameters in turn (all others being set to their estimated value). Note that 95\%
CI are here defined as the set of values associated to a score S
such that Sopt<S<Sopt+3.84
(where Sopt
is the optimized score), i.e., with a likelihood-ratio test statistic with respect to the fitted values <3.84
(the 95\%
threshold of a one ddl Chi-square distribution).
Value
An object of class fitted.graph (see help(fitted.graph) for details)
See Also
To generate a graph.params object, see generate.graph.params
. The fitted graph may be plotted directly using plot that calls grViz() function and the resulting fitted fstats may be compared to the estimated ones with compare.fitted.fstats
.