fit.graph {poolfstat}R Documentation

Estimate parameters of an admixture graph

Description

Estimate parameters of an admixture graph

Usage

fit.graph(
  graph.params,
  Q.lambda = 0,
  eps.admix.prop = 1e-06,
  edge.fact = 1000,
  admix.fact = 100,
  compute.ci = F,
  drift.scaling = F,
  outfileprefix = NULL,
  verbose = TRUE
)

Arguments

graph.params

An object of class graph.params containing graph information and relevant Fstats estimates (see the function generate.graph.params)

Q.lambda

A scalar (usually small) to add to the diagonal elements of the error covariance matrix of fstats estimates (may improve numerical stability of its decomposition for large number of populations)

eps.admix.prop

A scalar defining admixture proportion domain (eps.admix.prop vary between eps.admix.prop and 1-eps.admix.prop)

edge.fact

The multiplying factor of edges length in graph representation

admix.fact

The multiplying factor of admixture proportion in graph representation

compute.ci

Derive 95% Confidence Intervals for the parameters of the admixture graph (edge lengths and admixture rates)

drift.scaling

If TRUE scale edge lengths in drift units (require estimates of leave heterozygosities)

outfileprefix

The prefix of the dot file that will represent the graph (with extension ".dot"). If NULL, no graph file generated

verbose

If TRUE extra information is printed on the terminal

Details

Let ff represent the n-length vector of basis target (i.e., observed) F2 and F3 statistics and g(e;a)=X(a)eg(e;a)=X(a)*e the vector of their expected values given the vector of graph edges lengths ee and the incidence matrix X(a)X(a) that depends on the structure of the graph and the admixture rates aa (if there is no admixture in the graph, X(a)X(a) only contains 0 or 1). The function attempts to find the ee and aa graph parameter values that minimize a cost (score of the model) defined as S(e;a)=(fg(e;a))Q1(fg(e;a))S(e;a)=(f-g(e;a))'Q^{-1}(f-g(e;a)). Assuming f N(g(e;a),Q)f~N(g(e;a),Q) (i.e., the observed f-statistics vector is multivariate normal distributed around an expected g vector specified by the admixture graph and a covariance structure empirically estimated), S=2log(L)KS=-2log(L) - K where LL is the likelihood of the fitted graph and K=nlog(2pi)+log(Q)K=n*log(2*pi)+log(|Q|). Also, for model comparison purpose, a standard BICBIC is then derived from SS as BIC=S+plog(n)KBIC= S + p*log(n) - K (p being the number of graph parameters, i.e., edge lengths and admixture rates). As mentioned by Patterson et al. (2012), the score S(e;a)S(e;a) is quadratic in edge lengths ee given aa. The function uses the Lawson-Hanson non-negative linear least squares algorithm implemented in the nnls function (package nnls) to estimate ee (subject to the constraint of positive edge lengths) by finding the vector ee that minimize S(e;a)=(fX(a)e)Q1(fX(a)e)=GfGX(a)eS(e;a)=(f-X(a)*e)'Q^{-1}(f-X(a)*e)=||G*f-G*X(a)*e|| (where GG results from the Cholesky decomposition of Q1Q^{-1}, i.e., Q1=GGQ^{-1}=G'G). Note that the *Q.lambda* argument may be used to add a small constant (e.g., 1e41e-4) to the diagonal elements of QQ to avoid numerical problems (see Patterson et al., 2012). Yet *Q.lambda* is always disregarded when computing the final score SS and BICBIC. Minimization of S(e;a)S(e;a) is thus reduced to the identification of the admixture rates (aa vector) which is performed using the L-BFGS-B method (i.e., Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints) implemented in the optim function (stats package). The *eps.admix.prop* argument allows specifying the lower and upper bound of the admixture rates to *eps.admix.prop* and *1-eps.admix.prop* respectively. Scaling of the edges lengths in drift units (i.e., in units of t/2Nt/2N where tt is time in generations and NN is the effective population size) is performed as described in Lipson et al. (MBE, 2013) by dividing the estimated edges lengths by half the estimated heterozygosity of their parental nodes (using the property hp=hc+2e(C,P)hp=hc+2e(C,P) where hphp and hchc are the heterozygosities of a child C and its parent P node and e(C,P)e(C,P) is the estimated length of the branch relating C and P. Finally, if compute.ci=TRUE, a (rough) 95%95\% confidence intervals is computed using a bisection method (with a 1e41e-4 precision) for each parameters in turn (all others being set to their estimated value). Note that 95%95\% CI are here defined as the set of values associated to a score SS such that Sopt<S<Sopt+3.84Sopt<S<Sopt+3.84 (where SoptSopt is the optimized score), i.e., with a likelihood-ratio test statistic with respect to the fitted values <3.84<3.84 (the 95%95\% threshold of a one ddl Chi-square distribution).

Value

An object of class fitted.graph (see help(fitted.graph) for details)

See Also

To generate a graph.params object, see generate.graph.params. The fitted graph may be plotted directly using plot that calls grViz() function and the resulting fitted fstats may be compared to the estimated ones with compare.fitted.fstats.


[Package poolfstat version 2.2.0 Index]