n_eff {popkin} | R Documentation |
Calculates the effective sample size of the data
Description
The effective sample size n_eff
is the equivalent number of independent haplotypes that gives the same variance as that observed under the given population.
The variance in question is for the weighted sample mean ancestral allele frequency estimator.
It follows that n_eff
equals the inverse of the weighted mean kinship.
If max = TRUE
, a calculation is performed that implicitly uses optimal weights which maximize n_eff
, which equals the sum of the elements of the inverse kinship matrix.
However, if nonneg = TRUE
and if the above solution has negative weights (common), optimal non-negative weights are found instead (there are three algorithms available, see algo
).
If max = FALSE
, then the input weights are used in this calculation, and if weights are NULL
, uniform weights are used.
Usage
n_eff(
kinship,
max = TRUE,
weights = NULL,
nonneg = TRUE,
algo = c("gradient", "newton", "heuristic"),
tol = 1e-10
)
Arguments
kinship |
An |
max |
If |
weights |
Weights for individuals (optional).
If |
nonneg |
If |
algo |
Algorithm for finding optimal non-negative weights (applicable only if |
tol |
Tolerance parameter for "gradient" and "newton" algorithms. The algorithms converge when the norm of the step vector is smaller than this tolerance value. |
Details
The maximum n_eff
possible is 2 * n
, where n
is the number of individuals; this value is attained only when all haplotypes are independent (a completely unstructured population in Hardy-Weinberg equilibrium).
The minimum n_eff
possible is 1, which is attained in an extremely structured population with FST of 1, where every individual has exactly the same haplotype at every locus (no heterozygotes).
Moreover, for K
extremely-differentiated subpopulations (FST = 1 per subpopulation) n_eff = K
.
In this way, n_eff
is smaller than the ideal value of 2 * n
depending on the amount of kinship (covariance) in the data.
Occasionally, depending on the quality of the input kinship matrix, the estimated n_eff
may be outside the theoretical [1
, 2*n
] range, in which case the return value is set to the closest boundary value.
The quality of the results depends on the success of matrix inversion (which for numerical reasons may incorrectly contain negative eigenvalues, for example) or of the gradient optimization.
Value
A list containing n_eff
and weights
(optimal weights if max = TRUE
, input weights otherwise).
Examples
# Get n_eff from a genotype matrix
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow=3, byrow=TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# estimate the kinship matrix "kinship" from the genotypes "X"!
kinship <- popkin(X, subpops) # calculate kinship from X and optional subpop labels
weights <- weights_subpops(subpops) # can weigh individuals so subpopulations are balanced
# use kinship matrix to calculate n_eff
# default mode returns maximum n_eff possible across all non-negative weights that sum to one
# also returns the weights that were optimal
obj <- n_eff(kinship)
n_eff_max <- obj$n_eff
w_max <- obj$weights
# version that uses weights provided
obj <- n_eff(kinship, max = FALSE, weights = weights)
n_eff_w <- obj$n_eff
w <- obj$weights # returns input weights renormalized for good measure
# no (or NULL) weights implies uniform weights
obj <- n_eff(kinship, max = FALSE)
n_eff_u <- obj$n_eff
w <- obj$weights # uniform weights