gl.impute {dartR.base} | R Documentation |
Imputes missing data
Description
This function imputes genotypes on a population-by-population basis, where populations can be considered panmictic, or imputes the state for presence-absence data.
Usage
gl.impute(
x,
method = "neighbour",
fill.residual = TRUE,
parallel = FALSE,
verbose = NULL
)
Arguments
x |
Name of the genlight object containing the SNP or presence-absence data [required]. |
method |
Imputation method, either "frequency" or "HW" or "neighbour" or "random" [default "neighbour"]. |
fill.residual |
Whether any missing values remaining after imputation should be set to 0, 1 or 2 at random, weighted by the global allele frequencies at that locus [default TRUE]. |
parallel |
A logical indicating whether multiple cores, if available, should be used for the computations (TRUE) or not (FALSE); requires the package parallel to be installed [default FALSE]. |
verbose |
Verbosity: 0, silent or fatal errors; 1, begin and end; 2, progress log; 3, progress and results summary; 5, full report [default 2 or as specified using gl.set.verbosity]. |
Details
We recommend that imputation be performed on sampling locations, before any aggregation. Imputation is achieved by replacing missing values using one of four methods:
If "frequency", genotypes scored as missing at a locus in an individual are imputed using the average allele frequencies at that locus in the population from which the individual was drawn.
If "HW", genotypes scored as missing at a locus in an individual are imputed by sampling at random assuming Hardy-Weinberg equilibrium. Applies only to genotype data.
If "neighbour", substitute the missing values for the focal individual with the values taken from the nearest neighbour. Repeat with next nearest and so on until all missing values are replaced.
If "random", missing data are substituted by random values (0, 1 or 2).
For the "neighbour" method, the nearest neighbour is the individual with the smallest Euclidean distance to the focal individual across the whole dataset. The advantage of this approach is that it works regardless of how many individuals are in the population to which the focal individual belongs, and the displacement of the imputed individual is haphazard, as opposed to:
(a) Drawing the individual toward the population centroid ("HW" and "frequency").
(b) Drawing the individual toward the global centroid (glPCA).
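The nearest-neighbour idea described above can be illustrated with a minimal base R sketch. The helper impute_neighbour_sketch is hypothetical and is not the dartR.base implementation; it assumes a plain numeric matrix with individuals in rows, loci in columns and genotypes coded 0, 1, 2 or NA, and it assumes that distances are computed with stats::dist on the raw genotype matrix.

```r
# Hypothetical helper for illustration only; not part of dartR.base.
# geno: numeric matrix, individuals in rows, loci in columns, values 0/1/2/NA.
impute_neighbour_sketch <- function(geno) {
  # dist() drops loci where either individual is NA and rescales the sum,
  # so distances stay comparable across pairs with different missingness.
  d <- as.matrix(dist(geno, method = "euclidean"))
  diag(d) <- Inf  # an individual cannot be its own neighbour
  out <- geno
  for (i in seq_len(nrow(geno))) {
    # Visit donors from nearest to farthest until no gaps remain.
    for (k in order(d[i, ])) {
      gaps <- is.na(out[i, ])
      if (!any(gaps)) break
      # Copy only the loci that the donor actually has scored.
      fill <- gaps & !is.na(geno[k, ])
      out[i, fill] <- geno[k, fill]
    }
  }
  out
}

# Toy data: individual "a" is missing the fourth locus.
g <- rbind(a = c(0, 0, 0, NA),
           b = c(0, 0, 0, 2),
           c = c(2, 2, 2, 0))
imputed <- impute_neighbour_sketch(g)
```

In this toy matrix, "b" is closer to "a" than "c" is, so the gap in "a" is filled from "b". Donor values are always read from the original matrix, so imputed values are never passed on to other individuals.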
Note that loci that are missing for all individuals in a population are not imputed with method "frequency" or "HW". Consider using the function gl.filter.allna with by.pop=TRUE to remove them first.
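To show why such loci cannot be imputed from within the population, here is a minimal base R sketch of Hardy-Weinberg sampling for a single population. The helper impute_hw_sketch is hypothetical and is not the dartR.base implementation; it assumes a numeric matrix with individuals in rows, loci in columns and genotypes coded 0, 1, 2 (copies of the alternate allele) or NA.

```r
# Hypothetical helper for illustration only; not part of dartR.base.
# geno: numeric matrix for ONE population, values 0/1/2/NA.
impute_hw_sketch <- function(geno) {
  for (j in seq_len(ncol(geno))) {
    miss <- is.na(geno[, j])
    # Skip complete loci, and loci missing in every individual:
    # with no scored genotypes there is no allele frequency to sample from.
    if (!any(miss) || all(miss)) next
    q <- mean(geno[!miss, j]) / 2  # alternate allele frequency
    p <- 1 - q
    # Draw genotypes 0, 1, 2 with Hardy-Weinberg probabilities.
    geno[miss, j] <- sample(0:2, size = sum(miss), replace = TRUE,
                            prob = c(p^2, 2 * p * q, q^2))
  }
  geno
}

# Toy population: locus 2 is missing in everyone and stays NA.
set.seed(42)
pop <- matrix(c(2, 2, NA, 2,
                NA, NA, NA, NA,
                0, 1, NA, 2), nrow = 4)
pop_imputed <- impute_hw_sketch(pop)
```

The second locus remains entirely NA after the call, which is the situation the note above describes; such values would have to be removed beforehand or handled by a residual fill based on global allele frequencies.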
Value
A genlight object with the missing data imputed.
Author(s)
Custodian: Luis Mijangos (Post to https://groups.google.com/d/forum/dartr)
See Also
Other data manipulation:
gl.define.pop(), gl.drop.ind(), gl.drop.loc(), gl.drop.pop(), gl.edit.recode.pop(), gl.join(), gl.keep.ind(), gl.keep.loc(), gl.keep.pop(), gl.make.recode.ind(), gl.merge.pop(), gl.reassign.pop(), gl.recode.ind(), gl.recode.pop(), gl.rename.pop(), gl.sample(), gl.sim.genotypes(), gl.sort(), gl.subsample.ind(), gl.subsample.loc()
Examples
require("dartR.data")
# SNP genotype data
gl <- gl.filter.callrate(platypus.gl,threshold=0.95)
gl <- gl.filter.allna(gl)
gl <- gl.impute(gl,method="neighbour")
# Sequence Tag presence-absence data
gs <- gl.filter.callrate(testset.gs,threshold=0.95)
gs <- gl.filter.allna(gs)
gs <- gl.impute(gs, method="neighbour")
# Random imputation of SNP genotype data
gl <- gl.impute(platypus.gl, method = "random")