R: Impute missing genotype data using k nearest neighbors

impute_missing_geno {cape}

R Documentation

Impute missing genotype data using k nearest neighbors

Description

This function uses k nearest neighbors to impute missing genotype data on a per-chromosome basis. If missing genotypes remain after imputation, the user can choose whether to prioritize removing individuals, removing markers, or whichever order removes less data.

Usage

impute_missing_geno(
  data_obj,
  geno_obj = NULL,
  k = 10,
  ind_missing_thresh = 0,
  marker_missing_thresh = 0,
  prioritize = c("fewer", "ind", "marker"),
  max_region_size = NULL,
  min_region_size = NULL,
  run_parallel = FALSE,
  verbose = FALSE,
  n_cores = 2
)

Arguments

`data_obj`	a `Cape` object
`geno_obj`	a genotype object
`k`	The number of nearest neighbors to use to impute missing data. Defaults to 10.
`ind_missing_thresh`	A percentage of acceptable missing data. After imputation if an individual is missing more data than the percent specified, it will be removed.
`marker_missing_thresh`	A percentage of acceptable missing data. After imputation if a marker is missing more data than the percent specified, it will be removed.
`prioritize`	How to prioritize removal of rows and columns with missing data. "ind" = remove individuals with missing data exceeding the threshold before considering markers to remove. "marker" = remove markers with missing data exceeding the threshold before considering individuals to remove. "fewer" = Determine how much data will be removed by prioritizing individuals or markers. Remove data in whichever order removes the least amount of data.
`max_region_size`	maximum number of markers to be used in calculating individual similarity. Defaults to the minimum chromosome size.
`min_region_size`	minimum number of markers to be used in calculating individual similarity. Defaults to the maximum chromosome size.
`run_parallel`	A logical value indicating whether to run the process in parallel.
`verbose`	A logical value indicating whether to print progress to the screen.
`n_cores`	integer number of available CPU cores to use for parallel processing

Details

This function is called by run_cape and runs automatically if a kinship correction is specified and the genotype object contains missing values.
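For intuition about what k-nearest-neighbor imputation does with missing genotype values, here is a minimal base R sketch on a toy matrix. It is not the code cape uses internally: the function name `knn_impute_toy` is hypothetical, and the choice of Euclidean distance, the use of the neighbors' mean as the imputed value, and the absence of any per-chromosome or region windowing are all simplifying assumptions.

```r
# Toy k-nearest-neighbor imputation for a numeric genotype matrix
# (rows = individuals, columns = markers). Illustration only; this is
# an assumed, simplified scheme, not cape's implementation.
knn_impute_toy <- function(geno, k = 2) {
  # Euclidean distances between individuals. dist() skips missing
  # entries pairwise and rescales the sum for the columns it used.
  d <- as.matrix(dist(geno))
  imputed <- geno
  missing_idx <- which(is.na(geno), arr.ind = TRUE)
  for (i in seq_len(nrow(missing_idx))) {
    ind <- missing_idx[i, "row"]
    mrk <- missing_idx[i, "col"]
    # Candidate neighbors: other individuals with an observed value at
    # this marker and a defined distance to this individual.
    candidates <- setdiff(which(!is.na(geno[, mrk]) & !is.na(d[ind, ])), ind)
    if (length(candidates) == 0) next  # nothing to borrow from; stays NA
    n_use <- min(k, length(candidates))
    nearest <- candidates[order(d[ind, candidates])][seq_len(n_use)]
    # Impute with the mean genotype of the nearest neighbors.
    imputed[ind, mrk] <- mean(geno[nearest, mrk])
  }
  imputed
}

# Four individuals, four markers; ind2 is missing marker4.
geno <- rbind(
  ind1 = c(0, 0, 1, 1),
  ind2 = c(0, 0, 1, NA),
  ind3 = c(1, 1, 0, 0),
  ind4 = c(0, 0, 1, 1)
)
colnames(geno) <- paste0("marker", 1:4)

imputed <- knn_impute_toy(geno, k = 2)
imputed["ind2", "marker4"]
```

In this toy data, ind1 and ind4 match ind2 at every observed marker, so they are its two nearest neighbors and supply the imputed value. A value can remain missing when no other individual has data at that marker, which is the situation the thresholds and prioritize argument are meant to handle.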

The prioritize parameter can be a bit confusing. If, after imputation, there is one marker for which all data are missing, it makes sense to remove that one marker rather than all individuals with missing data, since all individuals would be removed. Similarly, if there is one individual with massive amounts of missing data, it makes sense to remove that individual, rather than all markers that individual is missing. We recommend always using the default "fewer" option here unless you know for certain that you want to prioritize individuals or markers for removal.

There is no need to specify max_region_size or min_region_size, but advanced users may want to specify them. There is a trade-off between the time it takes to calculate a distance matrix for a large genotype matrix and the time it takes to slide through the genome imputing markers. If individuals are genotyped very densely, the user may want to specify max_region_size to be smaller than the maximum chromosome size to speed calculation of similarity matrices.

This function does not yet support imputation of covariates.
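The reasoning behind the "fewer" option for prioritize can be sketched with a toy matrix in which one marker is entirely missing. This is only an illustration under stated assumptions: the helper names `drop_inds_first` and `drop_markers_first` are hypothetical, both thresholds are fixed at zero missing data, and "amount of data" is measured as the number of genotype cells retained, which may differ from the criterion cape applies internally.

```r
# Toy comparison of removal orders. Not cape's implementation; the
# "cells retained" criterion is an assumption made for illustration.

# Remove individuals with any missing data, then markers.
drop_inds_first <- function(geno) {
  geno <- geno[rowSums(is.na(geno)) == 0, , drop = FALSE]
  geno[, colSums(is.na(geno)) == 0, drop = FALSE]
}

# Remove markers with any missing data, then individuals.
drop_markers_first <- function(geno) {
  geno <- geno[, colSums(is.na(geno)) == 0, drop = FALSE]
  geno[rowSums(is.na(geno)) == 0, , drop = FALSE]
}

# Five individuals, four markers; marker3 could not be imputed at all.
geno <- matrix(rep(c(0, 1), 10), nrow = 5,
               dimnames = list(paste0("ind", 1:5), paste0("marker", 1:4)))
geno[, "marker3"] <- NA

# Count the genotype cells that survive each removal order.
kept <- c(
  ind    = length(drop_inds_first(geno)),
  marker = length(drop_markers_first(geno))
)
kept

# Choose the order that retains the most data.
best_order <- names(which.max(kept))
best_order
```

Removing individuals first discards every row, because each individual is missing marker3, whereas removing markers first discards a single column and keeps all five individuals.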

Value

This function returns a list that includes both the data_obj and geno_obj. These objects must then be separated again to continue through the cape analysis.

Examples

## Not run: 
combined_obj <- impute_missing_geno(data_obj, geno_obj)
new_data_obj <- combined_obj$data_obj
new_geno_obj <- combined_obj$geno_obj

## End(Not run)

[Package cape version 3.1.2 Index]