ggoutlier_geneticKNN {GGoutlieR}    R Documentation

GGoutlieR with the genetic KNN approach

Description

Identifies samples that are geographically remote from their K genetically nearest neighbors (genetic KNN). For details of the outlier detection approach, see the supplementary material of Chang and Schmid 2023 (https://doi.org/10.1101/2023.04.06.535838).

Usage

ggoutlier_geneticKNN(
  geo_coord,
  gen_coord = NULL,
  pgdM = NULL,
  k = NULL,
  klim = c(3, 50),
  make_fig = FALSE,
  plot_dir = ".",
  w_power = 2,
  p_thres = 0.05,
  n = 10^6,
  s = 100,
  multi_stages = TRUE,
  maxIter = NULL,
  keep_all_stg_res = FALSE,
  warning_minR2 = 0.9,
  cpu = 1,
  verbose = TRUE
)

Arguments

geo_coord

matrix or data.frame with two columns. The first column is longitude and the second one is latitude.

gen_coord

matrix. A matrix of "coordinates in a genetic space". Users can provide ancestry coefficients or eigenvectors for calculation. If, for example, ancestry coefficients are given, each column corresponds to an ancestral population. Samples are ordered in rows as in 'geo_coord'.

pgdM

matrix. A pairwise genetic distance matrix. Users can provide a customized genetic distance matrix with this argument. Samples are ordered in rows and columns as in the rows of 'geo_coord'. The default of 'pgdM' is 'NULL'. If 'pgdM' is not provided, a genetic distance matrix will be calculated from 'gen_coord'.

k

integer. Number of nearest neighbors.

klim

vector. The range of K values to search for the optimal number of nearest neighbors. The default is 'klim = c(3, 50)'.

make_fig

logical. If 'make_fig = TRUE', diagnostic plots for the GGoutlieR analysis are generated and saved to 'plot_dir'. The default is 'FALSE'.

plot_dir

string. The path of the directory in which plots are saved. The default is '.' (the current working directory).

w_power

numeric. The power applied to the distance weights in genetic KNN prediction. The default is 2.

p_thres

numeric. The significance level. The default is 0.05.

n

numeric. The number of random samples to draw from the null distribution for making a graph. The default is 10^6.

s

integer. A scaling factor (scalar) for geographical distance. The default 's = 100' scales the distance to units of 0.1 kilometer.

multi_stages

logical. If 'TRUE' (the default), a multi-stage test is performed.

maxIter

numeric. The maximum number of iterations of the multi-stage KNN test.

keep_all_stg_res

logical. If 'TRUE', results from all iterations of the multi-stage test are retained. The default is 'FALSE'.

warning_minR2

numeric. The prediction accuracy of KNN is evaluated as R^2 to assess violations of the isolation-by-distance expectation. If any R^2 is smaller than 'warning_minR2', a warning message is reported at the end of the analysis. The default is 0.9.

cpu

integer. Number of CPUs to use for searching the optimal K.

verbose

logical. If 'verbose = FALSE', printed messages are suppressed. The default is 'TRUE'.
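
To make the roles of 'gen_coord', 'pgdM', 'k', 'w_power' and 's' more concrete, the following is a minimal sketch of a genetic KNN computation on made-up data. It is not the package's implementation. It assumes Euclidean distances in the genetic space, inverse-distance weights raised to 'w_power', great-circle distances in meters, and division of those distances by 's'. The helper 'haversine_m' and all data values are hypothetical.

```r
# Hypothetical inputs: 7 samples, rows in the same order in both objects.
# Samples 1-3 and 4-6 form two groups; sample 7 is genetically closest to
# the first group but is located near the second one.
geo_coord <- data.frame(
  lon = c(10.0, 10.5, 11.0, 30.0, 30.5, 31.0, 30.2),
  lat = c(50.0, 50.5, 51.0, 40.0, 40.5, 41.0, 40.2)
)
gen_coord <- matrix(
  c(0.95, 0.05,
    0.90, 0.10,
    0.85, 0.15,
    0.10, 0.90,
    0.05, 0.95,
    0.15, 0.85,
    0.70, 0.30),
  ncol = 2, byrow = TRUE
)
k <- 2
w_power <- 2
s <- 100

# Assumption: Euclidean distance in the genetic space stands in for 'pgdM'
pgdM <- as.matrix(dist(gen_coord))

# Great-circle distance in meters (haversine, mean Earth radius assumed)
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

d_geo <- vapply(seq_len(nrow(geo_coord)), function(i) {
  d <- pgdM[i, ]
  d[i] <- Inf                          # a sample is not its own neighbor
  nn <- order(d)[seq_len(k)]           # k genetically nearest neighbors
  w <- 1 / d[nn]^w_power               # assumed inverse-distance weights
  w <- w / sum(w)                      # normalize the weights to sum to 1
  # Weighted mean location of the neighbors (row j is multiplied by w[j])
  pred <- colSums(as.matrix(geo_coord[nn, ]) * w)
  # Distance between observed and predicted location, scaled by 's'
  haversine_m(geo_coord$lon[i], geo_coord$lat[i],
              pred[["lon"]], pred[["lat"]]) / s
}, numeric(1))

which.max(d_geo)  # the sample farthest from its genetic neighbors
```

In this toy data set the displaced sample (row 7) receives by far the largest value. Note that the weights in this sketch are undefined when two samples have a genetic distance of exactly zero.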

Value

A 'list' with six items. 'statistics' is a 'data.frame' containing the D geography ("Dgeo") values, p values, and a column of logical values indicating whether each sample is an outlier. 'threshold' is a 'data.frame' recording the significance threshold. 'gamma_parameter' is a vector recording the parameters of the heuristic Gamma distribution. 'knn_index' and 'knn_name' are each a 'data.frame' recording the K nearest neighbors of each sample. 'scalar' is the geographical distance scalar used in the computation.
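
As a rough illustration of how p values, the significance threshold and the Gamma parameters may relate, the sketch below computes upper-tail probabilities from a Gamma distribution with base R. It assumes a shape/rate parameterization and an upper-tail test; the values in 'd_geo' and 'gamma_par' are made up, and the object names are hypothetical rather than the names used in the returned list.

```r
# Hypothetical test statistics and Gamma parameters
d_geo <- c(0.5, 1.2, 2.0, 9.0)
gamma_par <- c(shape = 2, rate = 1)
p_thres <- 0.05

# Assumption: large values are unusual, so the upper tail gives the p value
p_value <- pgamma(d_geo,
                  shape = gamma_par[["shape"]],
                  rate = gamma_par[["rate"]],
                  lower.tail = FALSE)

# Samples with p values below the significance level are flagged
is_outlier <- p_value < p_thres

# Equivalent threshold on the scale of the statistic
threshold <- qgamma(1 - p_thres,
                    shape = gamma_par[["shape"]],
                    rate = gamma_par[["rate"]])

data.frame(d_geo, p_value, is_outlier)
```

Under these assumptions, flagging by 'p_value < p_thres' and flagging by 'd_geo > threshold' select the same samples.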


[Package GGoutlieR version 1.0.2 Index]