ggoutlier_compositeKNN {GGoutlieR} | R Documentation |
GGoutlieR with the composite approach
Description
Performs outlier identification with genetic-space KNN and geographical-space KNN. For details of the outlier detection approach, see the supplementary material of Chang and Schmid 2023 (https://doi.org/10.1101/2023.04.06.535838).
Usage
ggoutlier_compositeKNN(
geo_coord,
gen_coord,
pgdM = NULL,
k_geneticKNN = NULL,
k_geoKNN = NULL,
klim = c(3, 50),
make_fig = FALSE,
plot_dir = ".",
w_geo = 1,
w_genetic = 2,
p_thres = 0.05,
n = 10^6,
s = 100,
min_nn_dist = 1000,
multi_stages = TRUE,
maxIter = NULL,
keep_all_stg_res = FALSE,
warning_minR2 = 0.9,
cpu = 1,
geneticKNN_output = NULL,
geoKNN_output = NULL,
verbose = TRUE
)
Arguments
geo_coord |
matrix or data.frame with two columns. The first column is longitude and the second one is latitude. |
gen_coord |
matrix. A matrix of "coordinates in a genetic space". Users can provide ancestry coefficients or eigenvectors for calculation. If, for example, ancestry coefficients are given, each column corresponds to an ancestral population. Samples are ordered in rows as in 'geo_coord'. |
pgdM |
matrix. A pairwise genetic distance matrix. Users can provide a customized genetic distance matrix with this argument. Samples are ordered in rows and columns as in the rows of 'geo_coord'. The default of 'pgdM' is 'NULL'. If 'pgdM' is not provided, a genetic distance matrix will be calculated from 'gen_coord'. NOTE: the genetic distance matrix is used in the search of KNN and as weights of KNN regression. |
k_geneticKNN |
integer. The number of nearest neighbors in genetic space. The default is 'NULL'. The function searches for the optimal K if 'k_geneticKNN = NULL'. |
k_geoKNN |
integer. The number of nearest neighbors in geographical space. The default is 'NULL'. The function searches for the optimal K if 'k_geoKNN = NULL'. |
klim |
vector. The range of K (lower and upper limit) within which to search for the optimal number of nearest neighbors. The default is 'klim = c(3, 50)'. |
make_fig |
logical. If 'make_fig = TRUE', diagnostic plots for the GGoutlieR analysis are generated and saved to 'plot_dir'. The default is 'FALSE'. |
plot_dir |
string. The path of the directory in which plots are saved. The default is '"."'. |
w_geo |
numeric. The power applied to the distance weights in geographical KNN prediction. The default is 1. |
w_genetic |
numeric. The power applied to the distance weights in genetic KNN prediction. The default is 2. |
p_thres |
numeric. The significance level used to identify outliers. The default is 0.05. |
n |
numeric. The number of random samples to draw from the null distribution for making a graph. The default is 10^6. |
s |
integer. A scalar used to rescale geographical distances. The default 's = 100' scales the distance to a unit of 0.1 kilometer (100 meters). |
min_nn_dist |
numeric. The minimal geographical distance for searching KNNs. Neighbors of a focal sample that lie within this distance are excluded from the KNN search. The default is 1000. |
multi_stages |
logical. A multi-stage test is performed if 'multi_stages = TRUE'. The default is 'TRUE'. |
maxIter |
numeric. The maximum number of iterations of the multi-stage KNN test. The default is 'NULL'. |
keep_all_stg_res |
logical. Results from all iterations of the multi-stage test are retained if 'keep_all_stg_res = TRUE'. The default is 'FALSE'. |
warning_minR2 |
numeric. The prediction accuracy of KNN is evaluated as R^2 to assess violations of the isolation-by-distance expectation. If any R^2 is larger than 'warning_minR2', a warning message is reported at the end of the analysis. The default is 0.9. |
cpu |
integer. Number of CPUs to use for searching the optimal K. |
geneticKNN_output |
output of 'ggoutlier_geneticKNN'. Supply this argument if 'ggoutlier_geneticKNN' has already been run in advance. The default is 'NULL'. |
geoKNN_output |
output of 'ggoutlier_geoKNN'. Supply this argument if 'ggoutlier_geoKNN' has already been run in advance. The default is 'NULL'. |
verbose |
logical. If 'verbose = FALSE', printed messages are suppressed. The default is 'TRUE'. |
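To illustrate how a power parameter such as 'w_geo' or 'w_genetic' can shape a distance-weighted KNN prediction, the base-R sketch below weights each neighbor by 1 / d^w. Both the weighting form and the helper 'idw_predict' are assumptions made for illustration; they are not taken from the package's internal code.

```r
# Hypothetical helper (not part of GGoutlieR): inverse-distance-weighted
# average of neighbor values, with weights proportional to 1 / d^w.
idw_predict <- function(values, d, w = 1) {
  stopifnot(nrow(values) == length(d), all(d > 0))
  wt <- 1 / d^w          # a larger w down-weights distant neighbors more strongly
  wt <- wt / sum(wt)     # normalize the weights so they sum to 1
  colSums(values * wt)   # weighted mean of each column (row i scaled by wt[i])
}

# Three neighbors (longitude, latitude) and their distances to a focal sample
nb <- matrix(c(10, 50,
               12, 52,
               20, 60), ncol = 2, byrow = TRUE)
d <- c(1, 2, 10)

pred_w1 <- idw_predict(nb, d, w = 1)
pred_w2 <- idw_predict(nb, d, w = 2)
```

Under this assumed weighting, 'w = 0' reduces to a plain average of the neighbors, and increasing 'w' pulls the prediction toward the nearest neighbor.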
Value
A nested 'list' containing two subsidiary lists, '"geneticKNN_result"' and '"geoKNN_result"'. Each subsidiary list includes five items: 'statistics' is a 'data.frame' consisting of the D_geography ("Dgeo") or D_genetics ("Dg") values, p values, and a column of logical values showing whether a sample is an outlier. 'threshold' is a 'data.frame' recording the significance threshold. 'gamma_parameter' is a vector recording the parameters of the heuristic Gamma distribution. 'knn_index' and 'knn_name' are each a 'data.frame' recording the K nearest neighbors of each sample. The subsidiary list 'geneticKNN_result' has an additional item called '"scalar"', which records the value of the geographical distance scalar used in the computation.
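Examples
A minimal usage sketch, assuming GGoutlieR is installed. The inputs are simulated purely for illustration (random coordinates and random ancestry-like coefficients, so no meaningful outliers are expected), and fixed K values are passed so that the search for the optimal K is skipped. Only the arguments and result items documented above are used.

```r
set.seed(1)
n_samples <- 60

# Simulated inputs (hypothetical data, for illustration only)
geo_coord <- data.frame(
  x = runif(n_samples, 5, 15),    # first column: longitude
  y = runif(n_samples, 45, 55)    # second column: latitude
)
gen_coord <- matrix(runif(n_samples * 3), ncol = 3)
gen_coord <- gen_coord / rowSums(gen_coord)  # rows sum to 1, like ancestry coefficients

# Use the same sample names, in the same order, for both inputs
sample_names <- paste0("sample", seq_len(n_samples))
rownames(geo_coord) <- sample_names
rownames(gen_coord) <- sample_names

# Run the analysis only if the package is available
if (requireNamespace("GGoutlieR", quietly = TRUE)) {
  res <- GGoutlieR::ggoutlier_compositeKNN(
    geo_coord = geo_coord,
    gen_coord = gen_coord,
    k_geneticKNN = 5,
    k_geoKNN = 5,
    make_fig = FALSE,
    verbose = FALSE
  )
  # Per-sample statistics, p values and outlier flags for each KNN approach
  print(head(res$geneticKNN_result$statistics))
  print(head(res$geoKNN_result$statistics))
}
```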