random_knn_query {rnndescent}    R Documentation
Query nearest neighbors by random selection
Description
Run queries against reference data to return randomly selected neighbors.
This is not a useful query method on its own, but can be used with other
methods which require initialization.
Usage
random_knn_query(
  query,
  reference,
  k,
  metric = "euclidean",
  use_alt_metric = TRUE,
  order_by_distance = TRUE,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
Arguments
query
Matrix of n query items, with observations in the rows and features in the
columns. Optionally, the data may be passed with the observations in the
columns, by setting obs = "C", which should be more efficient. The
reference data must be passed in the same orientation as query. Possible
formats are base::data.frame(), base::matrix() or Matrix::sparseMatrix().
Sparse matrices should be in dgCMatrix format. Data frames will be
converted to numerical matrix format internally, so if your data columns
are logical and intended to be used with the specialized binary metrics,
you should convert the data to a logical matrix first (otherwise you will
get the slower dense numerical version). See the Examples section for
sketches of passing sparse and logical matrix input.
reference
Matrix of m reference items, with observations in the rows and features in
the columns. The nearest neighbors to the queries are randomly selected
from this data. Optionally, the data may be passed with the observations in
the columns, by setting obs = "C", which should be more efficient. The
query data must be passed in the same orientation and format as reference.
Possible formats are base::data.frame(), base::matrix() or
Matrix::sparseMatrix(). Sparse matrices should be in dgCMatrix format.
k
Number of nearest neighbors to return.
metric
Type of distance calculation to use. One of:
- "braycurtis"
- "canberra"
- "chebyshev"
- "correlation" (1 minus the Pearson correlation)
- "cosine"
- "dice"
- "euclidean"
- "hamming"
- "hellinger"
- "jaccard"
- "jensenshannon"
- "kulsinski"
- "sqeuclidean" (squared Euclidean)
- "manhattan"
- "rogerstanimoto"
- "russellrao"
- "sokalmichener"
- "sokalsneath"
- "spearmanr" (1 minus the Spearman rank correlation)
- "symmetrickl" (symmetric Kullback-Leibler divergence)
- "tsss" (Triangle Area Similarity-Sector Area Similarity or TS-SS metric)
- "yule"
For non-sparse data, the following variants are available with
preprocessing: this trades memory for a potential speed up during the
distance calculation. Some minor numerical differences should be expected
compared to the non-preprocessed versions:
- "cosine-preprocess": cosine with preprocessing.
- "correlation-preprocess": correlation with preprocessing.
For non-sparse binary data passed as a logical matrix, the following
metrics have specialized variants which should be substantially faster than
the non-binary variants (in other cases the logical data will be treated as
a dense numeric vector of 0s and 1s):
- "dice"
- "hamming"
- "jaccard"
- "kulsinski"
- "matching"
- "rogerstanimoto"
- "russellrao"
- "sokalmichener"
- "sokalsneath"
- "yule"
use_alt_metric
If TRUE, use faster metrics that maintain the ordering of distances
internally (e.g. squared Euclidean distances if using metric =
"euclidean"), then apply a correction at the end. Probably the only reason
to set this to FALSE is if you suspect that some sort of numeric issue is
occurring with your data in the alternative code path.
order_by_distance
If TRUE (the default), then results for each item are returned by
increasing distance. If you don't need the results sorted, e.g. you are
going to pass the results as initialization to another routine like
graph_knn_query(), set this to FALSE to save a small amount of
computational time (see the Examples section).
n_threads
Number of threads to use.
verbose
If TRUE, log information to the console.
obs
Set to "C" to indicate that the input query and reference matrices store
each observation as a column (the orientation must be consistent). The
default "R" means that observations are stored in each row. Storing the
data by row is usually more convenient, but internally your data will be
converted to column storage. Passing it already column-oriented will save
some memory and (a small amount of) CPU usage; see the Examples section.
Value
an approximate nearest neighbor graph as a list containing:
- idx: an n by k matrix containing the nearest neighbor indices.
- dist: an n by k matrix containing the nearest neighbor distances.
Examples
# 100 reference iris items
iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
# 50 query items
iris_query <- iris[iris$Species == "versicolor", ]
# For each item in iris_query find 4 random neighbors in iris_ref
# If you pass a data frame, non-numeric columns are removed
# set verbose = TRUE to get details on the progress being made
iris_query_random_nbrs <- random_knn_query(iris_query,
  reference = iris_ref,
  k = 4, metric = "euclidean", verbose = TRUE
)
# Manhattan (l1) distance
iris_query_random_nbrs <- random_knn_query(iris_query,
  reference = iris_ref,
  k = 4, metric = "manhattan"
)
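# A hedged sketch of the binary metric path (not part of the original
# examples): the specialized binary metrics expect a logical matrix, so the
# numeric iris columns are thresholded at an arbitrary value of 3 purely for
# illustration.
iris_ref_binary <- as.matrix(iris_ref[, 1:4]) > 3
iris_query_binary <- as.matrix(iris_query[, 1:4]) > 3
iris_query_binary_nbrs <- random_knn_query(iris_query_binary,
  reference = iris_ref_binary,
  k = 4, metric = "hamming"
)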
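# A minimal sketch of sparse input, assuming the Matrix package is installed:
# query and reference must share the same format, here dgCMatrix.
if (requireNamespace("Matrix", quietly = TRUE)) {
  iris_ref_sparse <- Matrix::Matrix(as.matrix(iris_ref[, 1:4]), sparse = TRUE)
  iris_query_sparse <- Matrix::Matrix(as.matrix(iris_query[, 1:4]), sparse = TRUE)
  iris_query_sparse_nbrs <- random_knn_query(iris_query_sparse,
    reference = iris_ref_sparse,
    k = 4, metric = "euclidean"
  )
}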
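# A sketch of column-oriented input: transpose plain numeric matrices so each
# observation is a column, and signal the orientation with obs = "C".
iris_ref_col <- t(as.matrix(iris_ref[, 1:4]))
iris_query_col <- t(as.matrix(iris_query[, 1:4]))
iris_query_col_nbrs <- random_knn_query(iris_query_col,
  reference = iris_ref_col,
  k = 4, metric = "euclidean", obs = "C"
)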
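# If the random neighbors are only needed to initialize another routine such
# as graph_knn_query(), skipping the per-item sort saves a little time.
iris_query_random_init <- random_knn_query(iris_query,
  reference = iris_ref,
  k = 4, metric = "euclidean", order_by_distance = FALSE
)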
[Package rnndescent version 0.1.6 Index]