graph_knn_query {rnndescent} | R Documentation |
Query a search graph for nearest neighbors
Description
Run queries against a search graph, to return nearest neighbors taken from the reference data used to build that graph.
Usage
graph_knn_query(
query,
reference,
reference_graph,
k = NULL,
metric = "euclidean",
init = NULL,
epsilon = 0.1,
max_search_fraction = 1,
use_alt_metric = TRUE,
n_threads = 0,
verbose = FALSE,
obs = "R"
)
Arguments
query |
Matrix of |
reference |
Matrix of |
reference_graph |
Search graph of the |
k |
Number of nearest neighbors to return. Optional if |
metric |
Type of distance calculation to use. One of:
For non-sparse data, the following variants are available with preprocessing: this trades memory for a potential speed up during the distance calculation. Some minor numerical differences should be expected compared to the non-preprocessed versions:
For non-sparse binary data passed as a
|
init |
Initial
|
epsilon |
Controls trade-off between accuracy and search cost, as
described by Iwasaki and Miyazaki (2018), by specifying a distance
tolerance on whether to explore the neighbors of candidate points. The
larger the value, the more neighbors will be searched. A value of 0.1
allows query-candidate distances to be 10% larger than the current
most-distant neighbor of the query point, 0.2 means 20%, and so on.
Suggested values are between 0-0.5, although this value is highly dependent
on the distribution of distances in the dataset (higher dimensional data
should choose a smaller cutoff). Too large a value of |
max_search_fraction |
Maximum fraction of the reference data to search.
This is a value between 0 (search none of the reference data) and 1 (search
all of the data if necessary). This works in conjunction with |
use_alt_metric |
If |
n_threads |
Number of threads to use. |
verbose |
If |
obs |
set to |
Details
A greedy beam search is used to query the graph, combining two search pruning
strategies. The first, due to Iwasaki and Miyazaki (2018), only considers
new candidates within a relative distance of the current furthest neighbor
in the query's graph. The second, due to Harwood and Drummond (2016), puts a
limit on the absolute number of distance calculations to carry out. See the
epsilon
and max_search_fraction
parameters respectively.
Value
the approximate nearest neighbor graph as a list containing:
-
idx
an
byk
matrix containing the nearest neighbor indices specifying the row of the neighbor inreference
. -
dist
an
byk
matrix containing the nearest neighbor distances.
References
Harwood, B., & Drummond, T. (2016). Fanng: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5713-5722).
Iwasaki, M., & Miyazaki, D. (2018). Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data. arXiv preprint arXiv:1810.07355.
Examples
# 100 reference iris items
iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
# 50 query items
iris_query <- iris[iris$Species == "versicolor", ]
# First, find the approximate 4-nearest neighbor graph for the references:
iris_ref_graph <- nnd_knn(iris_ref, k = 4)
# For each item in iris_query find the 4 nearest neighbors in iris_ref.
# You need to pass both the reference data and the reference graph.
# If you pass a data frame, non-numeric columns are removed.
# set verbose = TRUE to get details on the progress being made
iris_query_nn <- graph_knn_query(iris_query, iris_ref, iris_ref_graph,
k = 4, metric = "euclidean", verbose = TRUE
)
# A more complete example, converting the initial knn into a search graph
# and using a filtered random projection forest to initialize the search
# create initial knn and forest
iris_ref_graph <- nnd_knn(iris_ref, k = 4, init = "tree", ret_forest = TRUE)
# keep the best tree in the forest
forest <- rpf_filter(iris_ref_graph, n_trees = 1)
# expand the knn into a search graph
iris_ref_search_graph <- prepare_search_graph(iris_ref, iris_ref_graph)
# run the query with the improved graph and initialization
iris_query_nn <- graph_knn_query(iris_query, iris_ref, iris_ref_search_graph,
init = forest, k = 4
)