nnd_knn {rnndescent} | R Documentation |
Find nearest neighbors using nearest neighbor descent
Description
Uses the Nearest Neighbor Descent method due to Dong and co-workers (2011) to optimize an approximate nearest neighbor graph.
Usage
nnd_knn(
data,
k = NULL,
metric = "euclidean",
init = "rand",
init_args = NULL,
n_iters = NULL,
max_candidates = NULL,
delta = 0.001,
low_memory = TRUE,
weight_by_degree = FALSE,
use_alt_metric = TRUE,
n_threads = 0,
verbose = FALSE,
progress = "bar",
obs = "R",
ret_forest = FALSE
)
Arguments
data |
Matrix of |
k |
Number of nearest neighbors to return. Optional if |
metric |
Type of distance calculation to use. One of:
For non-sparse data, the following variants are available with preprocessing: this trades memory for a potential speed up during the distance calculation. Some minor numerical differences should be expected compared to the non-preprocessed versions:
For non-sparse binary data passed as a
|
init |
Name of the initialization strategy or initial
If
|
init_args |
a list containing arguments to pass to the random partition
forest initialization. See |
n_iters |
Number of iterations of nearest neighbor descent to carry out.
By default, this will be chosen based on the number of observations in
|
max_candidates |
Maximum number of candidate neighbors to try for each
item in each iteration. Use relative to |
delta |
The minimum relative change in the neighbor graph allowed before
early stopping. Should be a value between 0 and 1. The smaller the value,
the smaller the amount of progress between iterations is allowed. Default
value of |
low_memory |
If |
weight_by_degree |
If |
use_alt_metric |
If |
n_threads |
Number of threads to use. |
verbose |
If |
progress |
Determines the type of progress information logged if
|
obs |
set to |
ret_forest |
If |
Details
If no initial graph is provided, a random graph is generated, or you may also specify the use of a graph generated from a forest of random projection trees, using the method of Dasgupta and Freund (2008).
Value
the approximate nearest neighbor graph as a list containing:
-
idx
an n by k matrix containing the nearest neighbor indices. -
dist
an n by k matrix containing the nearest neighbor distances. -
forest
(ifinit = "tree"
andret_forest = TRUE
only): the RP forest used to initialize the neighbor data.
References
Dasgupta, S., & Freund, Y. (2008, May). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing (pp. 537-546). doi:10.1145/1374376.1374452.
Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web (pp. 577-586). ACM. doi:10.1145/1963405.1963487.
Examples
# Find 4 (approximate) nearest neighbors using Euclidean distance
# If you pass a data frame, non-numeric columns are removed
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean")
# Manhattan (l1) distance
iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan")
# Multi-threading: you can choose the number of threads to use: in real
# usage, you will want to set n_threads to at least 2
iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan", n_threads = 1)
# Use verbose flag to see information about progress
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
# Nearest neighbor descent uses random initialization, but you can pass any
# approximation using the init argument (as long as the metrics used to
# calculate the initialization are compatible with the metric options used
# by nnd_knn).
iris_nn <- random_knn(iris, k = 4, metric = "euclidean")
iris_nn <- nnd_knn(iris, init = iris_nn, metric = "euclidean", verbose = TRUE)
# Number of iterations controls how much optimization is attempted. A smaller
# value will run faster but give poorer results
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 2)
# You can also control the amount of work done within an iteration by
# setting max_candidates
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", max_candidates = 50)
# Optimization may also stop early if not much progress is being made. This
# convergence criterion can be controlled via delta. A larger value will
# stop progress earlier. The verbose flag will provide some information if
# convergence is occurring before all iterations are carried out.
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0.5)
# To ensure that descent only stops if no improvements are made, set delta = 0
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0)
# A faster version of the algorithm is available that avoids repeated
# distance calculations at the cost of using more RAM. Set low_memory to
# FALSE to try it.
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", low_memory = FALSE)
# Using init = "tree" is usually more efficient than random initialization.
# arguments to the tree initialization method can be passed via the init_args
# list
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, init = "tree", init_args = list(n_trees = 5))