bootstrap_nns {conText} | R Documentation |
Bootstrap nearest neighbors
Description
Uses bootstrapping (sampling of texts with replacement) to identify the top N nearest neighbors based on cosine or inner product similarity.
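As a rough illustration of the idea (not the package's internal code), the sketch below resamples a toy set of text embeddings with replacement, averages each resample, and ranks candidate features by their mean cosine similarity. The toy matrices and all object names (`text_embeds`, `candidate_embeds`, `cos_sim`) are hypothetical.

```r
# Toy data: 20 texts and 5 candidate features embedded in 4 dimensions (made up).
set.seed(1)
text_embeds <- matrix(rnorm(20 * 4), nrow = 20)
candidate_embeds <- matrix(rnorm(5 * 4), nrow = 5,
                           dimnames = list(paste0("feat", 1:5), NULL))

# Cosine similarity between one vector and every row of a matrix.
cos_sim <- function(v, m) {
  as.vector(m %*% v) / (sqrt(sum(v^2)) * sqrt(rowSums(m^2)))
}

num_bootstraps <- 100
boot_sims <- replicate(num_bootstraps, {
  idx <- sample(nrow(text_embeds), replace = TRUE)   # resample texts
  avg <- colMeans(text_embeds[idx, , drop = FALSE])  # averaged embedding
  cos_sim(avg, candidate_embeds)                     # one similarity per candidate
})
rownames(boot_sims) <- rownames(candidate_embeds)

# Rank candidates by mean similarity across resamples and keep the top N.
N <- 3
top_n <- head(sort(rowMeans(boot_sims), decreasing = TRUE), N)
top_n
```

Each column of `boot_sims` holds the similarities from one resample, so row-wise summaries describe how stable each candidate's similarity is across resamples.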
Usage
bootstrap_nns(
context = NULL,
pre_trained = NULL,
transform = TRUE,
transform_matrix = NULL,
candidates = NULL,
bootstrap = TRUE,
num_bootstraps = 100,
confidence_level = 0.95,
N = 50,
norm = "l2"
)
Arguments
context |
(character) vector of texts - |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
transform |
(logical) - if TRUE (default) apply the a la carte transformation, if FALSE output the untransformed averaged embedding. |
transform_matrix |
(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings. |
candidates |
(character) vector defining the candidates for nearest neighbors - e.g. output from |
bootstrap |
(logical) if TRUE, bootstrap similarity values - sample from texts with replacement. Required to get std. errors. |
num_bootstraps |
(numeric) - number of bootstraps to use. |
confidence_level |
(numeric in (0,1)) confidence level e.g. 0.95 |
N |
(numeric) number of nearest neighbors to return. |
norm |
(character) - how to compute the similarity (see ?text2vec::sim2): "l2" for cosine similarity, "none" for inner product.
|
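How transform, transform_matrix and norm fit together can be sketched in base R as follows. This is a minimal illustration under assumptions, not the package's implementation: the toy embeddings, the simple rescaling used as a transformation matrix, and the orientation of the matrix product are all made up for the example. Scaling vectors to unit (L2) length before taking the inner product yields cosine similarity; skipping the scaling leaves the raw inner product.

```r
D <- 3
# Hypothetical pre-trained embeddings: F = 4 features, D = 3 dimensions.
pre_trained <- matrix(c(1, 0, 0,
                        0, 1, 0,
                        1, 1, 0,
                        0, 0, 2),
                      ncol = D, byrow = TRUE,
                      dimnames = list(c("a", "b", "c", "d"), NULL))
# Hypothetical D x D transformation matrix (a simple rescaling here).
transform_matrix <- diag(c(2, 1, 1))

# Averaged embedding of the words in one context, then the transformation.
context_words <- c("a", "b")
avg_embed <- colMeans(pre_trained[context_words, , drop = FALSE])  # (0.5, 0.5, 0)
alc_embed <- as.vector(transform_matrix %*% avg_embed)             # (1, 0.5, 0)

# No scaling: plain inner product with every candidate feature.
inner <- as.vector(pre_trained %*% alc_embed)

# L2 scaling first: the inner product of unit vectors is cosine similarity.
l2 <- function(v) v / sqrt(sum(v^2))
cosine <- apply(pre_trained, 1, function(row) sum(l2(row) * l2(alc_embed)))

round(inner, 3)
round(cosine, 3)
```

Cosine similarity ignores vector length, so it is bounded in [-1, 1], whereas the inner product also reflects the magnitude of the embeddings.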
Value
a data.frame with the following columns:
feature
(character) vector of feature terms corresponding to the nearest neighbors.
value
(numeric) cosine/inner product similarity between texts and feature. Average over bootstrapped samples if bootstrap = TRUE.
std.error
(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci
(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci
(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
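One plausible way to arrive at these columns from a set of bootstrapped similarities is sketched below. It assumes the standard error is the standard deviation across resamples and the interval is a percentile interval; whether bootstrap_nns uses exactly these definitions is not stated here, and `boot_vals` is made-up data.

```r
set.seed(7)
confidence_level <- 0.95

# Made-up bootstrapped similarities: one row per feature, one column per resample.
boot_vals <- rbind(border = rnorm(100, mean = 0.60, sd = 0.03),
                   reform = rnorm(100, mean = 0.55, sd = 0.05))

alpha <- 1 - confidence_level
summary_df <- data.frame(
  feature   = rownames(boot_vals),
  value     = rowMeans(boot_vals),        # average over resamples
  std.error = apply(boot_vals, 1, sd),    # spread across resamples (assumed definition)
  lower.ci  = apply(boot_vals, 1, quantile, probs = alpha / 2),      # percentile bound (assumed)
  upper.ci  = apply(boot_vals, 1, quantile, probs = 1 - alpha / 2),  # percentile bound (assumed)
  row.names = NULL
)
summary_df
```

With a single pass over the texts (bootstrap = FALSE) there is no distribution of values to summarize, which is why the error and interval columns only apply when bootstrapping is enabled.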
Examples
library(conText)

# find contexts of immigration
context_immigration <- get_context(x = cr_sample_corpus,
                                   target = 'immigration',
                                   window = 6,
                                   valuetype = "fixed",
                                   case_insensitive = TRUE,
                                   hard_cut = FALSE, verbose = FALSE)

# find local vocab (use it to define the candidates for nearest neighbors)
local_vocab <- get_local_vocab(context_immigration$context,
                               pre_trained = cr_glove_subset)

# fix the seed so the bootstrap resamples are reproducible
set.seed(42L)
nns_immigration <- bootstrap_nns(context = context_immigration$context,
                                 pre_trained = cr_glove_subset,
                                 transform_matrix = cr_transform,
                                 transform = TRUE,
                                 candidates = local_vocab,
                                 bootstrap = TRUE,
                                 num_bootstraps = 100,
                                 confidence_level = 0.95,
                                 N = 50,
                                 norm = "l2")

# inspect the top-ranked nearest neighbors
head(nns_immigration)