contrast_nns {conText} | R Documentation |
Contrast nearest neighbors
Description
Computes the ratio of cosine similarities between group embeddings and features. For any given feature, it first computes the cosine similarity between that feature and each of the two group embeddings, and then takes the ratio of these two similarities. This ratio captures how strongly a feature discriminates between the groups: a ratio close to 1 means the feature is about equally similar to both.
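The computation described above can be sketched in base R. This is only an illustration, not the package's implementation: the embeddings are hypothetical toy values, `cos_sim` is a helper defined here, and which group ends up in the numerator is an assumption of the sketch.

```r
# Cosine similarity between two numeric vectors (helper defined for this sketch)
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical 3-dimensional group embeddings (toy values)
group_a <- c(1, 0, 0)
group_b <- c(0, 1, 0)

# Hypothetical feature embeddings, one row per feature
features <- rbind(
  border = c(0.8, 0.6, 0),
  reform = c(0.6, 0.8, 0)
)

# Similarity of every feature to each group embedding
sim_a <- apply(features, 1, cos_sim, b = group_a)
sim_b <- apply(features, 1, cos_sim, b = group_b)

# Ratio of the two similarities (group A in the numerator by assumption)
ratio <- sim_a / sim_b
print(ratio)
```

In this toy setup, a ratio above 1 marks a feature that sits closer to group A's embedding, and a ratio below 1 marks one closer to group B's.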
Usage
contrast_nns(
  x,
  groups = NULL,
  pre_trained = NULL,
  transform = TRUE,
  transform_matrix = NULL,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  permute = TRUE,
  num_permutations = 100,
  candidates = NULL,
  N = 20,
  verbose = TRUE
)
Arguments
x |
(quanteda) |
groups |
(numeric, factor, character) a binary variable of the same length as |
pre_trained |
(numeric) an F x D matrix of pre-trained embeddings, where F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform |
(logical) if TRUE (default), apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings.
transform_matrix |
(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings. |
bootstrap |
(logical) if TRUE, use bootstrapping: sample from texts with replacement and re-estimate cosine ratios for each sample. Required to get standard errors.
num_bootstraps |
(numeric) number of bootstraps to use
confidence_level |
(numeric in (0,1)) confidence level e.g. 0.95 |
permute |
(logical) if TRUE, compute empirical p-values using a permutation test
num_permutations |
(numeric) number of permutations to use
candidates |
(character) vector of candidate features for nearest neighbors |
N |
(numeric) nearest neighbors are subset to the union of the N neighbors of each group (if NULL, the ratio is computed for all features)
verbose |
(logical) if TRUE, report the documents that had no overlapping features with the pre-trained embeddings provided.
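To make the bootstrap, num_bootstraps and confidence_level arguments more concrete, here is a minimal base R sketch of bootstrapping a cosine ratio for a single feature. It is not the package's code. The data are simulated, `cos_sim` and `cos_ratio` are helpers defined here, resampling within each group and using percentile intervals are assumptions of the sketch, and the package may compute its intervals differently.

```r
set.seed(1)

cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Simulated data: 40 text embeddings (rows) in 5 dimensions, two groups
emb    <- matrix(rnorm(40 * 5, mean = 1), nrow = 40)
groups <- rep(c("A", "B"), each = 20)
feat   <- rnorm(5, mean = 1)  # embedding of one candidate feature

# Ratio of the feature's similarity to each group's average embedding
cos_ratio <- function(emb, groups, feat) {
  cos_sim(colMeans(emb[groups == "A", , drop = FALSE]), feat) /
    cos_sim(colMeans(emb[groups == "B", , drop = FALSE]), feat)
}

num_bootstraps   <- 100
confidence_level <- 0.95

boot <- replicate(num_bootstraps, {
  # Resample texts with replacement within each group (assumption of this
  # sketch, so that neither group can end up empty)
  idx <- unlist(lapply(split(seq_along(groups), groups),
                       function(i) sample(i, length(i), replace = TRUE)))
  cos_ratio(emb[idx, , drop = FALSE], groups[idx], feat)
})

alpha  <- 1 - confidence_level
result <- c(
  value     = mean(boot),                            # average over bootstraps
  std.error = sd(boot),                              # bootstrap standard error
  lower.ci  = unname(quantile(boot, alpha / 2)),     # percentile interval
  upper.ci  = unname(quantile(boot, 1 - alpha / 2))
)
print(result)
```

Raising num_bootstraps makes the standard error and interval bounds more stable, at the cost of run time.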
Value
a data.frame with the following columns:
feature
(character) vector of feature terms corresponding to the nearest neighbors.
value
(numeric) ratio of cosine similarities. Average over bootstrapped samples if bootstrap = TRUE.
std.error
(numeric) std. error of the ratio of cosine similarities. Column is dropped if bootstrap = FALSE.
lower.ci
(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci
(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
p.value
(numeric) empirical p-value. Column is dropped if permute = FALSE.
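As a rough illustration of how an empirical p-value like the one in the p.value column can be obtained from a permutation test, the base R sketch below shuffles group labels and recomputes the ratio. The data are simulated, the helpers `cos_sim` and `cos_ratio` are defined here, and measuring extremeness as distance from 1 is an assumption of the sketch, so the package's exact test statistic may differ.

```r
set.seed(1)

cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Simulated data: 40 text embeddings (rows) in 5 dimensions, two groups
emb    <- matrix(rnorm(40 * 5, mean = 1), nrow = 40)
groups <- rep(c("A", "B"), each = 20)
feat   <- rnorm(5, mean = 1)  # embedding of one candidate feature

# Ratio of the feature's similarity to each group's average embedding
cos_ratio <- function(emb, groups, feat) {
  cos_sim(colMeans(emb[groups == "A", , drop = FALSE]), feat) /
    cos_sim(colMeans(emb[groups == "B", , drop = FALSE]), feat)
}

observed <- cos_ratio(emb, groups, feat)

# Shuffle the group labels to break any link between group and embedding,
# then recompute the ratio under that shuffled assignment
num_permutations <- 100
permuted <- replicate(num_permutations,
                      cos_ratio(emb, sample(groups), feat))

# Share of permuted ratios at least as far from 1 as the observed ratio
# (two-sided comparison; an assumption of this sketch)
p_value <- mean(abs(permuted - 1) >= abs(observed - 1))
print(p_value)
```

With only num_permutations shuffles, the p-value can take no more than num_permutations + 1 distinct values, so a small number of permutations gives a coarse estimate.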
Examples
library(quanteda)
library(conText)  # provides contrast_nns() and the cr_* example data

# tokenize the corpus and build contexts around the target term
cr_toks <- tokens(cr_sample_corpus)
immig_toks <- tokens_context(x = cr_toks,
                             pattern = "immigration", window = 6L,
                             hard_cut = FALSE, verbose = TRUE)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# contrast the nearest neighbors of the two parties
set.seed(42L)
party_nns <- contrast_nns(x = immig_toks,
                          groups = docvars(immig_toks, 'party'),
                          pre_trained = cr_glove_subset,
                          transform = TRUE, transform_matrix = cr_transform,
                          bootstrap = TRUE,
                          num_bootstraps = 100,
                          confidence_level = 0.95,
                          permute = TRUE, num_permutations = 10,
                          candidates = NULL, N = 20,
                          verbose = FALSE)

head(party_nns)