nns_ratio {conText} | R Documentation |
Computes the ratio of cosine similarities for two embeddings over the union of their respective top N nearest neighbors.
Description
Computes the ratio of cosine similarities between group embeddings and features. That is, for any given feature, it first computes the similarity between that feature and each group embedding, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group. Values larger than 1 mean the feature is more discriminant of the group in the numerator; values smaller than 1 mean it is more discriminant of the group in the denominator.
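The ratio described above can be illustrated with a minimal base-R sketch. This is not the package's implementation: the helper `cos_sim_vec`, the two-dimensional group embeddings, and the feature vectors are all hypothetical values chosen so the arithmetic is easy to follow.

```r
# Hypothetical helper: cosine similarity between two numeric vectors
cos_sim_vec <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Made-up 2-dimensional group embeddings (assumption, not real data)
group_R <- c(1, 0)
group_D <- c(0, 1)

# Made-up feature embeddings, one row per feature
features <- rbind(
  border = c(3, 4),
  reform = c(4, 3)
)

# Similarity of every feature to each group embedding
sim_R <- apply(features, 1, cos_sim_vec, b = group_R)
sim_D <- apply(features, 1, cos_sim_vec, b = group_D)

# Ratio with group R in the numerator
ratio <- sim_R / sim_D
print(ratio)
```

In this toy setup "reform" has a ratio above 1 (closer to the numerator group) and "border" has a ratio below 1 (closer to the denominator group).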
Usage
nns_ratio(
  x,
  N = 10,
  numerator = NULL,
  candidates = character(0),
  pre_trained,
  stem = FALSE,
  language = "porter",
  verbose = TRUE,
  show_language = TRUE
)
Arguments
x |
a (quanteda) |
N |
(numeric) number of nearest neighbors to return. Nearest neighbors
consist of the union of the top N nearest neighbors of the embeddings in x. |
numerator |
(character) defines which group is the numerator in the ratio. |
candidates |
(character) vector of features to consider as candidates to be nearest neighbors.
You may, for example, want to consider only features that meet a certain count threshold,
or to exclude stop words. To do so, identify the set of features you
want to consider and supply these as a character vector in the candidates argument. |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
stem |
(logical) - whether to stem candidates when evaluating nns. Default is FALSE.
If TRUE, candidate stems are ranked by their average cosine similarity to the target.
We recommend you remove misspelled words from candidate set |
language |
the name of a recognized language, as returned by
|
verbose |
report which group is the numerator and which group is the denominator. |
show_language |
(logical) if TRUE print out message with language used for stemming. |
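As a rough illustration of how the N argument shapes the output, the sketch below takes the union of each group's top N nearest neighbors and computes the ratio over that union. It uses base R only; the similarity values and feature names are made up, and the code is an assumption about the general idea rather than the function's actual internals.

```r
# Made-up cosine similarities of five features to two group embeddings
sim_R <- c(border = 0.9, illegal = 0.8, reform = 0.5, family = 0.3, law = 0.7)
sim_D <- c(border = 0.6, illegal = 0.4, reform = 0.9, family = 0.8, law = 0.7)

N <- 2

# Top N nearest neighbors of each group embedding
top_R <- names(sort(sim_R, decreasing = TRUE))[seq_len(N)]
top_D <- names(sort(sim_D, decreasing = TRUE))[seq_len(N)]

# Union of the two neighbor sets; overlapping features appear only once
nns_union <- union(top_R, top_D)

# Ratio of similarities over the union, with group R in the numerator
result <- data.frame(
  feature = nns_union,
  value = unname(sim_R[nns_union] / sim_D[nns_union])
)
print(result)
```

Because the two top-N sets are merged with `union()`, the result has at most 2 * N rows, and fewer whenever the sets share features.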
Value
a data.frame with the following columns:
feature
(character) features in candidates (or all features if candidates not defined), one instance for each embedding in x.
value
(numeric) ratio of cosine similarities.
Examples
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
# build document-feature matrix
immig_dfm <- dfm(immig_toks)
# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)
# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)
# compute the ratio of cosine similarities between the two parties'
# embeddings for a specific set of features
nns_ratio(x = immig_wv_party, N = 10, numerator = "R",
          candidates = immig_wv_party@features,
          pre_trained = cr_glove_subset, verbose = FALSE)
# with stemming
nns_ratio(x = immig_wv_party, N = 10, numerator = "R",
          candidates = immig_wv_party@features,
          pre_trained = cr_glove_subset, stem = TRUE, verbose = FALSE)