get_cos_sim {conText} | R Documentation |
Given a tokenized corpus, compute the cosine similarities of the resulting ALC embeddings and a defined set of features.
Description
This is a wrapper function for cos_sim() that allows users to go from a
tokenized corpus to results, with the option to bootstrap cosine similarities
and get the corresponding standard errors.
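For orientation, the quantity being reported is the cosine of the angle between two embedding vectors. The following is a minimal base-R sketch of that calculation, not the package's implementation; `cosine_similarity` and the toy vectors are hypothetical names introduced here for illustration.

```r
# Hypothetical helper (not part of conText): cosine similarity of two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy 3-dimensional "embeddings" (made-up values, for illustration only).
alc_embedding <- c(1, 2, 3)
feature_embedding <- c(2, 4, 6)

# Vectors pointing in the same direction are maximally similar,
# regardless of their length.
sim_parallel <- cosine_similarity(alc_embedding, feature_embedding)

# Orthogonal vectors have no similarity.
sim_orthogonal <- cosine_similarity(c(1, 0), c(0, 1))
```

Because the measure ignores vector length, only the direction of the embedding relative to each feature matters.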
Usage
get_cos_sim(
  x,
  groups = NULL,
  features = character(0),
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  stem = FALSE,
  language = "porter",
  as_list = TRUE
)
Arguments
x |
a (quanteda) |
groups |
(numeric, factor, character) a binary variable of the same length as |
features |
(character) features of interest |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
transform |
(logical) if TRUE (default) apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings. |
transform_matrix |
(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings. |
bootstrap |
(logical) if TRUE, use bootstrapping – sample from texts with replacement and
re-estimate cosine similarities for each sample. Required to get std. errors.
If |
num_bootstraps |
(integer) number of bootstraps to use. |
confidence_level |
(numeric in (0,1)) confidence level e.g. 0.95 |
stem |
(logical) - If TRUE, both |
language |
the name of a recognized language, as returned by
|
as_list |
(logical) if FALSE, all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per feature. |
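To make the roles of pre_trained, transform and transform_matrix concrete, here is a rough base-R sketch of how a single context could be embedded: average the pre-trained embeddings of its tokens, then optionally multiply by the D x D matrix. This is an illustration under stated assumptions, not the package's code; `embed_context`, `pre_trained_toy`, `transform_toy` and all numeric values are hypothetical.

```r
# Toy pre-trained embeddings: F = 4 features, D = 2 dimensions (made-up values).
pre_trained_toy <- matrix(
  c(1, 0,
    0, 1,
    1, 1,
    2, 0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("border", "reform", "law", "policy"), NULL)
)

# Toy D x D transformation matrix (made-up values).
transform_toy <- matrix(c(2, 0,
                          0, 0.5), ncol = 2, byrow = TRUE)

# Hypothetical helper: embed one context, given as a character vector of tokens.
embed_context <- function(context, pre_trained, transform = TRUE,
                          transform_matrix = NULL) {
  # Keep only tokens that have a pre-trained embedding.
  known <- context[context %in% rownames(pre_trained)]
  # Average the embeddings of the remaining tokens.
  avg <- colMeans(pre_trained[known, , drop = FALSE])
  # Optionally apply the D x D transformation to the averaged embedding.
  if (transform) as.vector(transform_matrix %*% avg) else avg
}

context <- c("border", "law", "unknownword")

untransformed <- embed_context(context, pre_trained_toy, transform = FALSE)
transformed <- embed_context(context, pre_trained_toy, transform = TRUE,
                             transform_matrix = transform_toy)
```

In this toy case the token without a pre-trained embedding is simply ignored, and the transformation rescales each dimension of the averaged vector.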
Value
a data.frame or list of data.frames (one for each target) with the following columns:
target
(character) rownames of x, the labels of the ALC embeddings.
feature
(character) feature terms defined in the features argument.
value
(numeric) cosine similarity between x and feature. Average over bootstrapped samples if bootstrap = TRUE.
std.error
(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci
(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci
(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
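The bootstrap columns can be understood through a small base-R sketch: resample contexts with replacement, recompute the cosine similarity for each resample, and summarise the resulting distribution. This is a sketch only. The data are random toy values, all object names are hypothetical, and the percentile-based interval is an assumption made for illustration; the package may construct its interval differently.

```r
set.seed(1)

# Hypothetical helper (not part of conText).
cosine_similarity <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Toy data: 50 context embeddings in 5 dimensions (random, illustration only).
context_embeddings <- matrix(rnorm(50 * 5, mean = 1), nrow = 50)
feature_embedding <- rep(1, 5)

num_bootstraps <- 100
confidence_level <- 0.95

# For each bootstrap sample: resample contexts with replacement,
# average them, and compute the cosine similarity with the feature.
boot_values <- replicate(num_bootstraps, {
  idx <- sample(nrow(context_embeddings), replace = TRUE)
  avg <- colMeans(context_embeddings[idx, , drop = FALSE])
  cosine_similarity(avg, feature_embedding)
})

# Summarise the bootstrap distribution (percentile interval is an assumption).
alpha <- 1 - confidence_level
result <- data.frame(
  value = mean(boot_values),
  std.error = sd(boot_values),
  lower.ci = unname(quantile(boot_values, alpha / 2)),
  upper.ci = unname(quantile(boot_values, 1 - alpha / 2))
)
result
```

With bootstrap = FALSE there is only a single estimate per target and feature, which is why no standard error or interval can be reported in that case.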
Examples
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)
# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))
# compute the cosine similarity between each group's embedding
# and a specific set of features
set.seed(2021L)
get_cos_sim(x = immig_toks,
            groups = docvars(immig_toks, 'party'),
            features = c("reform", "enforce"),
            pre_trained = cr_glove_subset,
            transform = TRUE,
            transform_matrix = cr_transform,
            bootstrap = TRUE,
            num_bootstraps = 100,
            confidence_level = 0.95,
            stem = TRUE,
            as_list = FALSE)