get_cos_sim {conText}R Documentation

Given a tokenized corpus, compute the cosine similarities of the resulting ALC embeddings and a defined set of features.

Description

This is a wrapper function for cos_sim() that allows users to go from a tokenized corpus to results with the option to bootstrap cosine similarities and get the corresponding std. errors.

Usage

get_cos_sim(
  x,
  groups = NULL,
  features = character(0),
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  stem = FALSE,
  language = "porter",
  as_list = TRUE
)

Arguments

x

a (quanteda) tokens-class object

groups

(numeric, factor, character) a binary variable of the same length as x

features

(character) features of interest

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE ouput untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

bootstrap

(logical) if TRUE, use bootstrapping – sample from texts with replacement and re-estimate cosine similarities for each sample. Required to get std. errors. If groups defined, sampling is automatically stratified.

num_bootstraps

(integer) number of bootstraps to use.

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

stem

(logical) - If TRUE, both features and rownames(pre_trained) are stemmed and average cosine similarities are reported. We recommend you remove misspelled words from pre_trained as these can significantly influence the average.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).

as_list

(logical) if FALSE all results are combined into a single data.frame If TRUE, a list of data.frames is returned with one data.frame per feature.

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target

(character) rownames of x, the labels of the ALC embeddings.

feature

(character) feature terms defined in the features argument.

value

(numeric) cosine similarity between x and feature. Average over bootstrapped samples if bootstrap = TRUE.

std.error

(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

Examples


library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts sorrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# compute the cosine similarity between each group's embedding
# and a specific set of features
set.seed(2021L)
get_cos_sim(x = immig_toks,
            groups = docvars(immig_toks, 'party'),
            features = c("reform", "enforce"),
            pre_trained = cr_glove_subset,
            transform = TRUE,
            transform_matrix = cr_transform,
            bootstrap = TRUE,
            num_bootstraps = 100,
            confidence_level = 0.95,
            stem = TRUE,
            as_list = FALSE)

[Package conText version 1.4.3 Index]