get_ncs {conText}    R Documentation

Given a set of tokenized contexts, find the top N nearest contexts.

Description

This is a wrapper function for ncs() that lets users go from a tokenized corpus to results in a single call, with the option to bootstrap the cosine similarities and obtain the corresponding standard errors.
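
To make the idea of "nearest contexts" concrete, the following base R sketch walks through one plausible version of the computation on toy data: each context is embedded by averaging the pre-trained vectors of its tokens and applying a transformation matrix, the context embeddings are averaged into a single target embedding, and contexts are then ranked by cosine similarity to it. Everything here is an assumption made for illustration (the vocabulary, the random embeddings, the identity matrix standing in for a real transformation matrix, and the helper names embed_context and cosine); it is not the package's internal code.

```r
# Toy illustration only: all objects below are invented for this sketch.
set.seed(1)
vocab <- c("immigration", "border", "reform", "visa", "tax", "budget")
D <- 4
pre_trained <- matrix(rnorm(length(vocab) * D), nrow = length(vocab),
                      dimnames = list(vocab, NULL))
transform_matrix <- diag(D)  # identity stands in for a real D x D matrix

contexts <- list(
  c("border", "reform", "visa"),
  c("tax", "budget", "reform"),
  c("visa", "border", "unknownword")  # tokens without an embedding are skipped
)

# embed one context: average the pre-trained vectors of its known tokens,
# then apply the transformation matrix (hypothetical helper)
embed_context <- function(tokens, pre_trained, transform_matrix) {
  known <- tokens[tokens %in% rownames(pre_trained)]
  avg <- colMeans(pre_trained[known, , drop = FALSE])
  as.numeric(transform_matrix %*% avg)
}

# one row per context, one column per embedding dimension
context_emb <- t(sapply(contexts, embed_context,
                        pre_trained = pre_trained,
                        transform_matrix = transform_matrix))

# target embedding: average over all context embeddings
target_emb <- colMeans(context_emb)

# cosine similarity between two vectors (hypothetical helper)
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
sims <- apply(context_emb, 1, cosine, b = target_emb)

# keep the N contexts most similar to the target embedding
N <- 2
top <- order(sims, decreasing = TRUE)[seq_len(N)]
result <- data.frame(
  context = vapply(contexts[top], paste, character(1), collapse = " "),
  rank = seq_len(N),
  value = sims[top]
)
result
```

With groups supplied, the same ranking would be repeated once per group, using a group-specific target embedding.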

Usage

get_ncs(
  x,
  N = 5,
  groups = NULL,
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  as_list = TRUE
)

Arguments

x

a (quanteda) tokens-class object

N

(numeric) number of nearest contexts to return

groups

a character or factor variable equal in length to the number of documents

pre_trained

(numeric) an F x D matrix of pre-trained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

bootstrap

(logical) if TRUE, use bootstrapping: sample from x with replacement and re-estimate cosine similarities for each sample. Required to get standard errors. If groups is defined, sampling is automatically stratified.

num_bootstraps

(integer) number of bootstraps to use.

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

as_list

(logical) if FALSE, all results are combined into a single data.frame. If TRUE, a list of data.frames is returned, with one data.frame per embedding.
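
The interaction between bootstrap, groups, num_bootstraps and confidence_level can be sketched in base R as follows. The sketch resamples contexts with replacement within each group (stratified sampling), recomputes a cosine similarity per group on every draw, and summarizes the draws with a standard deviation and a percentile interval. The toy embeddings, the reference vector, the helper names cosine and boot_once, and the use of a percentile interval are all assumptions made for illustration; the package may compute its intervals differently.

```r
set.seed(42)
n <- 40
D <- 5
emb <- matrix(rnorm(n * D), nrow = n)     # hypothetical context embeddings
groups <- rep(c("D", "R"), each = n / 2)  # hypothetical group labels
reference <- rnorm(D)                     # hypothetical embedding to compare against

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# one bootstrap draw: resample rows within each group (stratified),
# average the resampled embeddings per group, recompute the similarity
boot_once <- function(emb, groups, reference) {
  idx <- unlist(lapply(split(seq_along(groups), groups),
                       function(i) sample(i, length(i), replace = TRUE)))
  grp <- groups[idx]
  vapply(split(seq_along(idx), grp), function(j) {
    cosine(colMeans(emb[idx[j], , drop = FALSE]), reference)
  }, numeric(1))
}

num_bootstraps <- 100
confidence_level <- 0.95

# rows = bootstrap draws, columns = groups
draws <- t(replicate(num_bootstraps, boot_once(emb, groups, reference)))

alpha <- 1 - confidence_level
boot_summary <- data.frame(
  group = colnames(draws),
  value = colMeans(draws),
  std.error = apply(draws, 2, sd),
  lower.ci = apply(draws, 2, quantile, probs = alpha / 2),
  upper.ci = apply(draws, 2, quantile, probs = 1 - alpha / 2),
  row.names = NULL
)
boot_summary
```

Stratifying keeps the group sizes fixed across draws, so the spread of the draws reflects variation in which contexts are sampled rather than variation in how many contexts each group contributes.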

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target

(character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).

context

(character) contexts collapsed into single documents (i.e. untokenized).

rank

(character) rank of context in terms of similarity with x.

value

(numeric) cosine similarity between x and context.

std.error

(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
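
When as_list = TRUE, the per-target data.frames can be stacked into one table with base R if a single data.frame is more convenient downstream. The sketch below uses a hand-built list that only mimics the documented columns; the contexts and similarity values are invented for illustration and are not real output from get_ncs().

```r
# hypothetical output mimicking the documented columns (values are invented)
ncs_list <- list(
  D = data.frame(target = "D", context = c("ctx a", "ctx b"),
                 rank = 1:2, value = c(0.81, 0.77)),
  R = data.frame(target = "R", context = c("ctx c", "ctx d"),
                 rank = 1:2, value = c(0.79, 0.70))
)

# stack the list into one data.frame and drop the auto-generated row names
ncs_df <- do.call(rbind, ncs_list)
rownames(ncs_df) <- NULL

# keep only the single nearest context for each target
top_per_target <- ncs_df[ncs_df$rank == 1, ]
top_per_target
```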

Examples


library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration",
                             window = 6L, rm_keyword = FALSE)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# compare nearest contexts between groups
set.seed(2021L)
immig_party_ncs <- get_ncs(x = immig_toks,
                           N = 10,
                           groups = docvars(immig_toks, 'party'),
                           pre_trained = cr_glove_subset,
                           transform = TRUE,
                           transform_matrix = cr_transform,
                           bootstrap = TRUE,
                           num_bootstraps = 100,
                           confidence_level = 0.95,
                           as_list = TRUE)

# nearest contexts of "immigration" for the Democratic party (group "D")
immig_party_ncs[["D"]]

[Package conText version 1.4.3 Index]