dem {conText}    R Documentation

Build a document-embedding matrix

Description

Given a document-feature-matrix, multiply each document's feature counts (columns) by the corresponding pretrained word embeddings and average the result (usually referred to as averaged or additive document embeddings). If transform = TRUE and a transformation matrix is provided, multiply the document embeddings by the transformation matrix to obtain the corresponding 'a la carte' (ALC) document embeddings (see eq. 2: https://arxiv.org/pdf/1805.05388.pdf).
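The computation described above can be sketched in base R on toy data. This is a minimal illustration rather than the package's internal code: the matrices counts, emb and A are hypothetical stand-ins for the dfm, the pretrained embeddings and the transformation matrix, and the orientation of the transformation (each averaged embedding u mapped to A u) is an assumption based on eq. 2 of the paper.

```r
# Toy document-feature counts: 2 documents x 3 features (hypothetical data).
counts <- matrix(c(2, 0, 1,
                   0, 1, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))

# Toy pretrained embeddings: 3 features x 2 dimensions (hypothetical values).
emb <- matrix(c(1, 0,
                0, 1,
                1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("d1", "d2")))

# Count-weighted average of the embeddings of each document's features.
avg <- (counts %*% emb[colnames(counts), ]) / rowSums(counts)

# Hypothetical D x D transformation matrix (assumption: u is mapped to A u).
A <- matrix(c(2, 0,
              0, 3),
            nrow = 2, byrow = TRUE)

# Apply the transformation to every document's averaged embedding.
alc <- t(A %*% t(avg))
```

Here avg holds the untransformed averaged embeddings (one row per document) and alc holds the same rows after the transformation.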

Usage

dem(x, pre_trained, transform = TRUE, transform_matrix, verbose = TRUE)

Arguments

x

a quanteda (dfm-class) document-feature-matrix

pre_trained

(numeric) an F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default), apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

verbose

(logical) if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.

Value

an N x D (dem-class) document-embedding-matrix corresponding to the ALC embeddings for each document (or to the untransformed averaged embeddings if transform = FALSE). N = number of documents (that could be embedded), D = dimensions of pretrained embeddings. This object inherits the document variables in x, the dfm used. These can be accessed by calling the attribute @docvars. Note that documents with no overlapping features with the pretrained embeddings provided are automatically dropped. For a list of the documents that were embedded, call the attribute @Dimnames$docs.
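Why N can be smaller than the number of documents in x can be illustrated with a small base R sketch. It is an assumption-laden illustration, not the package's implementation: counts and emb are hypothetical toy matrices, and the filtering steps only mimic the documented behaviour of dropping documents that share no features with the pretrained embeddings.

```r
# Toy counts: doc2 only contains a feature ("zzz") that has no embedding.
counts <- matrix(c(1, 2, 0,
                   0, 0, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("a", "b", "zzz")))

# Toy pretrained embeddings for features "a" and "b" only (hypothetical values).
emb <- matrix(c(1, 0,
                0, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("a", "b"), c("d1", "d2")))

# Keep only the features that have a pretrained embedding.
shared <- intersect(colnames(counts), rownames(emb))
kept <- counts[, shared, drop = FALSE]

# Documents left with zero counts cannot be embedded and are dropped.
has_overlap <- rowSums(kept) > 0
dropped <- rownames(kept)[!has_overlap]
kept <- kept[has_overlap, , drop = FALSE]

# Averaged embeddings for the remaining documents: an N x D matrix.
avg <- (kept %*% emb[shared, , drop = FALSE]) / rowSums(kept)
```

In this toy case dropped contains the one document that could not be embedded, and rownames(avg) lists the documents that were, which is the role the text above assigns to @Dimnames$docs.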

Examples


library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# construct document-feature-matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform,
                 verbose = FALSE)

[Package conText version 1.4.3 Index]