CMDist {text2map} | R Documentation |
Calculate Concept Mover's Distance
Description
Concept Mover's Distance classifies documents of any length along a continuous measure of engagement with a given concept of interest using word embeddings.
Usage
CMDist(
  dtm,
  cw = NULL,
  cv = NULL,
  wv,
  missing = "stop",
  scale = TRUE,
  sens_interval = FALSE,
  alpha = 1,
  n_iters = 20L,
  parallel = FALSE,
  threads = 2L,
  setup_timeout = 120L
)
cmdist(
  dtm,
  cw = NULL,
  cv = NULL,
  wv,
  missing = "stop",
  scale = TRUE,
  sens_interval = FALSE,
  alpha = 1,
  n_iters = 20L,
  parallel = FALSE,
  threads = 2L,
  setup_timeout = 120L
)
Arguments
dtm |
Document-term matrix with words as columns. Works with DTMs
produced by any popular text analysis package, or using the
|
cw |
Vector with concept word(s) (e.g., |
cv |
Concept vector(s) as output from |
wv |
Matrix of word embedding vectors (a.k.a. embedding model) with rows as words. |
missing |
Indicates what action to take if words are not in embeddings.
If |
scale |
Logical (default = |
sens_interval |
Logical (default = |
alpha |
If |
n_iters |
If |
parallel |
Logical (default = |
threads |
If |
setup_timeout |
If |
Details
CMDist() requires three things: (1) a document-term matrix (DTM), (2) a
matrix of word embedding vectors, and (3) concept words or concept vectors.
The function uses word counts from the DTM and word similarities
from the cosine similarity of their respective word vectors in a
word embedding model. The "cost" of transporting all the words in a
document to a single vector or a few vectors (denoting a
concept of interest) is the measure of engagement, with higher costs
indicating less engagement. For intuitiveness, the output of CMDist()
is inverted so that higher numbers indicate more engagement
with a concept of interest.
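The cost-then-invert logic described above can be sketched in base R for the simplest case of a single concept vector. This is only an illustration under stated assumptions, not the package's implementation: the toy embedding values, the toy DTM, and the helper cos_sim() are invented here, and standardizing with scale() is just one plausible way to invert the cost.

```r
# Toy embedding matrix (values invented): rows are words.
wv <- matrix(
  c(0.9, 0.1, 0.0,
    0.8, 0.2, 0.1,
    0.7, 0.3, 0.0,
    0.0, 0.1, 0.9,
    0.1, 0.0, 0.8),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("space", "rocket", "moon", "tax", "budget"), NULL)
)

# Toy DTM (counts invented): rows are documents, columns are words.
dtm <- matrix(
  c(2, 1, 0, 0,
    0, 0, 2, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc_space", "doc_money"),
                  c("rocket", "moon", "tax", "budget"))
)

# Hypothetical helper: cosine similarity of each row of `m` with vector `v`.
cos_sim <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# Cosine distance of every DTM word from the concept vector.
concept <- wv["space", ]
word_cost <- 1 - cos_sim(wv[colnames(dtm), , drop = FALSE], concept)

# Relative word frequencies act as the mass to be transported.
weights <- dtm / rowSums(dtm)

# With one target vector, all mass flows to it, so the cost is a weighted sum.
cost <- as.vector(weights %*% word_cost)
names(cost) <- rownames(dtm)

# Invert and standardize so that larger values mean more engagement.
engagement <- as.vector(scale(-cost))
names(engagement) <- rownames(dtm)
```

In this toy setup, doc_space has the lower transport cost to "space" and therefore the higher engagement score.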
The concept vector, or vectors, can be specified in several ways. The simplest is to select a single word from the word embeddings. The analyst can also specify the concept with a few words, in which case the algorithm splits the overall flow among the concept words (roughly) depending on which concept word each word in the document is nearest. The concept words need not be in the DTM, but they must be in the word embeddings (the function will either stop or remove words not in the embeddings).
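As a rough illustration of how flow might be split among several concept words, the base R sketch below sends each document word to whichever concept word it is closest to by cosine distance. It assumes a nearest-neighbor reading of the paragraph above; the toy vectors and word lists are invented, and this is not the package's own code.

```r
# Toy embedding matrix (values invented): rows are words.
wv <- matrix(
  c(0.9, 0.1, 0.0,
    0.8, 0.2, 0.1,
    0.0, 0.1, 0.9,
    0.1, 0.0, 0.8),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("space", "rocket", "money", "tax"), NULL)
)

# Scale every word vector to unit length so dot products are cosines.
unit <- wv / sqrt(rowSums(wv^2))

concepts  <- c("space", "money")  # hypothetical concept words
doc_words <- c("rocket", "tax")   # hypothetical words found in a document

# Cosine distance from each document word (rows) to each concept word (columns).
dists <- 1 - unit[doc_words, , drop = FALSE] %*% t(unit[concepts, , drop = FALSE])

# Each document word sends its mass to the nearest concept word.
nearest <- concepts[apply(dists, 1, which.min)]
names(nearest) <- doc_words
```

Here "rocket" is assigned to "space" and "tax" to "money", so a document's total cost would be the frequency-weighted sum of those minimum distances.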
Instead of selecting a word already in the embedding space, the function can
also take a vector extracted from the embedding space in the form of a
centroid (which averages the vectors of several words), a direction (which
uses the offset of several juxtaposing words), or a region (which is built
by clustering words into k regions). The get_centroid(), get_direction(),
and get_regions() functions will extract these.
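To make the centroid and direction ideas concrete, here is a minimal base R sketch. It assumes a centroid is the plain average of the anchor vectors and a direction is the average of pairwise offsets between juxtaposed anchors; the package functions may differ in detail. The toy vectors and anchor lists are invented for illustration.

```r
# Toy embedding matrix (values invented): rows are words.
wv <- matrix(
  c(1.0, 0.0,  0.0,
    0.8, 0.2,  0.0,
    0.6, 0.4,  0.0,
    0.0, 0.5,  1.0,
    0.0, 0.5, -1.0,
    0.2, 0.3,  0.8,
    0.2, 0.3, -0.8),
  nrow = 7, byrow = TRUE,
  dimnames = list(
    c("spacecraft", "rocket", "moon", "rich", "poor", "wealthy", "needy"),
    NULL
  )
)

# Centroid: average the vectors of several related words.
space_words <- c("spacecraft", "rocket", "moon")
centroid <- colMeans(wv[space_words, , drop = FALSE])

# Direction: average the offsets between paired, juxtaposed words.
additions <- c("rich", "wealthy")  # one pole (hypothetical anchors)
subtracts <- c("poor", "needy")    # opposite pole (hypothetical anchors)
direction <- colMeans(
  wv[additions, , drop = FALSE] - wv[subtracts, , drop = FALSE]
)
```

Either resulting vector has the same length as a word vector, which is why it can stand in for a concept word as the transport target.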
Value
Returns a data frame with the first column as document ids and each
subsequent column as the CMD engagement corresponding to each
concept word or concept vector. The upper and lower bound
estimates will follow each unique CMD if sens_interval = TRUE.
Author(s)
Dustin Stoltz and Marshall Taylor
References
Stoltz, Dustin S., and Marshall A. Taylor. (2019)
'Concept Mover's Distance' Journal of Computational
Social Science 2(2):293-313.
doi:10.1007/s42001-019-00048-6.
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Integrating semantic
directions with concept mover's distance to measure binary concept
engagement.' Journal of Computational Social Science 1-12.
doi:10.1007/s42001-020-00075-8.
Taylor, Marshall A., and Dustin S. Stoltz.
(2020) 'Concept Class Analysis: A Method for Identifying Cultural
Schemas in Texts.' Sociological Science 7:544-569.
doi:10.15195/v7.a23.
See Also
CoCA, get_direction, get_centroid
Examples
# attach the package, then load example word embeddings
library(text2map)
data(ft_wv_sample)
# load example text
data(jfk_speech)
# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)
# example 1
cm.dists <- CMDist(dtm,
  cw = "space",
  wv = ft_wv_sample
)
# example 2
space <- c("spacecraft", "rocket", "moon")
cen <- get_centroid(anchors = space, wv = ft_wv_sample)
cm.dists <- CMDist(dtm,
  cv = cen,
  wv = ft_wv_sample
)