get_regions {text2map} | R Documentation |
Word embedding semantic region extractor
Description
Given a set of word embeddings of d dimensions and v vocabulary,
get_regions() finds k semantic regions in d dimensions.
This, in effect, learns latent topics from an embedding space (a.k.a.
topic modeling), which are directly comparable to both terms (with
cosine similarity) and documents (with Concept Mover's distance
using CMDist()).
Usage
get_regions(wv, k_regions = 5L, max_iter = 20L, seed = 0)
Arguments
wv |
Matrix of word embedding vectors (a.k.a. an embedding model) with rows as words. |
k_regions |
Integer indicating the number of regions (k) to return. |
max_iter |
Integer indicating the maximum number of iterations before k-means terminates. |
seed |
Integer indicating a random seed. Default is 0, which calls 'std::time(NULL)'. |
Details
To group words into more encompassing "semantic regions" we use k-means
clustering. We choose k-means primarily for its ubiquity and the wide
range of available diagnostic tools for k-means clustering.
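One common diagnostic of this kind is the "elbow" check on the total within-cluster sum of squares. The sketch below is only an illustration: it uses base R's stats::kmeans() as a stand-in for the clustering step, and toy_wv and true_centers are hypothetical toy data, not objects from this package.

```r
# Toy stand-in for an embedding matrix: 60 "words" in 5 dimensions,
# drawn around three well-separated centers (hypothetical data).
set.seed(1984)
true_centers <- rbind(rep(-4, 5), rep(0, 5), rep(4, 5))
toy_wv <- true_centers[rep(1:3, each = 20), ] +
  matrix(rnorm(60 * 5, sd = 0.5), nrow = 60)
rownames(toy_wv) <- paste0("word", seq_len(nrow(toy_wv)))

# Total within-cluster sum of squares for k = 1..6, using stats::kmeans()
# as a stand-in for the clustering step (an assumption of this sketch).
k_values <- 1:6
wss <- vapply(
  k_values,
  function(k) {
    kmeans(toy_wv, centers = k, nstart = 25, iter.max = 20)$tot.withinss
  },
  numeric(1)
)

# The "elbow" is the k after which adding regions reduces the error little.
print(data.frame(k = k_values, wss = round(wss, 1)))
```

On real embeddings the elbow is rarely this clean, so the choice of k_regions usually remains a judgment call.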
A word embedding matrix of d dimensions and v vocabulary is
"clustered" into k semantic regions which have d dimensions.
Each region is represented by a single point defined by a d-dimensional
vector. The process discretely assigns each word vector to a single
region so as to minimize some error function. However, as the resulting
regions are in the same dimensions as the word embeddings, we can measure
each term's similarity to each region. This, in effect, is a mixed
membership topic model similar to topic modeling by Latent Dirichlet
Allocation.
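The step from hard assignment to graded membership can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: stats::kmeans() stands in for the clustering step, and toy_wv, unit_rows(), and word_by_region are hypothetical names introduced here.

```r
# Toy stand-in for an embedding matrix: v = 6 words, d = 3 dimensions
# (hypothetical values, chosen so that two groups are obvious).
toy_wv <- rbind(
  cat    = c(0.9, 0.1, 0.0),
  dog    = c(0.8, 0.2, 0.1),
  horse  = c(0.7, 0.3, 0.0),
  bank   = c(0.1, 0.9, 0.2),
  loan   = c(0.0, 0.8, 0.3),
  credit = c(0.2, 0.7, 0.1)
)

# Hard assignment: each word is placed in exactly one of k = 2 regions.
set.seed(1984)
fit <- kmeans(toy_wv, centers = 2, nstart = 10)
regions <- fit$centers  # k x d matrix: one point per region
rownames(regions) <- paste0("region_", seq_len(nrow(regions)))

# Graded membership: cosine similarity of every word to every region.
# Each row is scaled to unit length, so the product gives cosines.
unit_rows <- function(m) m / sqrt(rowSums(m^2))
word_by_region <- unit_rows(toy_wv) %*% t(unit_rows(regions))

print(dim(regions))              # k regions by d dimensions
print(round(word_by_region, 2))  # v words by k regions
```

Because the regions live in the same d-dimensional space as the words, every word receives a similarity to every region rather than a single label.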
We use the KMeans_arma function from the ClusterR package, which
uses the Armadillo library.
Value
Returns a matrix of class "dgCMatrix" with k rows and d dimensions.
Author(s)
Dustin Stoltz
References
Butnaru, Andrei M., and Radu Tudor Ionescu. (2017)
'From image to text classification: A novel approach
based on clustering word embeddings.'
Procedia computer science. 112:1783-1792.
doi:10.1016/j.procs.2017.08.211.
Zhang, Yi, Jie Lu, Feng Liu, Qian Liu, Alan Porter,
Hongshu Chen, and Guangquan Zhang. (2018).
'Does Deep Learning Help Topic Extraction? A Kernel
K-Means Clustering Method with Word Embedding.'
Journal of Informetrics. 12(4):1099-1117.
doi:10.1016/j.joi.2018.09.004.
Arseniev-Koehler, Alina, Susan D. Cochran, Vickie M. Mays,
Kai-Wei Chang, and Jacob Gates Foster. (2021).
'Integrating topic modeling and word embedding
to characterize violent deaths.'
doi:10.31235/osf.io/nkyaq.
Examples
# load example word embeddings
data(ft_wv_sample)
my.regions <- get_regions(
  wv = ft_wv_sample,
  k_regions = 10L,
  max_iter = 10L,
  seed = 01984
)