R: Word embedding semantic region extractor

get_regions {text2map}

R Documentation

Word embedding semantic region extractor

Description

Given a set of word embeddings with d dimensions and a vocabulary of v words, get_regions() finds k semantic regions in the same d-dimensional space. In effect, this learns latent topics from an embedding space (a.k.a. topic modeling). The resulting regions are directly comparable to both terms (with cosine similarity) and documents (with Concept Mover's Distance using CMDist()).

Usage

get_regions(wv, k_regions = 5L, max_iter = 20L, seed = 0)

Arguments

`wv`	Matrix of word embedding vectors (a.k.a. embedding model) with rows as words.
`k_regions`	Integer indicating the number of regions, k, to return.
`max_iter`	Integer indicating the maximum number of iterations before k-means terminates.
`seed`	Integer indicating a random seed. Default is 0, which calls 'std::time(NULL)'.

Details

To group words into more encompassing "semantic regions" we use k-means clustering. We choose k-means primarily for its ubiquity and the wide range of diagnostic tools available for k-means clustering.

A word embedding matrix with d dimensions and a vocabulary of v words is "clustered" into k semantic regions, each of which also has d dimensions. Each region is represented by a single point, defined by a d-dimensional vector. The process discretely assigns every word vector to one region so as to minimize some error function. However, because the resulting regions have the same dimensions as the word embeddings, we can also measure each term's similarity to each region. This, in effect, is a mixed membership topic model similar to topic modeling by Latent Dirichlet Allocation.
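The two steps described above (hard assignment by k-means, then graded similarity of every term to every region) can be sketched in base R. This is only an illustration under stated assumptions: it uses stats::kmeans() rather than the routine get_regions() calls internally, and the toy matrix, its row names, and the helper unit_rows() are made up for the example.

```r
# Toy "embedding" matrix: 6 words (rows) x 4 dimensions (columns).
# The values and word labels are invented for illustration only.
set.seed(1984)
wv <- rbind(
  matrix(rnorm(12, mean = 2, sd = 0.1), nrow = 3),
  matrix(rnorm(12, mean = -2, sd = 0.1), nrow = 3)
)
rownames(wv) <- c("cat", "dog", "bird", "tax", "loan", "bank")

# Hard assignment: base R k-means with k = 2 regions.
# Each word is assigned to exactly one region.
km <- kmeans(wv, centers = 2, iter.max = 20, nstart = 5)
regions <- km$centers  # k x d matrix, one row per region
rownames(regions) <- paste0("region_", seq_len(nrow(regions)))

# Graded membership: cosine similarity of every word to every region.
# unit_rows() is a hypothetical helper that scales each row to length 1.
unit_rows <- function(m) m / sqrt(rowSums(m^2))
term_region_sim <- unit_rows(wv) %*% t(unit_rows(regions))  # v x k matrix

print(km$cluster)
print(round(term_region_sim, 3))
```

Because the region vectors live in the same space as the word vectors, the same cosine computation works for any term, including terms that were assigned to a different region.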

We use the KMeans_arma function from the ClusterR package, which uses the Armadillo library.
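For readers who want to see the underlying routine on its own, the following is a minimal sketch of calling ClusterR::KMeans_arma() directly on a toy matrix. It assumes the ClusterR package is installed (the code is skipped otherwise), and it is not a reproduction of how get_regions() wraps the call; the data and the argument values are chosen only for illustration.

```r
# Run only if ClusterR is available; otherwise do nothing.
if (requireNamespace("ClusterR", quietly = TRUE)) {
  # Toy data: 20 "words" x 4 dimensions in two well-separated groups.
  set.seed(1984)
  wv <- rbind(
    matrix(rnorm(40, mean = 2, sd = 0.1), nrow = 10),
    matrix(rnorm(40, mean = -2, sd = 0.1), nrow = 10)
  )

  # KMeans_arma() returns the centroids as a matrix (clusters x dimensions).
  centers <- ClusterR::KMeans_arma(
    wv,
    clusters = 2,
    n_iter = 20,
    seed = 1984
  )
  print(dim(centers))
}
```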

Value

Returns a matrix of class "dgCMatrix" with k rows (one per region) and d columns (dimensions).

Author(s)

Dustin Stoltz

References

Butnaru, Andrei M., and Radu Tudor Ionescu. (2017). 'From image to text classification: A novel approach based on clustering word embeddings.' Procedia Computer Science. 112:1783-1792. doi:10.1016/j.procs.2017.08.211.
Zhang, Yi, Jie Lu, Feng Liu, Qian Liu, Alan Porter, Hongshu Chen, and Guangquan Zhang. (2018). 'Does Deep Learning Help Topic Extraction? A Kernel K-Means Clustering Method with Word Embedding.' Journal of Informetrics. 12(4):1099-1117. doi:10.1016/j.joi.2018.09.004.
Arseniev-Koehler, Alina, Susan D. Cochran, Vickie M. Mays, Kai-Wei Chang, and Jacob Gates Foster. (2021). 'Integrating topic modeling and word embedding to characterize violent deaths.' doi:10.31235/osf.io/nkyaq.

Examples


# load example word embeddings
data(ft_wv_sample)

my.regions <- get_regions(
  wv = ft_wv_sample,
  k_regions = 10L,
  max_iter = 10L,
  seed = 1984
)

[Package text2map version 0.2.0 Index]