find_transformation {text2map}    R Documentation

Find a specified matrix transformation

Description

Given a matrix, B, of word embedding vectors (source) with terms as rows, this function finds a transformed matrix following a specified operation. The operations include centering (i.e., translation) and normalization (i.e., scaling). In the first, B is centered by subtracting column means. In the second, B is normalized by the L2 norm. Both have been found to improve word embedding representations. The function can also find a transformed matrix that approximately aligns B with another matrix, A, of word embedding vectors (reference), using a Procrustes transformation (see Details). Finally, given a term co-occurrence matrix built on a local corpus, the function can "retrofit" pretrained embeddings to better match the local corpus.
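The two simplest operations can be illustrated in base R without the package. This is only a minimal sketch on a made-up matrix, not the package's implementation. In particular, the text above does not say whether the L2 norm is taken per term (row) or over the whole matrix; the sketch assumes per-row scaling, a common convention for word vectors.

```r
# Toy embedding matrix (hypothetical data): 4 terms (rows) x 3 dimensions.
set.seed(1)
B <- matrix(rnorm(12), nrow = 4,
            dimnames = list(c("cat", "dog", "car", "bus"), NULL))

# Centering (translation): subtract each column's mean from that column.
B_centered <- sweep(B, 2, colMeans(B), FUN = "-")

# Normalization (scaling): divide each row by its L2 norm.
# Per-row scaling is an assumption of this sketch.
B_normed <- B / sqrt(rowSums(B^2))

colMeans(B_centered)   # effectively zero in every column
rowSums(B_normed^2)    # effectively one for every term
```

After centering, every column mean is zero up to floating-point error; after normalizing, every term vector has unit length, so dot products between rows equal cosine similarities.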

Usage

find_transformation(
  wv,
  ref = NULL,
  method = c("align", "norm", "center", "retrofit")
)

Arguments

wv

Matrix of word embedding vectors (a.k.a. embedding model) with rows as terms (the source matrix to be transformed).

ref

If method = "align", this is the reference matrix toward which the source matrix is to be aligned.

method

Character vector indicating the method to use for the transformation. Current methods include: "align", "norm", "center", and "retrofit"; see Details.

Details

Aligning a source matrix of word embedding vectors, B, to a reference matrix, A, has primarily been used as a post-processing step for embeddings trained on longitudinal corpora for diachronic analysis, or for cross-lingual embeddings. Aligning preserves internal (cosine) distances while orienting the source embeddings to minimize the sum of squared distances to the reference (and is therefore a least squares problem). Alignment is accomplished with the following steps: translation (centering by column means), scaling (normalizing by the L2 norm), and rotation/reflection (an orthogonal transformation that minimizes the sum of squared differences).

Alignment is asymmetrical and only outputs the transformed source matrix, B. Therefore, it is typically recommended to align B to A, and then A to B. However, simply centering and norming A afterward may be sufficient.
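The Procrustes alignment described in this section can be sketched in base R. This is an illustrative sketch, not the package's code. The helper names (`center_cols`, `scale_frob`, `procrustes_align`) are hypothetical, the rows of both matrices are assumed to refer to the same terms in the same order, and scaling uses the whole-matrix (Frobenius) norm as in classical Procrustes analysis, which may differ from what the package does. The rotation is the standard singular value decomposition solution to the orthogonal Procrustes problem.

```r
# Hypothetical helpers for this sketch.
center_cols <- function(X) sweep(X, 2, colMeans(X), FUN = "-")
scale_frob  <- function(X) X / sqrt(sum(X^2))  # whole-matrix scaling (assumption)

# Align source B to reference A; returns only the transformed source.
procrustes_align <- function(B, A) {
  stopifnot(identical(dim(B), dim(A)))
  B0 <- scale_frob(center_cols(B))   # translation, then scaling
  A0 <- scale_frob(center_cols(A))
  s  <- svd(crossprod(B0, A0))       # SVD of t(B0) %*% A0
  R  <- s$u %*% t(s$v)               # orthogonal map minimizing ||B0 R - A0||
  B0 %*% R
}

# Demo on made-up data: B is a rotated copy of A.
set.seed(42)
A <- matrix(rnorm(20), nrow = 5)     # 5 terms x 4 dimensions
theta <- pi / 6
Q <- diag(4)
Q[1:2, 1:2] <- c(cos(theta), sin(theta), -sin(theta), cos(theta))
B <- A %*% Q

B_aligned <- procrustes_align(B, A)
A0 <- scale_frob(center_cols(A))
max(abs(B_aligned - A0))             # close to zero: the rotation is undone
```

Because the final step is an orthogonal transformation, cosine similarities among the rows of the centered and scaled source are unchanged by it. The function returns only the transformed source, which mirrors the asymmetry noted above: the reference would need its own centering and scaling (or its own alignment pass) to sit in the same space.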

Value

A new word embedding matrix, transformed using the specified method.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. (2018). 'A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings.' In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 789-798.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. (2019). 'An effective approach to unsupervised machine translation.' In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 194-203.
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. (2018). 'Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.' https://arxiv.org/abs/1605.09096v6.
Lin, Zefeng, Xiaojun Wan, and Zongming Guo. (2019). 'Learning Diachronic Word Embeddings with Iterative Stable Information Alignment.' Natural Language Processing and Chinese Computing. 749-60. doi:10.1007/978-3-030-32233-5_58.
Schlechtweg et al. (2019). 'A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains.' https://arxiv.org/abs/1906.02979v1.
Shoemark et al. (2019). 'Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings.' Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 66-76. doi:10.18653/v1/D19-1007.
Borg and Groenen. (1997). Modern Multidimensional Scaling. New York: Springer. 340-342.


[Package text2map version 0.2.0 Index]