find_transformation {text2map} | R Documentation |
Find a specified matrix transformation
Description
Given a matrix, wv, of word embedding vectors (the source) with
terms as rows, this function finds a transformed matrix following a
specified operation. The operations include centering (i.e.,
translation) and normalization (i.e., scaling). In the first, wv is
centered by subtracting column means. In the second, wv is
normalized by the L2 norm. Both have been found to improve
word embedding representations. The function can also find a transformed
matrix that approximately aligns wv with another matrix, ref, of
word embedding vectors (the reference), using a Procrustes
transformation (see Details). Finally, given a term-co-occurrence matrix
built on a local corpus, the function can "retrofit" pretrained
embeddings to better match the local corpus.
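The centering and normalization operations can be sketched in base R. This is an illustration only, not the package's implementation: the toy matrix is made up, and the normalization is assumed to be row-wise (each term vector divided by its own L2 norm), which may differ from how the package applies the L2 norm.

```r
# Toy embedding matrix: 4 terms (rows) by 3 dimensions (columns).
set.seed(1)
wv <- matrix(rnorm(12), nrow = 4,
             dimnames = list(c("cat", "dog", "car", "bus"), NULL))

# Centering (translation): subtract each column's mean from that column.
wv_centered <- sweep(wv, 2, colMeans(wv), "-")

# Normalization (scaling), ASSUMED here to be row-wise:
# divide each term vector by its own L2 norm.
wv_normed <- wv / sqrt(rowSums(wv^2))
```

After these steps, every column of wv_centered has a mean of zero and every row of wv_normed has unit length, up to floating-point error.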
Usage
find_transformation(
wv,
ref = NULL,
method = c("align", "norm", "center", "retrofit")
)
Arguments
wv |
Matrix of word embedding vectors (a.k.a. an embedding model) with rows as terms (the source matrix to be transformed). |
ref |
If |
method |
Character vector indicating the method to use for the transformation. Current methods include: "align", "norm", "center", and "retrofit"; see Details. |
Details
Aligning a source matrix of word embedding vectors, wv, to a
reference matrix, ref, has primarily been used as a post-processing step
for embeddings trained on longitudinal corpora for diachronic analysis
or for cross-lingual embeddings. Aligning preserves internal (cosine)
distances while orienting the source embeddings to minimize the sum of squared
distances to the reference (and is therefore a least squares problem).
Alignment is accomplished with the following steps:
translation: centers by column means
scaling: scales (normalizes) by the L2 norm
rotation/reflection: rotates and reflects to minimize the sum of squared differences, using singular value decomposition
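The three steps above can be sketched in base R as an orthogonal Procrustes problem. This is a minimal illustration under stated assumptions, not the package's code: procrustes_align and center_and_scale are hypothetical helper names, scaling is assumed to use the matrix-wide L2 (Frobenius) norm, and both matrices are assumed to have the same dimensions with row i referring to the same term in each.

```r
# Hypothetical helper, not the text2map implementation.
# Assumes `src` and `ref` have identical dimensions and matching row order.
procrustes_align <- function(src, ref) {
  stopifnot(is.matrix(src), is.matrix(ref), identical(dim(src), dim(ref)))

  # Steps 1 and 2: center by column means, then scale by the
  # matrix-wide L2 (Frobenius) norm. The choice of norm is an assumption.
  center_and_scale <- function(m) {
    m <- sweep(m, 2, colMeans(m), "-")
    m / sqrt(sum(m^2))
  }
  src <- center_and_scale(src)
  ref <- center_and_scale(ref)

  # Step 3: the orthogonal matrix that minimizes the sum of squared
  # differences between src %*% rot and ref comes from the SVD of
  # t(src) %*% ref.
  s <- svd(crossprod(src, ref))
  rot <- s$u %*% t(s$v)

  # Only the transformed source is returned.
  src %*% rot
}

# Toy check: the source is the reference under a random rotation/reflection.
set.seed(42)
ref <- matrix(rnorm(30), nrow = 10)
q   <- qr.Q(qr(matrix(rnorm(9), nrow = 3)))  # random orthogonal matrix
src <- ref %*% q
aligned <- procrustes_align(src, ref)
```

In this toy case the two matrices differ only by an orthogonal transformation, so aligned should match the centered and scaled reference up to floating-point error. With real embeddings the match is only approximate.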
Alignment is asymmetrical and outputs only the transformed source matrix,
wv. Therefore, it is typically recommended to align wv to ref,
and then ref to wv. However, simply centering and norming ref
after alignment may be sufficient.
Value
A new word embedding matrix, transformed using the specified method.
References
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. (2018).
'A robust self-learning method for fully unsupervised
cross-lingual mappings of word embeddings.' In Proceedings
of the 56th Annual Meeting of the Association for
Computational Linguistics. 789-798
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. (2019).
'An effective approach to unsupervised machine translation.'
In Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics. 194-203
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. (2018).
'Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.'
https://arxiv.org/abs/1605.09096v6.
Lin, Zefeng, Xiaojun Wan, and Zongming Guo. (2019).
'Learning Diachronic Word Embeddings with Iterative Stable
Information Alignment.' Natural Language Processing and
Chinese Computing. 749-60. doi:10.1007/978-3-030-32233-5_58.
Schlechtweg et al. (2019). 'A Wind of Change: Detecting and
Evaluating Lexical Semantic Change across Times and Domains.'
https://arxiv.org/abs/1906.02979v1.
Shoemark et al. (2019). 'Room to Glo: A Systematic Comparison
of Semantic Change Detection Approaches with Word Embeddings.'
Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing. 66-76. doi:10.18653/v1/D19-1007
Borg and Groenen. (1997). Modern Multidimensional Scaling.
New York: Springer. 340-342