sim.words {qlcMatrix} | R Documentation |
Similarity-measures for words between two languages, based on co-occurrences in parallel text
Description
Based on co-occurrences in a parallel text, this convenience function (a wrapper around various other functions from this package) efficiently computes something close to translational equivalence.
Usage
sim.words(text1, text2 = NULL, method = res, weight = NULL,
lowercase = TRUE, best = FALSE, tol = 0)
Arguments
text1 , text2 |
Vectors of strings representing sentences. The names of the vectors should contain IDs that identify the parallelism between the two texts. If there are no specific names, the function assumes that the two vectors are perfectly parallel. Within the strings, wordforms are simply separated based on spaces (i.e. everything between two spaces is a wordform). For more details about the format-assumptions, see |
method |
Method to be used as a co-occurrence statistic. See assocSparse for the available methods. |
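weight |
When weight is specified, the similarities are computed using cosSparse instead of assocSparse (see Details). |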
lowercase |
Should all words be turned into lowercase? See |
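best |
When best = TRUE, the ‘best’ translations are additionally extracted and returned as a second sparse matrix (see Details and Value). |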
tol |
Tolerance: remove all values between -tol and +tol from the result (see Details for guidelines). |
Details
Care is taken in this function to match multiple verses that are translated into one verse; see bibles for a survey of the encoding assumptions made here.
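To illustrate what matching by names amounts to, here is a minimal sketch using base R and the Matrix package. It is not the implementation used by sim.words: the toy sentences and the helper incidence() are hypothetical, and the merging of several verses into one is not handled. It only shows how shared names select the parallel sentences and how raw co-occurrence counts of wordforms follow from them.

```r
library(Matrix)

# Hypothetical toy data: the names act as sentence IDs; only v2 and v3 are shared.
text1 <- c(v1 = "in the beginning", v2 = "and god said", v3 = "let there be light")
text2 <- c(v2 = "und gott sprach", v3 = "es werde licht", v4 = "und es ward licht")

# Keep only the sentences whose IDs occur in both texts.
shared <- intersect(names(text1), names(text2))

# Hypothetical helper: sentence-by-wordform count matrix, splitting on single spaces.
incidence <- function(text) {
  tokens <- strsplit(text, " ", fixed = TRUE)
  words <- sort(unique(unlist(tokens)))
  sparseMatrix(
    i = rep(seq_along(tokens), lengths(tokens)),
    j = match(unlist(tokens), words),
    x = 1,
    dims = c(length(tokens), length(words)),
    dimnames = list(names(text), words)
  )
}

m1 <- incidence(text1[shared])
m2 <- incidence(text2[shared])

# Raw co-occurrence counts: wordforms of text1 in rows, wordforms of text2 in columns.
cooc <- crossprod(m1, m2)
cooc["god", ]
```

The statistics selected by method or weight are then computed from counts of this kind rather than read off the raw counts directly.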
The parameter method can take anything that is also available for assocSparse. Similarities are computed using that function.
When weight is specified, the similarities are computed using cosSparse with the default setting of norm = norm2. All available weights can also be used here.
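As a rough illustration of a cosine similarity with Euclidean normalisation, the following sketch computes unweighted cosines between the wordform columns of two small count matrices. It uses only base R and the Matrix package, not cosSparse itself; the matrices m1 and m2 are hypothetical, and weighting is left out entirely.

```r
library(Matrix)

# Hypothetical sentence-by-wordform counts for three parallel sentences.
m1 <- sparseMatrix(i = c(1, 2, 2, 3), j = c(1, 1, 2, 2), x = 1,
                   dimnames = list(NULL, c("god", "light")))
m2 <- sparseMatrix(i = c(1, 2, 3), j = c(1, 1, 2), x = 1,
                   dimnames = list(NULL, c("gott", "licht")))

# Euclidean (L2) norm of every wordform column.
n1 <- sqrt(colSums(m1^2))
n2 <- sqrt(colSums(m2^2))

# Cosine similarity: dot products of the columns divided by the product of their norms.
cos_sim <- as.matrix(crossprod(m1, m2)) / outer(n1, n2)
cos_sim
```

Converting to a dense matrix is only acceptable for a toy example like this one; for real texts the result has to stay sparse.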
The option best = T uses rowMax and colMax. This approach to getting the ‘best’ translation is really crude, but it works reasonably well in one-to-one and many-to-one situations. This option takes considerably more time to finish, as computing row-wise maxima of matrices is not trivial to optimize. Consider raising tol, as this removes low values that won't be important for the maxima anyway. See examples below.
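The idea of picking ‘best’ translations from row-wise and column-wise maxima can be sketched on a small dense matrix in base R. This is an illustration only: the similarity values are invented, and the rule used here (mark a cell when it is the maximum of its row or of its column) is an assumption about how the two maxima might be combined, not a description of what rowMax and colMax do internally.

```r
# Hypothetical similarity values: English wordforms in rows, German in columns.
sim <- matrix(c(5, 1, 0,
                4, 0, 1,
                0, 3, 2),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("your", "thy", "we"), c("dein", "euer", "wir")))

# TRUE where a cell equals the maximum of its row.
row_best <- sweep(sim, 1, apply(sim, 1, max), "==")

# TRUE where a cell equals the maximum of its column.
col_best <- sweep(sim, 2, apply(sim, 2, max), "==")

# Assumed combination rule: a cell is 'best' if it wins its row or its column.
best <- row_best | col_best

# Many-to-one: both "your" and "thy" point to "dein".
which(best["your", ])
which(best["thy", ])
```

Because every row and every column contributes at least one marked cell, a rule like this copes with many-to-one correspondences, but it also marks weak pairs that merely happen to be the maximum of a sparse row or column.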
Guidelines for the value of tol are difficult to give, as it depends on the method used, but also on the distribution of the data (i.e. the number of sentences, and the frequency distribution of the words in the text). Some suggestions:
When weight is specified, results range between -1 and +1. Then tol = 0.1 should never lead to problems, but often even tol = 0.3 or higher will lead to identical results.
When weight is not specified (i.e. assocSparse will be used), then results range between -inf and +inf, so the tolerance is more problematic. In general, tol = 2 seems to be unproblematic. Higher tolerance, e.g. tol = 10, can be used to find the ‘obvious’ translations, but you will lose some of the more incidental co-occurrences.
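The effect of a tolerance can be tried out on a small sparse matrix with drop0 from the Matrix package, which removes stored entries whose absolute value does not exceed tol. Whether sim.words uses drop0 internally is not stated here; the call and the invented values below only serve to show that pruning small values leaves the row-wise maxima untouched while making the matrix sparser.

```r
library(Matrix)

# Hypothetical similarities in the range of a cosine-type statistic.
sim <- Matrix(c(0.05,  0.8, -0.02,
                0.40, -0.3,  0.00,
                0.09,  0.0,  0.60),
              nrow = 3, byrow = TRUE, sparse = TRUE)

# Remove all entries with an absolute value of at most 0.1.
pruned <- drop0(sim, tol = 0.1)

# Fewer stored values after pruning.
nnzero(sim)
nnzero(pruned)

# The position of the maximum in every row is unchanged.
apply(as.matrix(sim), 1, which.max)
apply(as.matrix(pruned), 1, which.max)
```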
Value
When best = F, a single sparse matrix of type dgCMatrix is returned with the values of the statistic chosen. All unique wordforms of text1 are included as row names, and those from text2 as column names.
When best = T, a list of two sparse matrices is returned:
sim |
the same matrix as above |
best |
a sparse pattern matrix of type |
Author(s)
Michael Cysouw
References
Mayer, Thomas and Michael Cysouw. 2012. Language comparison through sparse multilingual word alignment. Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, 54–62. Avignon: Association for Computational Linguistics.
See Also
splitText, assocSparse and cosSparse are the central parts of this function. Also check rowMax, which is used to extract the ‘best’ translations.
Examples
data(bibles)
# ----- small example of co-occurrences -----
# as an example, just take partially overlapping parts of two bibles
# sim.words uses the names to get the parallelism right, so this works
eng <- bibles$eng[1:5000]
deu <- bibles$deu[2000:7000]
sim <- sim.words(eng, deu, method = res)
# but the statistics are not perfect (because there is too little data)
# sorted co-occurrences for the English word "your" in German:
sort(sim["your",], decreasing = TRUE)[1:10]
# ----- complete example of co-occurrences -----
# running the complete bibles takes a bit more time (but still manageable)
system.time(sim <- sim.words(bibles$eng, bibles$deu, method = res))
# results are much better
# sorted co-occurrences for the English word "your" in German:
sort(sim["your",], decreasing = TRUE)[1:10]
# ----- look for 'best' translations -----
# note that selecting the 'best' takes even more time
system.time(sim2 <- sim.words(bibles$eng, bibles$deu, method = res, best = TRUE))
# best co-occurrences for the English word "your"
which(sim2$best["your",])
# but this can be made faster by removing low values
# (though the boundary tol = 5 depends on the method used)
system.time(sim3 <- sim.words(bibles$eng, bibles$deu, best = TRUE, method = res, tol = 5))
# note that the decision on the 'best' remains the same here
all.equal(sim2$best, sim3$best)
# ----- computations also work with other languages -----
# Everything works completely language-independently
# translations for 'we' in Tagalog:
sim <- sim.words(bibles$eng, bibles$tgl, best = TRUE, weight = idf, tol = 0.1)
which(sim$best["we",])