sim.wordlist {qlcMatrix}    R Documentation
Similarity matrices from wordlists
Description
A few different approaches are implemented here to compute similarities from wordlists. sim.lang
computes similarities between languages, assuming a harmonized orthography (i.e. symbols can be equated across languages). sim.con
computes similarities between concepts, using only language-internal similarities. sim.graph
computes similarities between graphemes (i.e. language-specific symbols) between languages, as a crude approximation of regular sound correspondences.
WARNING: All these methods are really very crude! If they seem to give expected results, then this should be a lesson to rethink more complex methods proposed in the literature. However, in most cases the methods implemented here should be taken as a proof-of-concept, showing that such high-level similarities can be computed efficiently for large datasets. For actual research, I strongly urge anybody to adapt the current methods, and fine-tune them as needed.
Usage
sim.lang(wordlist,
doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
method = "parallel", assoc.method = res, weight = NULL, sep = "")
sim.con(wordlist,
doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
method = "bigrams", assoc.method = res, weight = NULL, sep = "")
sim.graph(wordlist,
doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "TOKENS",
method = "cooccurrence", assoc.method = poi, weight = NULL, sep = " ")
Arguments
wordlist
    Dataframe or matrix containing the wordlist data. Should have at least columns corresponding to languages (DOCULECT), meanings (CONCEPT) and translations (COUNTERPART).
doculects, concepts, counterparts
    The names (or numbers) of the columns of wordlist in which the languages, concepts and translations are to be found.
method
    Specific approach for the computation of the similarities. See Details below.
assoc.method, weight
    Measures to be used internally (passed on to assocSparse or cosSparse). See Details below.
sep
    Separator to be used to split the strings. See splitWordlist for details.
Details
The following methods are currently implemented (all methods can be abbreviated):

For sim.lang:

global
    Global bigram similarity, i.e. ignoring the separation into concepts, and simply taking the bigram vector of all words per language. Probably best combined with weight = idf.
parallel
    By default, computes a parallel bigram similarity, i.e. splitting the bigram vectors per language and per concept, and then simply making one long vector per language from all individual concept-bigram vectors. This approach seems to be very similar to (if not slightly better than) the widespread 'average Levenshtein' distance.
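To make the notion of a "bigram vector" concrete, here is a minimal base-R sketch of extracting the bigrams of a single word. Note that the actual splitting in the package is done by splitWordlist; the helper name bigrams below is purely illustrative.

```r
# Illustrative only: extract the bigrams of one word, as used conceptually
# for the bigram vectors described above. The real splitting is handled by
# splitWordlist; this helper is a hypothetical stand-in.
bigrams <- function(word) {
  chars <- strsplit(word, "")[[1]]
  paste0(head(chars, -1), tail(chars, -1))
}
bigrams("hand")
# "ha" "an" "nd"
```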
For sim.con:

colexification
    Simply counts the number of languages in which two concepts have at least one completely identical translation. No normalization is attempted, and assoc.method and weight are ignored (internally this just uses tcrossprod on the CW (concepts x words) sparse matrix). Because no splitting of strings is necessary, this method is very quick.
global
    Global bigram similarity, i.e. ignoring the separation into languages, and simply taking the bigram vector of all words per concept. Probably best combined with weight = idf.
bigrams
    By default, computes the similarity between concepts by comparing bigraphs, i.e. language-specific bigrams. In that way, cross-linguistically recurrent partial similarities are uncovered. It is very interesting to compare this measure with colexification above.
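The tcrossprod shortcut behind the colexification method can be sketched on a toy concepts x words pattern matrix. The matrix values here are invented for illustration; the package builds the real CW matrix via splitWordlist.

```r
library(Matrix)  # sparse matrices, as used internally by qlcMatrix
# Toy CW matrix: 2 concepts x 3 words; an entry of 1 means the word is a
# counterpart of that concept. Word 3 expresses both concepts, i.e. a
# colexification.
CW <- Matrix(c(1, 0, 1,
               0, 1, 1), nrow = 2, byrow = TRUE, sparse = TRUE)
CC <- tcrossprod(CW)  # concept x concept counts of shared words
CC
# the off-diagonal entry 1 records that the two concepts share one
# identical counterpart
```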
For sim.graph:

cooccurrence
    Currently the only method implemented. Computes the co-occurrence statistics for all pairs of graphemes (e.g. between symbol x from language L1 and symbol y from language L2). See Prokic & Cysouw (2013) for an example using this approach.
All these methods (except for sim.con(method = "colexification")) use either assocSparse or cosSparse for the computation of the similarities. For the different measures available, see the documentation there. Currently implemented are res, poi, pmi, wpmi for assocSparse, and idf, isqrt, none for the weighting in cosSparse. It is actually very easy to define your own measure.
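For example, assuming that measures for assocSparse are functions of observed (o) and expected (e) values, in the style of the built-in res, a hypothetical custom measure might look like this (the name my.ratio is invented for illustration):

```r
# Hypothetical custom measure in the style assumed for assocSparse:
# a function of observed (o) and expected (e) values, like the built-in
# res = (o - e) / sqrt(e). This one is a plain observed/expected ratio.
my.ratio <- function(o, e) { o / e }

# It could then (under this assumption) be passed on, e.g.:
# sim.lang(huber, method = "parallel", assoc.method = my.ratio)
my.ratio(4, 2)
# 2
```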
When weight = NULL, then assocSparse is used with the internal method as specified in assoc.method. When weight is specified, then cosSparse is used with a Euclidean norm and the weighting as specified in weight. When weight is specified, the specification of assoc.method is ignored.
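This selection logic can be paraphrased as follows. This is a sketch of the documented behaviour, not the package's actual internals, and choose_backend is an invented name:

```r
# Sketch of the dispatch described above: weight = NULL means assocSparse
# with assoc.method; a non-NULL weight switches to cosSparse and makes
# assoc.method irrelevant. Invented helper, for illustration only.
choose_backend <- function(assoc.method = "res", weight = NULL) {
  if (is.null(weight)) {
    list(backend = "assocSparse", option = assoc.method)
  } else {
    list(backend = "cosSparse", option = weight)
  }
}
choose_backend()$backend
# "assocSparse"
choose_backend(assoc.method = "poi", weight = "idf")$backend
# "cosSparse"  (assoc.method is ignored)
```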
Value
A sparse similarity matrix of class dsCMatrix. The magnitude of the actual values in the matrices depends strongly on the method chosen.
With sim.graph
a list of two matrices is returned.
GG
    The grapheme-by-grapheme similarity matrix, of class dsCMatrix.
GD
    A sparse pattern matrix indicating which grapheme belongs to which language.
Author(s)
Michael Cysouw
References
Prokic, Jelena and Michael Cysouw. 2013. Combining regular sound correspondences and geographic spread. Language Dynamics and Change 3(2). 147–168.
See Also
Based on splitWordlist
for the underlying conversion of the wordlist into sparse matrices. The actual similarities are mostly computed using assocSparse
or cosSparse
.
Examples
# ----- load data -----
# an example wordlist, see help(huber) for details
data(huber)
# ----- similarity between languages -----
# most time is spent splitting the strings
# the rest does not really influence the time needed
system.time( sim <- sim.lang(huber, method = "p") )
# a simple distance-based UPGMA tree
## Not run:
# note non-ASCII characters in data might lead to plot errors on some platforms
plot(hclust(as.dist(-sim), method = "average"), cex = .7)
## End(Not run)
# ----- similarity between concepts -----
# similarity based on bigrams
system.time( simB <- sim.con(huber, method = "b") )
# similarity based on colexification, which is much easier to calculate
system.time( simC <- sim.con(huber, method = "c") )
# As an example, look at all adjectival concepts
adj <- c(1,5,13,14,28,35,40,48,67,89,105,106,120,131,137,146,148,
171,179,183,188,193,195,206,222,234,259,262,275,279,292,
294,300,309,341,353,355,359)
# show them as trees
## Not run:
# note non-ASCII characters in data might lead to plot errors on some platforms
oldpar<-par("mfrow")
par(mfrow = c(1,2))
plot(hclust(as.dist(-simB[adj,adj]), method = "ward.D2"),
cex = .5, main = "bigrams")
plot(hclust(as.dist(-simC[adj,adj]), method = "ward.D2"),
cex = .5, main = "colexification")
par(mfrow = oldpar)
## End(Not run)
# ----- similarity between graphemes -----
# this is a very crude approach towards regular sound correspondences
# when the languages are not too distantly related, it works rather nicely
# can be used as a quick first guess of correspondences for input in more advanced methods
# all 2080 graphemes in the data by all 2080 graphemes, from all languages
system.time( X <- sim.graph(huber) )
# throw away the low values
# select just one pair of languages for a quick visualisation
X$GG <- drop0(X$GG, tol = 1)
colnames(X$GG) <- rownames(X$GG)
correspondences <- X$GG[X$GD[,"bora"],X$GD[,"muinane"]]
## Not run:
# note non-ASCII characters in data might lead to plot errors on some platforms
heatmap(as.matrix(correspondences))
## End(Not run)