splitWordlist {qlcMatrix} | R Documentation |
Construct sparse matrices from comparative wordlists (aka ‘Swadesh list’)
Description
A comparative wordlist (aka ‘Swadesh list’) is a collection of wordforms from different languages, which are translations of a selected set of meanings. This function dismantles this data structure into a set of sparse matrices.
Usage
splitWordlist(data,
doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
splitstrings = TRUE, sep = "", bigram.binder = "", grapheme.binder = "_",
simplify = FALSE)
Arguments
data |
A dataframe or matrix with each row describing a combination of language (DOCULECT), meaning (CONCEPT) and translation (COUNTERPART). |
doculects, concepts, counterparts |
The name (or number) of the column of data in which the respective information is stored. |
splitstrings |
Should the counterparts be separated into unigrams and bigrams (using splitStrings)? Defaults to TRUE. |
sep |
Separator to be passed to splitStrings, used to split the counterparts. Defaults to sep = "", i.e. the strings are split into individual characters. |
bigram.binder |
Separator to be passed to splitStrings, inserted between the two unigrams that form a bigram. Defaults to "". |
grapheme.binder |
Separator to be used between a grapheme and the language name. Graphemes are language-specific symbols (i.e. the 'a' in one language is not assumed to be the same as the 'a' in another language). Defaults to "_". |
simplify |
Should the output be reduced to the most important matrices only, with the row and column names included in the matrices? Defaults to FALSE. |
Details
The meanings that are selected for a wordlist are called CONCEPTS here, and the translations into the various languages COUNTERPARTS (following Poornima & Good 2010). The languages are called DOCULECTS (‘documented lects’) to generalize over their status as dialects, languages, or even small families (following Cysouw & Good 2013).
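For concreteness, a minimal wordlist in this format might look as follows (hypothetical toy data, using the default column names):

```r
# A minimal hypothetical wordlist: one row per combination of
# doculect, concept and counterpart, using the default column names
wordlist <- data.frame(
  DOCULECT    = c("spanish", "spanish", "german", "german"),
  CONCEPT     = c("HAND", "ARM", "HAND", "ARM"),
  COUNTERPART = c("mano", "brazo", "hand", "arm"),
  stringsAsFactors = FALSE
)
# splitWordlist(wordlist) dismantles this structure into sparse matrices,
# e.g. a doculect-by-word matrix (DW) and a concept-by-word matrix (CW)
```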
Value
There are four different possible outputs, depending on the options chosen.
By default, when splitstrings = TRUE, simplify = FALSE, the following list of 15 objects is returned. It starts with 8 different character vectors, which are in fact the row/column names of the 7 sparse pattern matrices that follow. The naming of the objects is intended to make everything easy to remember.
doculects |
Character vector with names of doculects in the data |
concepts |
Character vector with names of concepts in the data |
words |
Character vector with all words, i.e. unique counterparts per doculect. The same string in the same doculect is only included once, but an identical string occurring in different doculects is included separately for each doculect. |
segments |
Character vector with all unigram-tokens in order of appearance, including boundary symbols and gap symbols (see splitStrings) |
unigrams |
Character vector with all unique unigrams in the data |
bigrams |
Character vector with all unique bigrams in the data |
graphemes |
Character vector with all unique graphemes (i.e. combinations of unigrams+doculects) occurring in the data |
digraphs |
Character vector with all unique digraphs (i.e. combinations of bigrams+doculects) occurring in the data |
DW |
Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W) |
CW |
Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W) |
SW |
Sparse pattern matrix of class ngCMatrix linking segments (S) to words (W) |
US |
Sparse pattern matrix of class ngCMatrix linking unigrams (U) to segments (S) |
BS |
Sparse pattern matrix of class ngCMatrix linking bigrams (B) to segments (S) |
GS |
Sparse pattern matrix of class ngCMatrix linking graphemes (G) to segments (S) |
TS |
Sparse pattern matrix of class ngCMatrix linking digraphs (T, as D is already used for doculects) to segments (S) |
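These pattern matrices can be combined by matrix multiplication. For example, multiplying DW with the transpose of CW yields a doculect-by-concept matrix counting how many counterparts each doculect has per concept. A toy sketch with hand-built matrices (the real matrices are produced by splitWordlist; only the Matrix package, shipped with R, is assumed):

```r
library(Matrix)

# Toy pattern matrices for 2 doculects, 2 concepts and 3 words.
# DW: which word belongs to which doculect (doculect 1 has words 1-2, doculect 2 has word 3)
DW <- sparseMatrix(i = c(1, 1, 2), j = c(1, 2, 3), dims = c(2, 3))
# CW: which word translates which concept (words 1 and 3 = concept 1, word 2 = concept 2)
CW <- sparseMatrix(i = c(1, 2, 1), j = c(1, 2, 3), dims = c(2, 3))

# doculect-by-concept matrix: number of counterparts per doculect and concept
# (the "* 1" coerces the pattern matrices to numeric before multiplication)
DC <- (DW * 1) %*% t(CW * 1)
```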
When splitstrings = FALSE, simplify = FALSE, only the following objects from the above list are returned:
doculects |
Character vector with names of doculects in the data |
concepts |
Character vector with names of concepts in the data |
words |
Character vector with all words, i.e. unique counterparts per doculect. The same string in the same doculect is only included once, but an identical string occurring in different doculects is included separately for each doculect. |
DW |
Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W) |
CW |
Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W) |
When splitstrings = TRUE, simplify = TRUE, only the bigram-separation is returned, and all row and column names are included in the matrices. However, for reasons of space, the words vector is only included once:
DW |
Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W) |
CW |
Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W) |
BW |
Sparse pattern matrix of class ngCMatrix linking bigrams (B) to words (W) |
Finally, when splitstrings = FALSE, simplify = TRUE, only the following subset of the above is returned:
DW |
Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W) |
CW |
Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W) |
Note
Note that the default behavior probably overgenerates information (specifically when splitstrings = TRUE), and might perform unnecessary computation for specific goals. In practice, it might be useful to tweak the underlying code (mainly by removing unnecessary steps) to optimize performance.
Author(s)
Michael Cysouw
References
Cysouw, Michael & Jeff Good. 2013. Languoid, Doculect, Glossonym: Formalizing the notion “language”. Language Documentation and Conservation 7. 331-359.
Poornima, Shakthi & Jeff Good. 2010. Modeling and Encoding Traditional Wordlists for Machine Applications. Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground.
See Also
sim.wordlist
for various quick similarities that can be computed using these matrices.
Examples
# ----- load data -----
# an example wordlist, see the help(huber) for details
data(huber)
# ----- show output -----
# a selection, to see the result of splitWordlist
# only show the simplified output here,
# the full output is rather long even for just these six words
sel <- c(1:3, 1255:1258)
splitWordlist(huber[sel,], simplify = TRUE)
# ----- split complete data -----
# splitting the complete wordlist is a lot of work!
# it won't get much quicker than this:
# most time goes into the string-splitting of the almost 26,000 words
# Default version, including splitStrings:
system.time( H <- splitWordlist(huber) )
# Simplified version without splitStrings is much quicker:
system.time( H <- splitWordlist(huber, splitstrings = FALSE, simplify = TRUE) )
# ----- investigate colexification -----
# The simple version can be used to check how often two concepts
# are expressed identically across all languages ('colexification')
H <- splitWordlist(huber, splitstrings = FALSE, simplify = TRUE)
sim <- tcrossprod(H$CW*1)
# select only the frequent colexifications for a quick visualisation
diag(sim) <- 0
sim <- drop0(sim, tol = 5)
sim <- sim[rowSums(sim) > 0, colSums(sim) > 0]
## Not run:
# this might lead to errors on some platforms because of non-ASCII symbols
plot( hclust(as.dist(-sim), method = "average"), cex = .5)
## End(Not run)
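The tcrossprod step above can be illustrated with a toy concept-by-word matrix: entry (i, j) of tcrossprod(CW) counts how many words express both concept i and concept j. A hypothetical mini-example, independent of the huber data:

```r
library(Matrix)

# Toy concept-by-word pattern matrix: word 1 expresses both HAND and ARM,
# word 2 expresses only ARM, word 3 expresses only LEG
CW <- sparseMatrix(i = c(1, 2, 2, 3), j = c(1, 1, 2, 3), dims = c(3, 3),
                   dimnames = list(c("HAND", "ARM", "LEG"), NULL))

# concept-by-concept colexification counts
sim <- tcrossprod(CW * 1)
# sim["HAND", "ARM"] is 1: one word colexifies HAND and ARM
```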
# ----- investigate regular sound correspondences -----
# One central problem with data from many languages is orthographic variation
# It is preferable to solve that problem separately,
# e.g. check the column "TOKENS" in the huber data:
# it contains a grapheme-separated version of the counterparts,
# which can be used to investigate the co-occurrence of graphemes (approx. phonemes)
H <- splitWordlist(huber, counterparts = "TOKENS", sep = " ")
# co-occurrence of all pairs of the 2150 different graphemes through all languages
system.time( G <- assocSparse( (H$CW*1) %*% t(H$SW*1) %*% t(H$GS*1), method = poi))
rownames(G) <- colnames(G) <- H$graphemes
G <- drop0(G, tol = 1)
# select only one language pair for a quick visualisation
# check the nice sound changes between bora and muinane!
GD <- H$GS %*% H$SW %*% t(H$DW)
colnames(GD) <- H$doculects
correspondences <- G[GD[,"bora"] > 0, GD[,"muinane"] > 0]
## Not run:
# this might lead to errors on some platforms because of non-ASCII symbols
heatmap(as.matrix(correspondences))
## End(Not run)