splitWordlist {qlcMatrix} | R Documentation |
Construct sparse matrices from comparative wordlists (aka ‘Swadesh list’)
Description
A comparative wordlist (aka ‘Swadesh list’) is a collection of wordforms from different languages, which are translations of a selected set of meanings. This function dismantles this data structure into a set of sparse matrices.
Usage
splitWordlist(data,
doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
splitstrings = TRUE, sep = "", bigram.binder = "", grapheme.binder = "_",
simplify = FALSE)
Arguments
data |
A dataframe or matrix with each row describing a combination of language (DOCULECT), meaning (CONCEPT) and translation (COUNTERPART). |
doculects, concepts, counterparts |
The name (or number) of the column of data in which the respective information is stored. |
splitstrings |
Should the counterparts be separated into unigrams and bigrams (using splitStrings)? Defaults to TRUE. |
sep |
Separator to be passed to splitStrings, used to split the counterparts. Defaults to sep = "", i.e. the strings are split into individual characters. |
bigram.binder |
Separator to be passed to splitStrings, inserted between the two unigrams that form a bigram. Defaults to "". |
grapheme.binder |
Separator to be used between a grapheme and the language name. Graphemes are language-specific symbols (i.e. the 'a' in one language is not assumed to be the same as the 'a' in another language). Defaults to "_". |
simplify |
Should the output be reduced to the most important matrices only, with the row and column names included in the matrices? Defaults to FALSE. |
Details
The meanings that are selected for a wordlist are called CONCEPTS here, and the translations into the various languages COUNTERPARTS (following Poornima & Good 2010). The languages are called DOCULECTS (‘documented lects’) to generalize over their status as dialects, languages, or even small families (following Cysouw & Good 2013).
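For concreteness, a minimal wordlist in this format might look as follows (hypothetical toy data, using the default column names):

```r
# A minimal hypothetical wordlist: one row per combination of
# doculect, concept and counterpart, using the default column names
wordlist <- data.frame(
  DOCULECT    = c("spanish", "spanish", "german", "german"),
  CONCEPT     = c("HAND", "ARM", "HAND", "ARM"),
  COUNTERPART = c("mano", "brazo", "hand", "arm"),
  stringsAsFactors = FALSE
)
# splitWordlist(wordlist) dismantles this structure into sparse matrices,
# e.g. a doculect-by-word matrix (DW) and a concept-by-word matrix (CW)
```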
Value
There are four different possible outputs, depending on the options chosen.
By default, when splitstrings = TRUE, simplify = FALSE, the following list of 15 objects is returned. It starts with 8 different character vectors, which are in fact the row/column names of the 7 sparse pattern matrices that follow. The naming of the objects is intended to make everything easy to remember.
doculects |
Character vector with names of doculects in the data |
concepts |
Character vector with names of concepts in the data |
words |
Character vector with all words, i.e. unique counterparts per doculect. The same string in the same doculect is only included once, but an identical string occurring in different doculects is included separately for each doculect. |
segments |
Character vector with all unigram-tokens in order of appearance, including boundary symbols and gap symbols (see splitStrings) |
unigrams |
Character vector with all unique unigrams in the data |
bigrams |
Character vector with all unique bigrams in the data |
graphemes |
Character vector with all unique graphemes (i.e. combinations of unigrams+doculects) occurring in the data |
digraphs |
Character vector with all unique digraphs (i.e. combinations of bigrams+doculects) occurring in the data |
DW |
Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W) |
CW |
Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W) |
SW |
Sparse pattern matrix of class ngCMatrix linking segments (S) to words (W) |
US |
Sparse pattern matrix of class ngCMatrix linking unigrams (U) to segments (S) |
BS |
Sparse pattern matrix of class ngCMatrix linking bigrams (B) to segments (S) |
GS |
Sparse pattern matrix of class ngCMatrix linking graphemes (G) to segments (S) |
TS |
Sparse pattern matrix of class ngCMatrix linking digraphs (T, as D is already used for doculects) to segments (S) |
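These pattern matrices can be combined by matrix multiplication. For example, multiplying DW with the transpose of CW yields a doculect-by-concept matrix counting how many counterparts each doculect has per concept. A toy sketch with hand-built matrices (the real matrices are produced by splitWordlist; only the Matrix package, shipped with R, is assumed):

```r
library(Matrix)

# Toy pattern matrices for 2 doculects, 2 concepts and 3 words.
# DW: which word belongs to which doculect (doculect 1 has words 1-2, doculect 2 has word 3)
DW <- sparseMatrix(i = c(1, 1, 2), j = c(1, 2, 3), dims = c(2, 3))
# CW: which word translates which concept (words 1 and 3 = concept 1, word 2 = concept 2)
CW <- sparseMatrix(i = c(1, 2, 1), j = c(1, 2, 3), dims = c(2, 3))

# doculect-by-concept matrix: number of counterparts per doculect and concept
# (the "* 1" coerces the pattern matrices to numeric before multiplication)
DC <- (DW * 1) %*% t(CW * 1)
```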
When splitstrings = FALSE, simplify = FALSE, only the following objects from the above list are returned:
doculects |
Character vector with names of doculects in the data |
concepts |
Character vector with names of concepts in the data |
words |
Character vector with all words, i.e. unique counterparts per doculect. The same string in the same doculect is only included once, but an identical string occurring in different doculects is included separately for each doculect. |
DW |
Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W) |
CW |
Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W) |
When splitstrings = TRUE, simplify = TRUE, only the bigram-separation is returned, and all row and column names are included in the matrices. However, for reasons of space, the words vector is only included once:
DW |
Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W) |
CW |
Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W) |
BW |
Sparse pattern matrix of class ngCMatrix linking bigrams (B) to words (W) |
Finally, when splitstrings = FALSE, simplify = TRUE, only the following subset of the above is returned:
DW |
Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W) |
CW |
Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W) |
Note
Note that the default behavior probably overgenerates information (specifically when splitstrings = TRUE), and might perform unnecessary computation for specific goals. In practice, it might be useful to tweak the underlying code (mainly by removing unnecessary steps) to optimize performance.
Author(s)
Michael Cysouw
References
Cysouw, Michael & Jeff Good. 2013. Languoid, Doculect, Glossonym: Formalizing the notion “language”. Language Documentation and Conservation 7. 331-359.
Poornima, Shakthi & Jeff Good. 2010. Modeling and Encoding Traditional Wordlists for Machine Applications. Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground.
See Also
sim.wordlist
for various quick similarities that can be computed using these matrices.
Examples
# ----- load data -----
# an example wordlist, see the help(huber) for details
data(huber)
# ----- show output -----
# a selection, to see the result of splitWordlist
# only show the simplified output here,
# the full output is rather long even for just these six words
sel <- c(1:3, 1255:1258)
splitWordlist(huber[sel,], simplify = TRUE)
# ----- split complete data -----
# splitting the complete wordlist is a lot of work!
# it won't get much quicker than this:
# most time goes into the string-splitting of the almost 26,000 words
# Default version, including splitStrings:
system.time( H <- splitWordlist(huber) )
# Simplified version without splitStrings is much quicker:
system.time( H <- splitWordlist(huber, splitstrings = FALSE, simplify = TRUE) )
# ----- investigate colexification -----
# The simple version can be used to check how often two concepts
# are expressed identically across all languages ('colexification')
H <- splitWordlist(huber, splitstrings = FALSE, simplify = TRUE)
sim <- tcrossprod(H$CW*1)
# select only the frequent colexifications for a quick visualisation
diag(sim) <- 0
sim <- drop0(sim, tol = 5)
sim <- sim[rowSums(sim) > 0, colSums(sim) > 0]
## Not run:
# this might lead to errors on some platforms because of non-ASCII symbols
plot( hclust(as.dist(-sim), method = "average"), cex = .5)
## End(Not run)
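The tcrossprod step above can be illustrated with a toy concept-by-word matrix: entry (i, j) of tcrossprod(CW) counts how many words express both concept i and concept j. A hypothetical mini-example, independent of the huber data:

```r
library(Matrix)

# Toy concept-by-word pattern matrix: word 1 expresses both HAND and ARM,
# word 2 expresses only ARM, word 3 expresses only LEG
CW <- sparseMatrix(i = c(1, 2, 2, 3), j = c(1, 1, 2, 3), dims = c(3, 3),
                   dimnames = list(c("HAND", "ARM", "LEG"), NULL))

# concept-by-concept colexification counts
sim <- tcrossprod(CW * 1)
# sim["HAND", "ARM"] is 1: one word colexifies HAND and ARM
```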
# ----- investigate regular sound correspondences -----
# One central problem with data from many languages is orthographic variation
# It is preferable to solve that problem separately,
# e.g. check the column "TOKENS" in the huber data:
# it contains a grapheme-separated version of the counterparts,
# which can be used to investigate the co-occurrence of graphemes (approx. phonemes)
H <- splitWordlist(huber, counterparts = "TOKENS", sep = " ")
# co-occurrence of all pairs of the 2150 different graphemes through all languages
system.time( G <- assocSparse( (H$CW*1) %*% t(H$SW*1) %*% t(H$GS*1), method = poi))
rownames(G) <- colnames(G) <- H$graphemes
G <- drop0(G, tol = 1)
# select only one language pair for a quick visualisation
# check the nice sound changes between bora and muinane!
GD <- H$GS %*% H$SW %*% t(H$DW)
colnames(GD) <- H$doculects
correspondences <- G[GD[,"bora"] > 0, GD[,"muinane"] > 0]
## Not run:
# this might lead to errors on some platforms because of non-ASCII symbols
heatmap(as.matrix(correspondences))
## End(Not run)