WALS {qlcMatrix} | R Documentation |
The World Atlas of Language Structures (WALS)
Description
The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.
The first version of WALS was published as a book with CD-ROM in 2005 by Oxford University Press. The first online version was published in April 2008. The second online version was published in April 2011. The current dataset is WALS 2013, published on 14 November 2013.
The included dataset wals
takes a somewhat sensible selection from the complete WALS data. It excludes attributes ("features" in WALS-parlance) that are definitially duplicates of others (3, 25, 95, 96, 97), those attributes that only list languages that are incompatible with other attributes (132, 133, 134, 135, 139, 140, 141, 142), and the ‘additional’ attributes that are marked as ‘B’ through ‘Z’. Further, it removes those languages that do not have any data left after removing those attributes. The result is a dataset with 2566 languages and 131 attributes.
Usage
data(wals)
Format
A list with two dataframes:
data
the actual WALS data. The object
wals$data
contains a dataframe with data from 2566 languages on 131 different attributes. The column names identify the WALS features. For details about these features, see https://wals.info/chaptermeta
some metadata for the languages. The object
wals$meta
contains a dataframe with some limited meta-information about these 2566 languages.
The three-letter WALS-codes are used as rownames in both dataframes. Further, the object wals$meta
contains the following variables.
name
a character vector giving a name for each language
genus
a factor with 522 levels with the genera according to M. Dryer
family
a factor with 215 levels with the families according to M. Dryer
longitude
a numeric vector with geo coordinates for all languages
latitude
a numeric vector with geo coordinates for all languages
Details
All details about the meaning of the variables and much more meta-information is available at https://wals.info.
Source
The current data was downloaded from https://wals.info in May 2014. The data is licensed as https://creativecommons.org/licenses/by-nc-nd/2.0/de/deed.en. Some minor corrections on the metadata have been performed (naming of variables, addition of missing coordinates).
References
Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at https://wals.info, Accessed on 2013-11-14.)
Examples
data(wals)
# plot all locations of the WALS languages, looks like a world map
plot(wals$meta[,4:5])
# turn the large and mostly empty dataframe into sparse matrices
# recoding is nicely optimized and quick for this reasonably large dataset
# this works perfect as long as things stay within available RAM of the computer
system.time(
W <- splitTable(wals$data)
)
# as an aside: note that the recoding takes only about 30% of the space
as.numeric( object.size(W) / object.size(wals$data) )
# compute similarities (Chuprov's T, similar to Cramer's V)
# between all pairs of variables using sparse Matrix methods
system.time(sim <- sim.att(wals$data, method = "chuprov"))
# some structure visible
rownames(sim) <- colnames(wals$data)
plot(hclust(as.dist(1-sim), method = "ward"), cex = 0.5)