| etymology {languageR} | R Documentation |
Etymological age and regularity in Dutch
Description
Estimated etymological age for regular and irregular monomorphemic Dutch verbs, together with other distributional predictors of regularity.
Usage
data(etymology)
Format
A data frame with 285 observations on the following 14 variables.
Verb
    a factor with the verbs as levels.
WrittenFrequency
    a numeric vector of logarithmically transformed frequencies in written Dutch (as available in the CELEX lexical database).
NcountStem
    a numeric vector for the number of orthographic neighbors.
MeanBigramFrequency
    a numeric vector for mean log bigram frequency.
InflectionalEntropy
    a numeric vector for Shannon's entropy calculated for the word's inflectional variants.
Auxiliary
    a factor with levels hebben, zijn and zijnheb for the verb's auxiliary in the perfect tenses.
Regularity
    a factor with levels irregular and regular.
LengthInLetters
    a numeric vector of the word's orthographic length.
Denominative
    a factor with levels Den and N specifying whether a verb is derived from a noun according to the CELEX lexical database.
FamilySize
    a numeric vector for the number of types in the word's morphological family.
EtymAge
    an ordered factor with levels Dutch, DutchGerman, WestGermanic, Germanic and IndoEuropean.
Valency
    a numeric vector for the verb's valency, estimated by its number of argument structures.
NVratio
    a numeric vector for the log-transformed ratio of the nominal and verbal frequencies of use.
WrittenSpokenRatio
    a numeric vector for the log-transformed ratio of the frequencies in written and spoken Dutch.
References
Baayen, R. H. and Moscoso del Prado Martin, F. (2005) Semantic density and past-tense formation in three Germanic languages, Language, 81, 666-698.
Tabak, W., Schreuder, R. and Baayen, R. H. (2005) Lexical statistics and lexical processing: semantic density, information complexity, sex, and irregularity in Dutch, in Kepser, S. and Reis, M., Linguistic Evidence - Empirical, Theoretical, and Computational Perspectives, Berlin: Mouton de Gruyter, pp. 529-555.
Examples
## Not run:
data(etymology)
# ---- EtymAge should be an ordered factor, set contrasts accordingly
etymology$EtymAge = ordered(etymology$EtymAge, levels = c("Dutch",
  "DutchGerman", "WestGermanic", "Germanic", "IndoEuropean"))
options(contrasts=c("contr.treatment","contr.treatment"))
library(rms)
etymology.dd = datadist(etymology)
options(datadist = 'etymology.dd')
# ---- EtymAge as additional predictor for regularity
etymology.lrm = lrm(Regularity ~ WrittenFrequency +
  rcs(FamilySize, 3) + NcountStem + InflectionalEntropy +
  Auxiliary + Valency + NVratio + WrittenSpokenRatio + EtymAge,
  data = etymology, x = TRUE, y = TRUE)
anova(etymology.lrm)
# ---- EtymAge as dependent variable
etymology.lrm = lrm(EtymAge ~ WrittenFrequency + NcountStem +
  MeanBigramFrequency + InflectionalEntropy + Auxiliary +
  Regularity + LengthInLetters + Denominative + FamilySize + Valency +
  NVratio + WrittenSpokenRatio, data = etymology, x = TRUE, y = TRUE)
# ---- model simplification
etymology.lrm = lrm(EtymAge ~ NcountStem + Regularity + Denominative,
  data = etymology, x = TRUE, y = TRUE)
validate(etymology.lrm, bw=TRUE, B=200)
# ---- plot partial effects and check the assumptions of ordinal regression
plot(Predict(etymology.lrm))
plot(etymology.lrm)
resid(etymology.lrm, 'score.binary', pl = TRUE)
plot.xmean.ordinaly(EtymAge ~ NcountStem, data = etymology)
## End(Not run)
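# ---- illustration only: Shannon's entropy over inflectional variants
# A minimal sketch of the kind of quantity stored in InflectionalEntropy.
# The variant frequencies below are made up for illustration, and the
# base-2 logarithm is an assumption; neither is taken from the data set
# or from the cited papers.
variant.freq = c(loop = 300, loopt = 180, lopen = 120, liep = 90,
  liepen = 40, gelopen = 70)
# relative frequency of each inflectional variant
p = variant.freq / sum(variant.freq)
# entropy is highest when the variants are equally frequent
inflectional.entropy = -sum(p * log2(p))
inflectional.entropy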