zipfR-package {zipfR} | R Documentation |
zipfR: lexical statistics in R
Description
The zipfR package performs Large-Number-of-Rare-Events (LNRE) modeling of (linguistic) type frequency distributions (Baayen 2001) and provides utilities to run various forms of lexical statistics analysis in R.
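To make the terminology concrete, the following base-R sketch (it does not use zipfR) derives a type frequency list, the sample size N, the vocabulary size V, the frequency spectrum, and the hapax count V1 from a small token sample. The token vector and all variable names are made up for illustration; zipfR provides its own data structures and accessors for these quantities.

```r
## Toy sample of word tokens (made up for illustration)
tokens <- c("the", "cat", "saw", "the", "dog", "and", "the", "dog", "ran")

## Type frequency list: how often each distinct type occurs
type.freq <- table(tokens)

## Sample size N (number of tokens) and vocabulary size V (number of types)
N.toy <- length(tokens)
V.toy <- length(type.freq)

## Frequency spectrum: number of types V_m that occur exactly m times
spectrum <- table(as.integer(type.freq))

## Number of hapax legomena V_1 (types that occur exactly once)
V1.toy <- sum(type.freq == 1)

print(spectrum)
```

The spectrum is the compact summary that LNRE models are fitted to: it records only how many types share each frequency, not which types they are.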
Details
The best way to get started with zipfR is to read the tutorial, which is available as a package vignette via the HTML documentation and can also be downloaded from https://zipfr.r-forge.r-project.org/#start
zipfR is released under the GNU General Public License (http://www.gnu.org/copyleft/gpl.html).
Author(s)
Stefan Evert <stefan.evert@fau.de> and Marco Baroni <marco.baroni@unitn.it>
Maintainer: Stefan Evert <stefan.evert@fau.de>
References
zipfR Website: https://zipfR.r-forge.r-project.org/
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.
Baroni, Marco (2008). Distributions in text. In: A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, article 37. Mouton de Gruyter, Berlin.
Evert, Stefan (2004a). A simple LNRE model for random character sequences. Proceedings of JADT 2004, 411-422.
Evert, Stefan (2004b). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart. URN urn:nbn:de:bsz:93-opus-23714 http://dx.doi.org/10.18419/opus-2556
Evert, Stefan and Baroni, Marco (2006). Testing the extrapolation quality of word frequency models. Proceedings of Corpus Linguistics 2005.
Evert, Stefan and Baroni, Marco (2006). The zipfR library: Words and other rare events in R. useR! 2006: The second R user conference.
See Also
The zipfR tutorial: available as a package vignette and online from https://zipfr.r-forge.r-project.org/#start.
Some good entry points into the zipfR documentation are spc, vgc, tfl, read.spc, read.tfl, read.vgc, lnre, lnre.vgc, plot.spc, and plot.vgc.
Harald Baayen's LEXSTATS tools, which implement a wider range of LNRE models: https://www.springer.com/de/book/9780792370178
Stefan Evert's UCS tools for collocation analysis, which include functions that have been integrated into zipfR: http://www.collocations.de/software.html
Examples
## load Oliver Twist and Great Expectations frequency spectra
data(DickensOliverTwist.spc)
data(DickensGreatExpectations.spc)
## check sample size, vocabulary size, and hapax counts
N(DickensOliverTwist.spc)
V(DickensOliverTwist.spc)
Vm(DickensOliverTwist.spc,1)
N(DickensGreatExpectations.spc)
V(DickensGreatExpectations.spc)
Vm(DickensGreatExpectations.spc,1)
## compute binomially interpolated growth curves
ot.vgc <- vgc.interp(DickensOliverTwist.spc,(1:100)*1570)
ge.vgc <- vgc.interp(DickensGreatExpectations.spc,(1:100)*1865)
## plot them
plot(ot.vgc,ge.vgc,legend=c("Oliver Twist","Great Expectations"))
## load frequency spectrum of Dickens' works
data(Dickens.spc)
## estimate Zipf-Mandelbrot model from Dickens data
## and look at model summary
zm <- lnre("zm",Dickens.spc)
zm
## plot observed and expected spectrum
zm.spc <- lnre.spc(zm,N(Dickens.spc))
plot(Dickens.spc,zm.spc)
## obtain expected V and V1 values at arbitrary sample sizes
EV(zm,1e+8)
EVm(zm,1,1e+8)
## generate expected V and V1 growth curves up to a sample size
## of 10 million tokens and plot them, with vertical line at
## estimation size
ext.vgc <- lnre.vgc(zm,(1:100)*1e+5,m.max=1)
plot(ext.vgc,N0=N(zm),add.m=1)
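As a quick sanity check on the fitted model, one might compare the observed V and V1 with the model's expectations at the estimation size. The sketch below uses only functions that already appear in the examples above; the variable names N0, V.cmp, and V1.cmp are introduced here for illustration, and the zipfR package is assumed to be installed.

```r
library(zipfR)

## load the Dickens spectrum and estimate the Zipf-Mandelbrot model
data(Dickens.spc)
zm <- lnre("zm", Dickens.spc)

## estimation size: sample size of the observed spectrum
N0 <- N(Dickens.spc)

## observed vs. expected vocabulary size at N0
V.cmp <- c(observed = V(Dickens.spc), expected = EV(zm, N0))

## observed vs. expected number of hapax legomena at N0
V1.cmp <- c(observed = Vm(Dickens.spc, 1), expected = EVm(zm, 1, N0))

print(V.cmp)
print(V1.cmp)
```

Large discrepancies at the estimation size would suggest that the model fits the observed spectrum poorly, which in turn makes extrapolations such as the growth curve above less trustworthy.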