| vec2xxx {zipfR} | R Documentation |
Type-Token Statistics for Samples and Empirical Data (zipfR)
Description
Compute type-frequency list, frequency spectrum and vocabulary growth curve from a token vector representing a random sample or an observed sequence of tokens.
Usage
vec2tfl(x)
vec2spc(x)
vec2vgc(x, steps=200, stepsize=NA, m.max=0)
Arguments
x |
a vector of length |
steps |
number of steps for which vocabulary growth data
|
stepsize |
alternative way of specifying the steps of the
vocabulary growth curve. In this case, vocabulary growth data will
be calculated every stepsize tokens |
m.max |
an integer in the range 1 ... 9, specifying how many
spectrum elements to include in the vocabulary growth curve |
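The role of the step arguments can be illustrated with a minimal base R sketch. It is not the zipfR implementation: the toy token vector, the variable names and the rule for placing the steps (counting back from the full sample) are assumptions made for illustration only.

```r
# Toy token vector (hypothetical data)
tokens <- c("a", "b", "a", "c", "b", "d", "a", "e", "f", "a")
N0 <- length(tokens)

# V(N): running count of distinct types after the first N tokens
V_running <- cumsum(!duplicated(tokens))

# Assumed step placement: evaluate V(N) every `stepsize` tokens,
# counting back from N0 so that the last step is the full sample
stepsize <- 3
N_steps <- rev(seq(from = N0, to = 1, by = -stepsize))

# Vocabulary growth data as a plain data frame (not a vgc object)
growth <- data.frame(N = N_steps, V = V_running[N_steps])
print(growth)
```

In practice vec2vgc does this work and returns a proper vgc object; the sketch only shows what quantity is sampled at each step.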
Details
There are two main applications for the vec2xxx functions:
- a)
They can be used to calculate type-token statistics and vocabulary growth curves for random samples generated from a LNRE model (with the rlnre function).
- b)
They provide an easy way to process a user's own data without having to rely on external scripts to compute frequency spectra and vocabulary growth curves. All that is needed is a text file in one-token-per-line format (i.e. where each token is given on a separate line). See "Examples" below for further hints.
Both applications work well for samples of up to approx. 1 million
tokens. For considerably larger data sets, specialized external
software should be used, such as the Perl scripts provided on the
zipfR homepage.
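To make the quantities involved concrete, the following base R sketch computes a type frequency list and a frequency spectrum by hand for a toy token vector. It is an illustration under assumptions (the data and variable names are hypothetical) and produces plain tables rather than tfl or spc objects; it does not claim to mirror how zipfR computes them.

```r
# Toy token vector; real input would come from rlnre() or readLines()
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")

# Type frequency list: frequency of each type, most frequent first
type_freq <- sort(table(tokens), decreasing = TRUE)

# Frequency spectrum: V_m = number of types that occur exactly m times
spectrum <- table(as.vector(type_freq))
m  <- as.integer(names(spectrum))
Vm <- as.vector(spectrum)

N <- length(tokens)     # sample size (number of tokens)
V <- length(type_freq)  # vocabulary size (number of types)

# Consistency checks: the spectrum accounts for all types and all tokens
stopifnot(sum(Vm) == V, sum(m * Vm) == N)
```

The two identities checked at the end hold for any token vector, which makes them a convenient sanity check on data loaded from a file.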
Value
An object of class tfl, spc or vgc, representing
the type frequency list, frequency spectrum or vocabulary growth curve
of the token vector x, respectively.
See Also
tfl, spc and vgc for more
information about type frequency lists, frequency spectra and
vocabulary growth curves
rlnre for generating random samples (in the form of the
required token vectors) from a LNRE model
readLines and scan for loading token
vectors from disk files
Examples
## type-token statistics for random samples from a LNRE distribution
model <- lnre("fzm", alpha=.5, A=1e-6, B=.05)
x <- rlnre(model, 100000)
vec2tfl(x)
vec2spc(x) # same as tfl2spc(vec2tfl(x))
vec2vgc(x)
sample.spc <- vec2spc(x)
exp.spc <- lnre.spc(model, 100000)
plot(exp.spc, sample.spc)
sample.vgc <- vec2vgc(x, m.max=1, steps=500)
exp.vgc <- lnre.vgc(model, N=N(sample.vgc), m.max=1)
plot(exp.vgc, sample.vgc, add.m=1)
## Not run:
## load token vector from a file in one-token-per-line format
x <- readLines(filename)
x <- readLines(file.choose()) # with file selection dialog
## you can also perform whitespace tokenization and filter the data
brown <- scan("brown.pos", what=character(0), quote="")
nouns <- grep("/NNS?$", brown, value=TRUE)
plot(vec2spc(nouns))
plot(vec2vgc(nouns, m.max=1), add.m=1)
## End(Not run)
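The one-token-per-line format can be tried out without any external data. The sketch below writes a small hypothetical file to a temporary path, reads it back with readLines and counts tokens and types in base R; dropping empty lines is an assumption about what is wanted, not something the vec2xxx functions are documented to do.

```r
# Hypothetical data: a small one-token-per-line file at a temporary path
path <- tempfile(fileext = ".txt")
writeLines(c("the", "cat", "sat", "on", "the", "mat"), path)

x <- readLines(path)  # one vector element per line, i.e. per token
x <- x[nzchar(x)]     # assumption: empty lines are not tokens

n_tokens <- length(x)          # number of tokens
n_types  <- length(unique(x))  # number of distinct types

unlink(path)  # remove the temporary file
```

The resulting character vector x has the form expected by vec2tfl, vec2spc and vec2vgc.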