vec2xxx {zipfR} | R Documentation
Type-Token Statistics for Samples and Empirical Data (zipfR)
Description
Compute type-frequency list, frequency spectrum and vocabulary growth curve from a token vector representing a random sample or an observed sequence of tokens.
Usage
vec2tfl(x)
vec2spc(x)
vec2vgc(x, steps=200, stepsize=NA, m.max=0)
Arguments
x | a vector of length |
steps | number of steps for which vocabulary growth data |
stepsize | alternative way of specifying the steps of the vocabulary growth curve. In this case, vocabulary growth data will be calculated every |
m.max | an integer in the range 1 ... 9, specifying how many spectrum elements |
Details
There are two main applications for the vec2xxx functions:
- a) They can be used to calculate type-token statistics and vocabulary growth curves for random samples generated from a LNRE model (with the rlnre function).
- b) They provide an easy way to process a user's own data without having to rely on external scripts to compute frequency spectra and vocabulary growth curves. All that is needed is a text file in one-token-per-line format (i.e. where each token is given on a separate line). See "Examples" below for further hints.
Both applications work well for samples of up to approx. 1 million tokens. For considerably larger data sets, specialized external software should be used, such as the Perl scripts provided on the zipfR homepage.
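The one-token-per-line format mentioned under b) can be illustrated with a small round trip through a temporary file. This is a minimal sketch in base R only; the token values and the temporary file are made up for illustration and are not part of zipfR.

```r
## Hypothetical toy data: a short sequence of tokens
tokens <- c("the", "cat", "saw", "the", "dog", "and", "the", "bird")

## Write one token per line to a temporary file, then read it back
tmp <- tempfile(fileext = ".txt")
writeLines(tokens, tmp)
x <- readLines(tmp)
unlink(tmp)  # remove the temporary file again

## The token vector survives the round trip unchanged
stopifnot(identical(x, tokens))

n.tokens <- length(x)          # sample size (number of tokens)
n.types  <- length(unique(x))  # vocabulary size (number of distinct types)
```

A character vector read in this way is the kind of token vector that vec2tfl, vec2spc and vec2vgc expect as their argument x.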
Value
An object of class tfl, spc or vgc, representing the type frequency list, frequency spectrum or vocabulary growth curve of the token vector x, respectively.
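To make the three kinds of statistics concrete, the following base R sketch computes the underlying counts for a tiny token vector. It only illustrates the definitions; it is not the zipfR implementation, it does not produce tfl, spc or vgc objects, and the variable names are chosen for this example.

```r
## Hypothetical toy token vector
x <- c("a", "b", "a", "c", "a", "b", "d")

## Type frequency list: every type together with its frequency
type.freq <- sort(table(x), decreasing = TRUE)

## Frequency spectrum: number of types that occur exactly m times
spectrum <- table(as.vector(type.freq))

## Vocabulary growth: number of distinct types among the first N tokens,
## for N = 1, ..., length(x)
V <- cumsum(!duplicated(x))
```

Here type "a" occurs three times, two types occur exactly once, and the vocabulary grows to four types after all seven tokens have been read.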
See Also
tfl, spc and vgc for more information about type frequency lists, frequency spectra and vocabulary growth curves
rlnre for generating random samples (in the form of the required token vectors) from a LNRE model
readLines and scan for loading token vectors from disk files
Examples
## type-token statistics for random samples from a LNRE distribution
model <- lnre("fzm", alpha=.5, A=1e-6, B=.05)
x <- rlnre(model, 100000)
vec2tfl(x)
vec2spc(x) # same as tfl2spc(vec2tfl(x))
vec2vgc(x)
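## compare the sample frequency spectrum with the model expectation
sample.spc <- vec2spc(x)
exp.spc <- lnre.spc(model, 100000)
plot(exp.spc, sample.spc)
## compare sample and expected vocabulary growth curves, including V1
sample.vgc <- vec2vgc(x, m.max=1, steps=500)
exp.vgc <- lnre.vgc(model, N=N(sample.vgc), m.max=1)
plot(exp.vgc, sample.vgc, add.m=1)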
## Not run:
## load token vector from a file in one-token-per-line format
x <- readLines(filename)
x <- readLines(file.choose()) # with file selection dialog
## you can also perform whitespace tokenization and filter the data
brown <- scan("brown.pos", what=character(0), quote="")
nouns <- grep("/NNS?$", brown, value=TRUE)
plot(vec2spc(nouns))
plot(vec2vgc(nouns, m.max=1), add.m=1)
## End(Not run)
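The filtering step in the last example relies on a regular expression over word/tag tokens. The sketch below shows what that pattern selects, using a small made-up vector in place of the brown.pos file; the tokens are assumptions for illustration only.

```r
## Hypothetical tokens in word/TAG format, as produced by whitespace tokenization
toks <- c("dog/NN", "dogs/NNS", "run/VB", "Paris/NNP", "the/DT")

## "/NNS?$" keeps tokens whose tag is exactly NN or NNS (common nouns);
## NNP is excluded because the pattern is anchored at the end of the token
nouns <- grep("/NNS?$", toks, value = TRUE)
```

The resulting character vector contains only "dog/NN" and "dogs/NNS" and can be passed on to vec2spc or vec2vgc like any other token vector.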