productivity.measures {zipfR} | R Documentation |
Measures of Productivity and Lexical Richness (zipfR)
Description
Compute various measures of productivity and lexical richness from an observed frequency spectrum or type-frequency list, from an observed vocabulary growth curve, or from a vector of tokens.
Usage
productivity.measures(obj, measures, data.frame=TRUE, ...)
## S3 method for class 'tfl'
productivity.measures(obj, measures, data.frame=TRUE, ...)
## S3 method for class 'spc'
productivity.measures(obj, measures, data.frame=TRUE, ...)
## S3 method for class 'vgc'
productivity.measures(obj, measures, data.frame=TRUE, ...)
## Default S3 method:
productivity.measures(obj, measures, data.frame=TRUE, ...)
Arguments
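obj |
a suitable data object from which productivity measures can be computed. Currently either a frequency spectrum (of class spc), a type-frequency list (of class tfl), a vocabulary growth curve (of class vgc), or a vector of tokens. |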
measures |
character vector naming the productivity measures to be computed (see "Productivity Measures" below). Names may be abbreviated as long as they remain unique. If unspecified, all supported measures are computed. |
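data.frame |
if TRUE (the default), the result is returned as a data frame; otherwise as a numeric vector or matrix (see "Value" below) |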
... |
additional arguments passed on to the method implementations (currently, no further arguments are recognized) |
Details
This function computes productivity measures based on an observed frequency spectrum, type-frequency list or vocabulary growth curve.
If an expected spectrum or VGC is passed, the expectations E[V], E[V_m] will simply be substituted for the sample values V, V_m in the equations. In most cases, this does not yield the expected value of the productivity measure!
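To see why, consider Herdan's C (defined under "Productivity Measures") at a fixed sample size N. As a sketch, treating V as the only random quantity: the logarithm is concave, so Jensen's inequality gives

```latex
% N is fixed; V is the (random) vocabulary size of a sample of N tokens
\mathrm{E}[C] = \frac{\mathrm{E}[\log V]}{\log N} \le \frac{\log \mathrm{E}[V]}{\log N}
```

with equality only if V is constant. Substituting E[V] into the formula therefore gives an upper bound on E[C], not E[C] itself.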
Some measures can only be computed from a complete frequency spectrum. They will return NA if obj is an incomplete spectrum or type-frequency list, an expected spectrum, or a vocabulary growth curve.
Some other measures can only be computed if a sufficient number of spectrum elements is included in a vocabulary growth curve (usually at least V_1 and V_2), and will return NA otherwise.
Such limitations are indicated in the list of measures below (unless spectrum elements V_1 and V_2 are sufficient).
Value
If obj is a frequency spectrum, type-frequency list or token vector:
A numeric vector of the same length as measures with the corresponding observed values of the productivity measures. If data.frame=TRUE (the default), a single-row data frame is returned.
If obj is a vocabulary growth curve:
A numeric matrix with columns corresponding to the selected productivity measures and rows corresponding to the sample sizes of the vocabulary growth curve. If data.frame=TRUE (the default), the matrix is converted to a data frame.
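The two return shapes for a frequency spectrum can be inspected as follows. This is a minimal sketch that assumes the zipfR package is installed; the token vector is made up for illustration, and the structure of the output is inferred from the description above rather than checked against the package.

```r
library(zipfR)  # assumption: the zipfR package is installed

# Made-up toy sample (not taken from any zipfR data set)
tokens <- c("a", "b", "a", "c", "d", "a", "b", "e")
spc <- vec2spc(tokens)  # frequency spectrum of the token vector

# Default: data.frame=TRUE, documented to return a single-row data frame
res.df <- productivity.measures(spc, measures = c("V", "TTR", "P"))

# data.frame=FALSE, documented to return a numeric vector, one value per measure
res.vec <- productivity.measures(spc, measures = c("V", "TTR", "P"),
                                 data.frame = FALSE)

str(res.df)
str(res.vec)
```

The data frame form is convenient for stacking results from several samples with rbind(), as in the Examples section.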
Productivity Measures
The following productivity measures are currently supported:
V
:-
the total number of types V
TTR
:-
the type-token ratio TTR = V / N
R
:-
Guiraud's (1954) R = V / \sqrt{N}. An equivalent measure is Carroll's (1964) CTTR = R / \sqrt{2}.
C
:-
Herdan's (1964) C = \frac{ \log V }{ \log N }
k
:-
Dugast's (1979) k = \frac{ \log V }{ \log \log N}
U
:-
Dugast's (1978, 1979) U = \frac{ (\log N)^2 }{ \log N - \log V}. Maas (1972) proposed an equivalent measure a^2 = 1 / U.
W
:-
Brunet's (1978) W = N ^ {V ^ {-a}} with a = 0.172.
P
:-
Baayen's (1991) productivity index P = \frac{V_1}{N}, which corresponds to the slope of the vocabulary growth curve (under random sampling assumptions).
Hapax
:-
the proportion of hapax legomena \frac{V_1}{V}, which is a direct estimate for the parameter \alpha = 1 / a of a population following the Zipf-Mandelbrot law (Evert 2004b: 130).
H
:-
Honoré's (1979) H = 100 \frac{ \log N }{ 1 - V_1 / V }, a transformation of the proportion of hapax legomena adjusted for sample size.
S
:-
Sichel's (1975) S = V_2 / V, i.e. the proportion of dis legomena. Michéa's (1969, 1971) M = 1 / S is an equivalent measure.
alpha2
:-
Evert's \alpha_2 = 1 - 2 \frac{V_2}{V_1} is another direct estimate for the parameter \alpha = 1 / a of a Zipf-Mandelbrot population (Evert 2004b: 127).
K
:-
Yule's (1944) K = 10^4 \cdot \frac{ \sum_m m^2 V_m - N}{ N^2 } (only for complete frequency spectrum or type-frequency list). Herdan (1955) proposes an almost equivalent measure v_m \approx \sqrt{K} based on a different derivation. Both measures converge for large N and V. Yule's K is almost identical to Simpson's D and is an unbiased estimator for the same population coefficient \delta under an independent Poisson sampling scheme. A measure of lexical poverty, i.e. smaller values correspond to higher productivity.
D
:-
Simpson's (1949) D = \sum_m V_m \frac{m}{N}\cdot \frac{m-1}{N-1} (only for complete frequency spectrum or type-frequency list) is a slightly modified version of Yule's K. This measure is an unbiased estimator for a population coefficient \delta, representing the probability of picking the same type twice in two consecutive draws from the population. A measure of lexical poverty, i.e. smaller values correspond to higher productivity.
Entropy
:-
Entropy of the sample frequency distribution, -\sum_m V_m \frac{m}{N} \log_2 \frac{m}{N} (only for complete frequency spectrum or type-frequency list). This is not a reliable estimator of population entropy. It is therefore not recommended as a productivity measure and has only been included for evaluation studies. A measure of lexical poverty, i.e. smaller values correspond to higher productivity.
eta
:-
Normalised entropy or evenness \eta = \textrm{Entropy} / \log_2 V (only for complete frequency spectrum or type-frequency list), where \log_2 V is the largest possible value for a sample with the observed vocabulary size (obtained for a uniform distribution). Therefore, 0 \le \eta \le 1. Not recommended as a productivity measure because it is expected to produce erratic and counterintuitive results.
See Sec. 2.1 of the technical report Inside zipfR for further details and references.
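The formulas above can be checked by hand on a small sample. The following is a minimal base R sketch, not the zipfR implementation; the token vector and all variable names are made up for illustration, and only a subset of the measures is shown.

```r
# Made-up toy sample: a x3, b x2, c, d, e once each
tokens <- c("a", "b", "a", "c", "d", "a", "b", "e")

freq <- table(tokens)        # type frequencies
N <- sum(freq)               # sample size (number of tokens): 8
V <- length(freq)            # vocabulary size (number of types): 5

spc <- table(freq)           # frequency spectrum: how many types occur m times
m  <- as.integer(names(spc)) # frequency classes m = 1, 2, 3
Vm <- as.vector(spc)         # V_1 = 3, V_2 = 1, V_3 = 1
V1 <- sum(freq == 1)         # hapax legomena
V2 <- sum(freq == 2)         # dis legomena

p <- m / N                   # relative frequency of a type in class m
entropy <- -sum(Vm * p * log2(p))

measures <- c(
  V       = V,
  TTR     = V / N,
  R       = V / sqrt(N),
  C       = log(V) / log(N),
  P       = V1 / N,
  Hapax   = V1 / V,
  S       = V2 / V,
  alpha2  = 1 - 2 * V2 / V1,
  K       = 1e4 * (sum(m^2 * Vm) - N) / N^2,
  D       = sum(Vm * (m / N) * ((m - 1) / (N - 1))),
  Entropy = entropy,
  eta     = entropy / log2(V)
)
print(round(measures, 4))
```

A sample this small is useful only for checking the arithmetic; the values say nothing about productivity.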
References
Evert, Stefan (2004b). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart. URN urn:nbn:de:bsz:93-opus-23714 http://dx.doi.org/10.18419/opus-2556
See Also
lnre.productivity.measures for parametric bootstrapping and approximate expectations of productivity measures in random samples from an LNRE population.
Examples
rbind(
  AllTexts=productivity.measures(Brown.spc),
  Fiction=productivity.measures(BrownImag.spc),
  NonFiction=productivity.measures(BrownInform.spc))
## can be applied to token vector, type-frequency list, or frequency spectrum
bar.vec <- EvertLuedeling2001$bar
bar1 <- productivity.measures(bar.vec) # token vector
bar2 <- productivity.measures(vec2tfl(bar.vec)) # type-frequency list
bar3 <- productivity.measures(vec2spc(bar.vec)) # frequency spectrum
print(rbind(tokens=bar1, tfl=bar2, spc=bar3))
## sample-size dependency of productivity measures in Brown corpus
## (note that only a subset of the measures can be computed)
n <- c(10e3, 50e3, 100e3, 200e3, 500e3, 1e6)
idx <- N(Brown.emp.vgc) %in% n
my.vgc <- vgc(N=N(Brown.emp.vgc)[idx],
              V=V(Brown.emp.vgc)[idx],
              Vm=list(Vm(Brown.emp.vgc, 1)[idx]))
print(my.vgc) # since we don't have a subset method for VGCs yet
productivity.measures(my.vgc)
productivity.measures(my.vgc, measures=c("TTR", "P")) # selected measures
## parametric bootstrapping to obtain sampling distribution of measures
## (much easier with ?lnre.productivity.measures)
model <- lnre("zm", spc=ItaRi.spc) # realistic LNRE model
res <- lnre.bootstrap(model, 1e6, ESTIMATOR=identity,
                      STATISTIC=productivity.measures)
bootstrap.confint(res, method="normal")