| lnre.bootstrap {zipfR} | R Documentation |
Parametric bootstrapping for LNRE models (zipfR)
Description
This function implements parametric bootstrapping for LNRE models, i.e. it draws a specified number of random samples from the population described by a given lnre object. For each sample, two callback functions are applied to perform transformations and/or extract statistics. In an important application (bootstrapped confidence intervals for model parameters), the first callback estimates a new LNRE model and the second callback extracts the relevant parameters from this model. See ‘Use Cases’ and ‘Examples’ below for other use cases.
Usage
lnre.bootstrap(model, N, ESTIMATOR, STATISTIC,
               replicates=100, sample=c("spc", "tfl", "tokens"),
               simplify=TRUE, verbose=TRUE, parallel=1L, seed=NULL, ...)
Arguments
model |
a trained LNRE model, i.e. an object belonging to a subclass of lnre |
N |
a single positive integer, specifying the size (i.e. number of tokens) of each bootstrap sample |
ESTIMATOR |
a callback function, normally used for estimating LNRE models in the bootstrap procedure. It is called once for each bootstrap sample with the sample as first argument (in the form determined by sample) |
STATISTIC |
a callback function, normally used to extract model parameters and other relevant statistics from the bootstrapped LNRE models. It is called once for each bootstrap sample, with the value returned by ESTIMATOR as its argument |
replicates |
a single positive integer, specifying the number of bootstrap samples to be generated |
sample |
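the form in which each sample is passed to ESTIMATOR: a frequency spectrum ("spc"), a type-frequency list ("tfl"), or a factor vector of tokens ("tokens"). Alternatively, a callback function that will be invoked with arguments model and n and has to return a sample of size n in the format expected by ESTIMATOR |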
simplify |
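if TRUE, the statistics obtained from all bootstrap samples are combined with rbind(), usually into a matrix or data frame; if FALSE, they are returned as a list (see 'Value') |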
verbose |
if TRUE, show a progress bar in the R console during the bootstrapping procedure |
parallel |
whether to enable parallel processing. Either an integer specifying the number of worker processes to be forked, or a pre-initialised snow cluster created with makeCluster. See 'Parallelisation' below. |
seed |
a single integer value used to initialize the RNG in order to generate reproducible results |
... |
any further arguments are passed through to the ESTIMATOR callback |
Details
The parametric bootstrapping procedure works as follows:
1. replicates random samples of N tokens each are drawn from the population described by the LNRE model model (possibly using a callback function provided in argument sample).
2. Each sample is passed to the callback function ESTIMATOR in the form determined by sample (a frequency spectrum, type-frequency list, or factor vector of tokens). If ESTIMATOR fails, it is re-run with a different sample; otherwise the return value is passed on to STATISTIC. Use ESTIMATOR=identity to pass the original sample through to STATISTIC.
3. The callback function STATISTIC is used to extract relevant information for each sample. If STATISTIC fails, the procedure is repeated from step 2 with a different sample. The callback will typically return a vector of fixed length or a single-row data frame, and the results for all bootstrap samples are combined into a matrix or data frame if simplify=TRUE.
Warning: Keep in mind that sampling a token vector can be slow and consume large amounts of memory for very large N (several million tokens). If possible, use sample="spc" or sample="tfl", which can be generated more efficiently.
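As a rough illustration of the three steps, the following sketch re-implements the procedure sequentially for samples in the form of a frequency spectrum. It is not the code used by lnre.bootstrap: the helper manual.bootstrap and its max.errors guard are hypothetical names introduced here, and the sketch ignores the other sample formats, parallelisation and seeding.

```r
library(zipfR)

model <- lnre("zm", spc=ItaRi.spc)   # same model as in the Examples section

## Hypothetical helper (not part of zipfR): simplified sequential bootstrap.
## A callback that returns NULL is treated like a failure in this sketch.
manual.bootstrap <- function (model, N, ESTIMATOR, STATISTIC, replicates=10,
                              max.errors=100, ...) {
  results <- vector("list", replicates)
  errors <- 0
  done <- 0
  while (done < replicates) {
    x <- vec2spc(rlnre(model, N))          # step 1: draw a random sample of N tokens
    stat <- tryCatch({
      est <- ESTIMATOR(x, ...)             # step 2: transform sample / estimate model
      STATISTIC(est)                       # step 3: extract statistics
    }, error=function (e) NULL)
    if (is.null(stat)) {
      errors <- errors + 1                 # failure: try again with a new sample
      if (errors > max.errors) stop("too many failed bootstrap samples")
    } else {
      done <- done + 1
      results[[done]] <- stat
    }
  }
  ## combine per-sample statistics row-wise and record the number of failures
  structure(do.call(rbind, results), errors=errors)
}

set.seed(42)
res <- manual.bootstrap(model, N=1000, ESTIMATOR=identity,
                        STATISTIC=function (x) c(V=V(x), V1=Vm(x, 1)))
```

Each row of res holds the statistics for one successful bootstrap sample; failed samples are replaced rather than dropped, so the result always has the requested number of rows.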
Parallelisation
Since bootstrapping is a computationally expensive procedure, it is usually desirable to use parallel processing. lnre.bootstrap supports two types of parallelisation, based on the parallel package:
- On Unix platforms, you can set parallel to an integer number in order to fork the specified number of worker processes, utilising multiple cores on the same machine. The detectCores function shows how many cores are available, but due to hyperthreading and memory contention, it is often better to set parallel to a smaller value. Note that forking may be unstable, especially in a GUI environment, as explained on the mcfork manpage.
- On all platforms, you can pass a pre-initialised snow cluster in the parallel argument, which consists of worker processes on the same machine or on different machines. A suitable cluster can be created with makeCluster; see the parallel package documentation for further information. It is your responsibility to set up the cluster so that all required data sets, packages and custom functions are available on the worker processes; lnre.bootstrap will only ensure that the zipfR package itself is loaded.
Note that parallel processing is not enabled by default and will only be used if parallel is set accordingly.
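The following sketch shows one way to prepare a cluster when the STATISTIC callback depends on a custom function. The helper hapax.ratio is a hypothetical example introduced here; clusterExport from the parallel package is used to copy it to the workers, on the assumption that the callback looks it up in the global environment of each worker process.

```r
library(zipfR)
library(parallel)

model <- lnre("zm", spc=ItaRi.spc)

## hypothetical user-defined helper that the STATISTIC callback relies on
hapax.ratio <- function (x) Vm(x, 1) / V(x)

cl <- makeCluster(2)                                    # two local worker processes
clusterExport(cl, "hapax.ratio", envir=environment())   # copy the helper to all workers
res <- lnre.bootstrap(model, N=1000, replicates=20,
                      ESTIMATOR=identity,
                      STATISTIC=function (x) c(V=V(x), hapax=hapax.ratio(x)),
                      parallel=cl, verbose=FALSE)
stopCluster(cl)                                         # always release the workers
```

Additional packages needed by the callbacks could be loaded on the workers in a similar way with clusterEvalQ.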
Value
If simplify=FALSE, a list of length replicates containing the statistics obtained from each individual bootstrap sample. In addition, the following attributes are set:
- N = sample size of the bootstrap replicates
- model = the LNRE model from which samples were generated
- errors = number of samples for which either the ESTIMATOR or the STATISTIC callback produced an error
If simplify=TRUE, the statistics are combined with rbind(). This is performed unconditionally, so make sure that STATISTIC returns a suitable value for all samples, typically vectors of the same length or single-row data frames with the same columns.
The return value is usually a matrix or data frame with replicates rows. No additional attributes are set.
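A minimal sketch of working with the list returned for simplify=FALSE, assuming the attribute names listed above. The final step combines the per-sample statistics by hand in the same way that simplify=TRUE is described to do.

```r
library(zipfR)

model <- lnre("zm", spc=ItaRi.spc)
res <- lnre.bootstrap(model, N=1000, replicates=20, simplify=FALSE,
                      ESTIMATOR=identity,
                      STATISTIC=function (x) c(V=V(x), V1=Vm(x, 1)),
                      verbose=FALSE, seed=42)

length(res)             # one list element per bootstrap replicate
attr(res, "N")          # sample size of the replicates
attr(res, "errors")     # number of samples for which a callback failed

stats <- do.call(rbind, res)   # matrix with one row per replicate
```

Keeping the list form is mainly useful when the attributes are needed or when the individual results cannot be combined row-wise.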
Use cases
- Bootstrapped confidence intervals for model parameters:
  The confint method for LNRE models uses bootstrapping to estimate confidence intervals for the model parameters.
  For this application, ESTIMATOR=lnre re-estimates the LNRE model from each bootstrap sample. Configuration options such as the model type, cost function, etc. are passed as additional arguments in ..., and the sample must be provided in the form of a frequency spectrum. The return values are successfully estimated LNRE models.
  STATISTIC extracts the model parameters and other coefficients of interest (such as the population diversity S) from each model and returns them as a named vector or single-row data frame. The results are combined with simplify=TRUE, then empirical confidence intervals are determined for each column.
- Empirical sampling distribution of productivity measures:
  For some of the more complex measures of productivity and lexical richness (see productivity.measures), it is difficult to estimate the sampling distribution mathematically. In these cases, an empirical approximation can be obtained by parametric bootstrapping.
  The most convenient approach is to set ESTIMATOR=productivity.measures, so the desired measures can be passed as an additional argument measures= to lnre.bootstrap. The default sample="spc" is appropriate for most measures and is efficient enough to carry out the procedure for multiple sample sizes.
  Since the estimator already returns the required statistics for each sample in a suitable format, set STATISTIC=identity and simplify=TRUE.
- Empirical prediction intervals for vocabulary growth curves:
  Vocabulary growth curves can only be generated from token vectors, so set sample="tokens" and keep N reasonably small.
  ESTIMATOR=vec2vgc compiles vgc objects for the samples. Pass steps or stepsize as desired and set m.max if growth curves for V_1, V_2, ... are desired.
  Either use STATISTIC=identity and simplify=FALSE to return a list of vgc objects, which can be plotted or processed further with sapply(). This strategy is particularly useful if one or more V_m are desired in addition to V.
  Or use STATISTIC=function (x) x$V to extract y-coordinates for the growth curve and combine them into a matrix with simplify=TRUE, so that prediction intervals can be computed directly. Note that the corresponding x-coordinates are not returned and have to be inferred from N and stepsize.
- Simulating non-randomness and mixture distributions:
  More complex populations and non-random samples can be simulated by providing a user callback function in the sample argument. This callback is invoked with parameters model and n and has to return a sample of size n in the format expected by ESTIMATOR.
  For simulating non-randomness, the callback will typically use rlnre to generate a random sample and then apply some transformation.
  For simulating mixture distributions, it will typically generate multiple samples from different populations and merge them; the proportion of tokens from each population should be determined by a multinomial random variable. Individual populations might consist of LNRE models, or a finite number of “lexicalised” types. Note that only a single LNRE model will be passed to the callback; any other parameters have to be injected as bound variables in a local function definition.
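The mixture case can be sketched as follows. The factory make.mixture.sampler, the type labels lex1, lex2, ... and the mixture proportion of 0.1 are hypothetical choices made for this illustration. With only two components, the multinomial split reduces to a binomial draw, and the extra parameters are bound in a closure as suggested above.

```r
library(zipfR)

model <- lnre("zm", spc=ItaRi.spc)

## Hypothetical factory (not part of zipfR): returns a sampling callback for a
## mixture of the LNRE population and a finite set of "lexicalised" types
make.mixture.sampler <- function (lex.types, lex.prob) {
  function (model, n) {
    n.lex <- rbinom(1, n, lex.prob)                  # number of tokens from lexicalised types
    x.lnre <- as.character(rlnre(model, n - n.lex))  # remaining tokens from the LNRE population
    x.lex <- sample(lex.types, n.lex, replace=TRUE)  # uniform draws from the lexicalised types
    vec2spc(c(x.lnre, x.lex))                        # merged sample as frequency spectrum
  }
}

sampler <- make.mixture.sampler(lex.types=paste0("lex", 1:20), lex.prob=0.1)
res <- lnre.bootstrap(model, N=1000, replicates=20, sample=sampler,
                      ESTIMATOR=identity,
                      STATISTIC=function (x) c(V=V(x), V1=Vm(x, 1)),
                      verbose=FALSE)
```

Because the callback returns a frequency spectrum, it can be combined with any ESTIMATOR that accepts the sample="spc" format.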
See Also
lnre for more information about LNRE models. The high-level estimator function lnre uses lnre.bootstrap to collect data for approximate confidence intervals; lnre.productivity.measures uses it to approximate the sampling distributions of productivity measures.
Examples
## parametric bootstrapping from realistic LNRE model
model <- lnre("zm", spc=ItaRi.spc) # has quite a good fit
## estimate distribution of V, V1, V2 for sample size N=1000
res <- lnre.bootstrap(model, N=1000, replicates=200,
                      ESTIMATOR=identity,
                      STATISTIC=function (x) c(V=V(x), V1=Vm(x,1), V2=Vm(x,2)))
bootstrap.confint(res, method="normal")
## compare with theoretical expectations (EV/EVm = center, VV/VVm = spread^2)
lnre.spc(model, 1000, m.max=2, variances=TRUE)
## lnre.bootstrap() also captures and ignores occasional failures
res <- lnre.bootstrap(model, N=1000, replicates=200,
                      ESTIMATOR=function (x) if (runif(1) < .2) stop() else x,
                      STATISTIC=function (x) c(V=V(x), V1=Vm(x,1), V2=Vm(x,2)))
## empirical confidence intervals for vocabulary growth curve
## (this may become expensive because token-level samples have to be generated)
res <- lnre.bootstrap(model, N=1000, replicates=200, sample="tokens",
                      ESTIMATOR=vec2vgc, stepsize=100, # extra args passed to ESTIMATOR
                      STATISTIC=V) # extract vocabulary sizes at equidistant N
bootstrap.confint(res, method="normal")
## parallel processing is highly recommended for expensive bootstrapping
library(parallel)
## adjust number of processes according to available cores on your machine
cl <- makeCluster(2) # PSOCK cluster, should work on all platforms
res <- lnre.bootstrap(model, N=1e4, replicates=200, sample="tokens",
                      ESTIMATOR=vec2vgc, stepsize=1000, STATISTIC=V,
                      parallel=cl) # use cluster for parallelisation
bootstrap.confint(res, method="normal")
stopCluster(cl)
## on MacOS / Linux, simpler fork-based parallelisation also works well
## Not run:
res <- lnre.bootstrap(model, N=1e5, replicates=400, sample="tokens",
                      ESTIMATOR=vec2vgc, stepsize=1e4, STATISTIC=V,
                      parallel=8) # if you have enough cores ...
bootstrap.confint(res, method="normal")
## End(Not run)
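The first use case (bootstrapped confidence intervals for model parameters) can be sketched along the same lines. This is an illustration under assumptions rather than the procedure used internally by lnre or confint: it assumes that the parameters of a fitted ZM model are stored in the param component as alpha and B, and it uses a small number of replicates only to keep the run time short.

```r
library(zipfR)

model <- lnre("zm", spc=ItaRi.spc)

## re-estimate a ZM model from each bootstrap sample and collect its parameters
res <- lnre.bootstrap(model, N=N(ItaRi.spc), replicates=20, sample="spc",
                      ESTIMATOR=lnre, type="zm",   # type="zm" is passed on to lnre()
                      STATISTIC=function (m) c(alpha=m$param$alpha, B=m$param$B),
                      verbose=FALSE, seed=42)
bootstrap.confint(res, method="normal")
```

The sample size is set to that of the original frequency spectrum, so the spread of the bootstrapped estimates reflects the sampling variation at the size of the observed data.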