simulated.wikipedia {corpora}  R Documentation
Simulated type and token counts for Wikipedia articles (corpora)
Description
This function generates type and token counts, token-type ratios (TTR) and average word lengths for simulated articles from the English Wikipedia. Simulation parameters are based on data from the Wackypedia corpus.
The generated data set is usually named WackypediaStats (see the code examples below) and is used for various exercises and illustrations in the SIGIL course.
Usage
simulated.wikipedia(N=1429649, length=c(100,1000), seed.rng=42)
Arguments
N
population size, i.e. the total number of Wikipedia articles
length
a numeric vector of length 2, specifying the typical range of Wikipedia article lengths
seed.rng
seed for the random number generator, so that data sets generated with the same parameters are reproducible
Details
The default population size corresponds to the subset of the Wackypedia corpus from which the simulation parameters were obtained. This excludes all articles with extreme type-token statistics (very short, very long, extremely long words, etc.).
Article lengths are sampled from a lognormal distribution which is scaled so that the
central 95% of the values fall into the range specified by the length
argument.
The simulated data are surprisingly close to the original Wackypedia statistics.
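The scaling described above can be sketched directly: the logarithms of lognormal article lengths follow a normal distribution, so matching the central 95% of values to the range c(100, 1000) determines meanlog and sdlog. The following is an illustrative sketch, not the package's actual implementation; variable names are assumptions:

```r
## Derive lognormal parameters so that the central 95% of simulated
## article lengths falls into the range c(100, 1000).
## Illustration only -- not the function's actual code.
length.range <- c(100, 1000)
z <- qnorm(0.975)                       # approx. 1.96: half-width of central 95%
meanlog <- mean(log(length.range))      # midpoint on the log scale
sdlog <- diff(log(length.range)) / (2 * z)

set.seed(42)
lengths <- round(rlnorm(10000, meanlog = meanlog, sdlog = sdlog))
mean(lengths >= 100 & lengths <= 1000)  # fraction inside the range, close to 0.95
```

With these parameters, roughly 2.5% of simulated articles are shorter than 100 tokens and 2.5% are longer than 1000 tokens, as the "typical range" wording suggests.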
Value
A data frame with N rows corresponding to Wikipedia articles and the following columns:
tokens: number of word tokens in the article
types: number of distinct word types in the article
ttr: token-type ratio (TTR) for the article
avglen: average word length in characters (averaged across tokens)
Author(s)
Stephanie Evert (https://purl.org/stephanie.evert)
References
The Wackypedia corpus can be obtained from https://wacky.sslmit.unibo.it/doku.php?id=corpora.
Examples
WackypediaStats <- simulated.wikipedia()
summary(WackypediaStats)