corpora-package {corpora} | R Documentation |
corpora: Statistical Inference from Corpus Frequency Data
Description
The corpora package provides a collection of functions for statistical inference from corpus frequency data, as well as some convenience functions and example data sets.
It is a companion package to the open-source course Statistical Inference: a Gentle Introduction for Linguists and similar creatures (SIGIL), originally developed by Marco Baroni and Stephanie Evert. The statistical methods implemented in the package are described and illustrated in the units of this course.
Starting with version 0.6, the package also includes best-practice implementations of various corpus-linguistic analysis techniques.
Details
This section gives an overview of some important functions and data sets included in the corpora package. See the package index for a complete listing.
Analysis functions
- keyness() provides reference implementations for best-practice keyness measures, including the recommended LRC measure (Evert 2022)
- binom.pval() is a vectorised function that computes p-values of the binomial test more efficiently than binom.test (using central p-values in the two-sided case)
- fisher.pval() is a vectorised function that efficiently computes p-values of Fisher's exact test on 2 × 2 contingency tables for large samples (using central p-values in the two-sided case)
- prop.cint() is a vectorised function that computes multiple binomial confidence intervals much more efficiently than binom.test
- z.score() and z.score.pval() can be used to carry out a z-test for a single proportion (as an approximation to binom.test)
- chisq() and chisq.pval() are vectorised functions that compute the test statistic and p-value of a chi-squared test for 2 × 2 contingency tables more efficiently than chisq.test
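The "central p-values" mentioned above can be illustrated with a minimal base-R sketch. It assumes the usual definition of a central two-sided p-value as twice the smaller one-sided tail probability, capped at 1. The helper central.binom.pval() is a hypothetical name introduced here for illustration; it is not part of the corpora package, and binom.pval() may differ in its details.

```r
# Hypothetical helper (NOT part of the corpora package): a vectorised
# two-sided binomial p-value, assuming the "central" definition,
# i.e. twice the smaller tail probability, capped at 1.
central.binom.pval <- function(k, n, p = 0.5) {
  lower <- pbinom(k, n, p)                          # P(X <= k)
  upper <- pbinom(k - 1, n, p, lower.tail = FALSE)  # P(X >= k)
  pmin(1, 2 * pmin(lower, upper))
}

# Made-up example data: three samples processed in a single call
k <- c(3, 10, 19)   # observed frequencies
n <- c(20, 20, 20)  # sample sizes
central.binom.pval(k, n, p = 0.5)
```

Because the function operates on whole vectors, no loop over binom.test is needed. For p = 0.5 the binomial distribution is symmetric and the result agrees with binom.test; for other values of p the two-sided p-values may differ, since binom.test uses a different definition.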
Utility functions
- cont.table() creates 2 × 2 contingency tables for frequency comparison tests that can be passed to chisq.test and fisher.test
- sample.df() extracts random samples of rows from a data frame
- qw() splits a string on whitespace or a user-specified regular expression (similar to Perl's qw// construct)
- corpora.palette() provides some nice colour palettes (better than R's default colours)
- rowVector() and colVector() convert a vector into a single-row or single-column matrix
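To make the frequency comparison setting concrete, the following base-R sketch builds a 2 × 2 contingency table by hand and passes it to chisq.test and fisher.test. The counts and the names k1, n1, k2, n2 are made up for illustration, and the table layout is an assumption; the matrix returned by cont.table() may be arranged differently.

```r
# Hypothetical frequency counts: a word occurs k1 times among n1 tokens
# in corpus A and k2 times among n2 tokens in corpus B.
k1 <- 42; n1 <- 10000
k2 <- 18; n2 <- 12000

# 2 x 2 table (assumed layout): rows = occurrences of the word vs. all
# other tokens, columns = the two corpora. matrix() fills column-wise.
ct <- matrix(c(k1, n1 - k1, k2, n2 - k2), nrow = 2,
             dimnames = list(c("word", "other"), c("A", "B")))
print(ct)

# Standard base-R tests on the table
chisq.test(ct)$p.value
fisher.test(ct)$p.value
```

Each column sums to the size of the corresponding corpus, which is a quick sanity check that the table was filled in the intended order.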
Data sets
- Several data sets based on the British National Corpus, including complete metadata for all 4048 text files (BNCmeta), per-text frequency counts for a number of linguistic corpus queries (BNCqueries), and relative frequencies of 65 lexico-grammatical features for each text (BNCbiber)
- Frequency counts of passive constructions in all texts of the Brown and LOB corpora (BrownLOBPassives) for frequency comparison with regression models, complemented by distributional features (DistFeatBrownFam) as additional predictors
- A small text corpus of Very Short Stories in the form of a data frame VSS, with one row for each token in the corpus
- Small example tables to illustrate frequency comparison of lexical items (BNCcomparison) and collocation analysis (BNCInChargeOf)
- KrennPPV is a data set of German verb-preposition-noun collocation candidates with manual annotation of true positives and pre-computed association scores
- Three functions for generating large synthetic data sets used in the SIGIL course: simulated.census(), simulated.language.course() and simulated.wikipedia()
Author(s)
Stephanie Evert (https://purl.org/stephanie.evert)
References
The official homepage of the corpora package and the SIGIL course is http://SIGIL.R-Forge.R-Project.org/.