make.samples {stylo} | R Documentation |
Split text to samples
Description
Function that either splits an input text (a vector of linguistic items, such as words, word n-grams, character n-grams, etc.) into equal-sized samples of a desired length (expressed in words), or excerpts randomly a number of words from the original text.
Usage
make.samples(tokenized.text, sample.size = 10000,
sampling = "no.sampling", sample.overlap = 0,
number.of.samples = 1, sampling.with.replacement = FALSE)
Arguments
tokenized.text |
input textual data stored either in a form of vector (single text), or as a list of vectors (whole corpus); particular vectors should contain tokenized data, i.e. words, word n-grams, or other features, as elements. |
sample.size |
desired size of sample expressed in number of words; default value is 10,000. |
sampling |
one of three values: |
sample.overlap |
if this opion is used, a reference text is segmented
into consecutive, equal-sized samples that are allowed to partially
overlap. If one specifies the |
number.of.samples |
optional argument which will be used only if
|
sampling.with.replacement |
optional argument which will be used only
if |
Details
Normal sampling is probably a good choice when the input texts are
long: the advantage is that one gets a bigger number of samples which,
in a way, validate the results (when several independent samples excerpted
from one text are clustered together).
When the analyzed texts are significantly unequal in length, it is not
a bad idea to prepare samples as randomly chosen "bags of words". For this,
set the sampling
variable to random.sampling
. The desired
size of the sample should be specified via the sample.size
variable.
Sampling with and without replacement is also available. It has been shown
by Eder (2010) that harvesting random samples from original texts improves
the performance of authorship attribution methods.
Author(s)
Mike Kestemont, Maciej Eder
References
Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem. "Digital Scholarship in the Humanities", 30(2): 167-182.
See Also
txt.to.words
, txt.to.words.ext
,
txt.to.features
, make.ngrams
Examples
my.text = "Arma virumque cano, Troiae qui primus ab oris
Italiam fato profugus Laviniaque venit
litora, multum ille et terris iactatus et alto
vi superum, saevae memorem Iunonis ob iram,
multa quoque et bello passus, dum conderet urbem
inferretque deos Latio; genus unde Latinum
Albanique patres atque altae moenia Romae.
Musa, mihi causas memora, quo numine laeso
quidve dolens regina deum tot volvere casus
insignem pietate virum, tot adire labores
impulerit. tantaene animis caelestibus irae?"
my.words = txt.to.words(my.text)
# split the above text into samples of 20 words:
make.samples(my.words, sampling = "normal.sampling", sample.size = 20)
# excerpt randomly 50 words from the above text:
make.samples(my.words, sampling = "random.sampling", sample.size = 50)
# excerpt 5 random samples from the above text:
make.samples(my.words, sampling = "random.sampling", sample.size = 50,
number.of.samples = 5)