TextReuseCorpus {textreuse} | R Documentation |
TextReuseCorpus
Description
This is the constructor function for a TextReuseCorpus, modeled on the
virtual S3 class Corpus from the tm package. The returned object is a
TextReuseCorpus, which is essentially a list containing objects of class
TextReuseTextDocument. Arguments are passed along to that constructor
function. To create the corpus, you can pass a character vector of paths to
text files using the paths = parameter, a directory containing text files
(with any extension) using the dir = parameter, or a character vector of
documents using the text = parameter, where each element in the character
vector is a document. If the character vector passed to text = has names,
then those names will be used as the document IDs. Otherwise, IDs will be
assigned to the documents. Only one of the paths, dir, or text parameters
should be specified.
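As a minimal sketch of the text = route, assuming the textreuse package is installed, the following builds a two-document corpus from a named character vector. The names first and second and the sample sentences are made up for illustration.

```r
library(textreuse)

# Hypothetical sample documents; the vector names become the document IDs.
docs <- c(
  first  = "The quick brown fox jumps over the lazy dog near the river",
  second = "A quick brown fox jumped over a sleeping dog by the river"
)

# Only text = is supplied; paths and dir are left unset.
corpus <- TextReuseCorpus(text = docs, progress = FALSE)

is.TextReuseCorpus(corpus)  # check the class of the result
names(corpus)               # document IDs taken from names(docs)
```

Both sample documents are well above the minimum length described under Details, so neither should be skipped with the default tokenizer.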
Usage
TextReuseCorpus(
  paths,
  dir = NULL,
  text = NULL,
  meta = list(),
  progress = interactive(),
  tokenizer = tokenize_ngrams,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE,
  skip_short = TRUE
)
is.TextReuseCorpus(x)
skipped(x)
Arguments
paths | A character vector of paths to files to be opened. |
dir | The path to a directory of text files. |
text | A character vector (possibly named) of documents. |
meta | A list with named elements for the metadata associated with this corpus. |
progress | Display a progress bar while loading files. |
tokenizer | A function to split the text into tokens. See |
... | Arguments passed on to the |
hash_func | A function to hash the tokens. See |
minhash_func | A function to create minhash signatures of the document. See |
keep_tokens | Should the tokens be saved in the documents that are returned or discarded? |
keep_text | Should the text be saved in the documents that are returned or discarded? |
skip_short | Should short documents be skipped? (See details.) |
x | An R object to check. |
Details
If skip_short = TRUE, this function will skip very short or empty
documents. A very short document is one where there are too few words to
create at least two n-grams. For example, if five-grams are desired, then a
document must be at least six words long. If no value of n is provided, then
the function assumes a value of n = 3. A warning will be printed with the
document ID of each skipped document. Use skipped() to get the IDs of
skipped documents.
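The length rule can be checked with a little arithmetic: a document of w words yields w - n + 1 n-grams, so at least two n-grams requires w >= n + 1. The sketch below restates that rule in base R. The helpers ngram_count() and long_enough() are hypothetical names for illustration, not functions from the package, and this is not necessarily how the package performs the check internally.

```r
# Number of n-grams that a document of `words` words yields (never negative).
# Hypothetical helper, not part of textreuse.
ngram_count <- function(words, n = 3) {
  pmax(words - n + 1, 0)
}

# A document is long enough if it yields at least two n-grams.
# Hypothetical helper, not part of textreuse.
long_enough <- function(words, n = 3) {
  ngram_count(words, n) >= 2
}

long_enough(5, n = 5)  # FALSE: five words give a single five-gram
long_enough(6, n = 5)  # TRUE: six words give two five-grams
long_enough(3)         # FALSE under the assumed default n = 3
long_enough(4)         # TRUE under the assumed default n = 3
```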
This function will use multiple cores on non-Windows machines if the
"mc.cores" option is set. For example, to use four cores:
options("mc.cores" = 4L).
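One way to scope the setting to a single corpus build is to save the previous options and restore them afterwards. This is a sketch using only base R's options() and getOption(); the variable names old and cores are arbitrary.

```r
# options() returns the previous values, so they can be restored later.
old <- options(mc.cores = 4L)

cores <- getOption("mc.cores")  # the value in effect while building the corpus

# ... call TextReuseCorpus() here ...

# Restore whatever was set before (including "not set at all").
options(old)
```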
See Also
Accessors for TextReuse objects.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, meta = list("description" = "Field Codes"))
# Subset by position or file name
corpus[[1]]
names(corpus)
corpus[["ca1851-match"]]
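A further sketch, assuming the textreuse package is installed, illustrates skip_short and skipped() with made-up documents. The IDs long and short and the sample text are hypothetical. With the default tokenizer and n = 3, the two-word document is too short to yield two n-grams.

```r
library(textreuse)

# Hypothetical sample documents; "short" has only two words.
docs <- c(
  long  = "This document has more than enough words to form several trigrams",
  short = "Too short"
)

# suppressWarnings() hides the warning that names the skipped document.
corpus <- suppressWarnings(TextReuseCorpus(text = docs, progress = FALSE))

names(corpus)    # IDs of the documents that were kept
skipped(corpus)  # IDs of the documents that were skipped
```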