TextReuseTextDocument {textreuse}    R Documentation
TextReuseTextDocument
Description
This is the constructor function for TextReuseTextDocument objects. This
class is used for comparing documents.
Usage
TextReuseTextDocument(
  text,
  file = NULL,
  meta = list(),
  tokenizer = tokenize_ngrams,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE,
  skip_short = TRUE
)
is.TextReuseTextDocument(x)
has_content(x)
has_tokens(x)
has_hashes(x)
has_minhashes(x)
Arguments
text |
A character vector containing the text of the document. This
argument can be skipped if supplying file. |
file |
The path to a text file, if text is not supplied. |
meta |
A list with named elements for the metadata associated with this
document. If a document is created using the |
tokenizer |
A function to split the text into tokens. See
|
... |
Arguments passed on to the tokenizer function. |
hash_func |
A function to hash the tokens. See
|
minhash_func |
A function to create minhash signatures of the document.
See |
keep_tokens |
Should the tokens be saved in the document that is returned or discarded? |
keep_text |
Should the text be saved in the document that is returned or discarded? |
skip_short |
Should short documents be skipped? (See details.) |
x |
An R object to check. |
Details
This constructor function follows a three-step process. It reads in
the text, either from a file or from memory. It then tokenizes that text.
Then it hashes the tokens. Most of the comparison functions in this package
rely only on the hashes to make the comparison. By passing FALSE to
keep_tokens and keep_text, you can avoid saving those objects, which can
result in significant memory savings for large corpora.
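
The effect of keep_tokens and keep_text can be illustrated with a minimal sketch. It assumes the textreuse package is installed; the sample sentence and the document IDs "full" and "lean" are made up for illustration.

```r
library(textreuse)

# Sample text invented for this sketch.
text <- "How many roads must a man walk down before you can call him a man?"

# Keep everything: the text, the tokens, and the hashes.
full <- TextReuseTextDocument(text, meta = list(id = "full"),
                              keep_tokens = TRUE, keep_text = TRUE)

# Keep only the hashes: the text and tokens are discarded after hashing.
lean <- TextReuseTextDocument(text, meta = list(id = "lean"),
                              keep_tokens = FALSE, keep_text = FALSE)

# Check which elements each document still carries.
has_content(lean)
has_tokens(lean)
has_hashes(lean)

# The hashes do not depend on what was kept.
identical(hashes(full), hashes(lean))

# Compare the in-memory size of the two objects.
object.size(full)
object.size(lean)
```

Because the comparison functions work from the hashes, the lean document should remain usable for comparison even though its text and tokens are gone.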
If skip_short = TRUE, this function will return NULL for very short or
empty documents. A very short document is one where there are too few words
to create at least two n-grams. For example, if five-grams are desired, then
a document must be at least six words long. If no value of n is provided,
then the function assumes a value of n = 3. A warning will be printed with
the document ID of a skipped document.
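
The arithmetic behind the short-document rule can be sketched in base R. The helpers ngram_count() and is_too_short() are hypothetical names introduced for this illustration and are not part of textreuse; splitting on whitespace is also an assumption, so the package's own word count may differ at the margins.

```r
# Hypothetical helper (not part of textreuse): a document with `n_words`
# words yields n_words - n + 1 word n-grams, or none if it is shorter than n.
ngram_count <- function(n_words, n = 3) {
  max(n_words - n + 1, 0)
}

# Hypothetical helper: TRUE when the text has too few words to form at
# least two n-grams. Words are split on whitespace (an assumption).
is_too_short <- function(text, n = 3) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  ngram_count(length(words), n) < 2
}

is_too_short("one two three four five", n = 5)      # TRUE: only one five-gram
is_too_short("one two three four five six", n = 5)  # FALSE: two five-grams
is_too_short("")                                    # TRUE: empty document
```

With the default of n = 3, the same rule means a document needs at least four words to yield two n-grams.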
Value
An object of class TextReuseTextDocument. This object inherits from the
virtual S3 class TextDocument in the NLP package. It contains the following
elements:
- content
The text of the document.
- tokens
The tokens created from the text.
- hashes
Hashes created from the tokens.
- minhashes
The minhash signature of the document.
- metadata
The document metadata, including the filename (if any) in file.
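
As a rough sketch, the elements listed above can be inspected with the has_* predicates and the accessor functions. This assumes the textreuse package is installed and that its minhash_generator() helper is used to build the function passed as minhash_func; the sample text, the ID "fox", and the values n = 50 and seed = 1 are arbitrary choices for illustration.

```r
library(textreuse)

# Sample text invented for this sketch.
text <- "The quick brown fox jumps over the lazy dog near the quiet river bank"

# Assumption: minhash_generator() returns a function usable as minhash_func.
minhash <- minhash_generator(n = 50, seed = 1)

doc <- TextReuseTextDocument(text, meta = list(id = "fox"),
                             tokenizer = tokenize_ngrams, n = 3,
                             minhash_func = minhash,
                             keep_tokens = TRUE)

# Predicates report which elements are present in the document.
has_content(doc)
has_tokens(doc)
has_hashes(doc)
has_minhashes(doc)

# Accessors return the elements themselves.
content(doc)
head(tokens(doc), 3)
head(hashes(doc), 3)
length(minhashes(doc))
meta(doc)
```

Here n = 3 is passed through ... to the tokenizer, and keep_tokens = TRUE is set so that the tokens element is retained.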
See Also
Accessors for TextReuse objects.
Examples
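# Locate a sample document that ships with the package
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
# Create the document from the file, supplying an ID in the metadata
doc <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
print(doc)
# Inspect the metadata, tokens, and hashes
meta(doc)
head(tokens(doc))
head(hashes(doc))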
## Not run:
content(doc)
## End(Not run)