readtext {readtext} | R Documentation |
read a text file(s)
Description
Read texts and any associated document-level metadata from one or more source files. The texts come from the textual component of the files, and the document-level metadata ("docvars") come from either the file contents or the filenames.
Usage
readtext(
  file,
  ignore_missing_files = FALSE,
  text_field = NULL,
  docid_field = NULL,
  docvarsfrom = c("metadata", "filenames", "filepaths"),
  dvsep = "_",
  docvarnames = NULL,
  encoding = NULL,
  source = NULL,
  cache = TRUE,
  verbosity = readtext_options("verbosity"),
  ...
)
Arguments
file |
the complete filename(s) to be read. This is designed to automagically handle a number of common scenarios, so the value can be a "glob"-type wildcard value. Currently available filetypes are: Single file formats:
Reading multiple files and file types: In addition,
|
ignore_missing_files |
if |
text_field , docid_field |
a variable (column) name or column number
indicating where to find the texts that form the documents for the corpus
and their identifiers. This must be specified for file types |
docvarsfrom |
used to specify that docvars should be taken from the
filenames, when the |
dvsep |
separator (a regular expression character string) used in
filenames to delimit docvar elements if |
docvarnames |
character vector of variable names for |
encoding |
vector: either the encoding of all files, or one encoding for each file |
source |
used to specify specific formats of some input file types, such
as JSON or HTML. Currently supported types are |
cache |
if |
verbosity |
|
... |
additional arguments passed through to low-level file reading
function, such as |
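To make the roles of docvarsfrom = "filenames", dvsep, and docvarnames more concrete, the following is a minimal base R sketch of how a filename might be split into docvar elements. It is an illustration under stated assumptions, not the package's internal implementation: the filenames are hypothetical, and the splitting logic is only a plausible approximation of what readtext does.

```r
# Hypothetical filenames (illustration only; not read from disk)
filenames <- c("EU_euro_2004_en_PSE.txt", "EU_euro_2004_de_V.txt")

dvsep <- "_"  # separator, interpreted as a regular expression
docvarnames <- c("unit", "context", "year", "language", "party")

# Drop the directory and the extension, then split on the separator
stems <- tools::file_path_sans_ext(basename(filenames))
parts <- strsplit(stems, dvsep)

# Bind the pieces into one row per document and name the columns
docvars <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(docvars) <- docvarnames
docvars$doc_id <- basename(filenames)

print(docvars)
```

In this sketch every element stays a character string; whether and how readtext converts elements such as the year to another type is not shown here.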
Value
A data.frame consisting of the columns doc_id and text, which contain a document identifier and the texts respectively, with any additional columns consisting of document-level variables either found in the file containing the texts or created through the readtext call.
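The shape of the returned object can be checked with a small self-contained sketch. It assumes the readtext package is installed, writes two throwaway text files to a temporary directory (the file names and the docvar names "party" and "year" are made up for illustration), and then inspects the columns of the result.

```r
library(readtext)

# Create a scratch directory with two small, hypothetical text files
tmp <- file.path(tempdir(), "readtext_value_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines("First document.", file.path(tmp, "A_2020.txt"))
writeLines("Second document.", file.path(tmp, "B_2021.txt"))

# Read both files with a glob, taking docvars from the filenames
rt <- readtext(file.path(tmp, "*.txt"),
               docvarsfrom = "filenames",
               docvarnames = c("party", "year"))

# The result is a data.frame with doc_id, text, and any docvars
is.data.frame(rt)
names(rt)
nrow(rt)
```

The docvar columns appear after doc_id and text; their exact names and types depend on the arguments supplied in the call.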
Examples
## Not run:
## get the data directory
if (!interactive()) pkgload::load_all()
DATA_DIR <- system.file("extdata/", package = "readtext")
## read in some text data
# all UDHR files
(rt1 <- readtext(paste0(DATA_DIR, "/txt/UDHR/*")))
# manifestos with docvars from filenames
(rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                 docvarsfrom = "filenames",
                 docvarnames = c("unit", "context", "year", "language", "party"),
                 encoding = "LATIN1"))
# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"),
                 docvarsfrom = "filepaths", docvarnames = "sentiment"))
## read in csv data
(rt4 <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv")))
## read in tab-separated data
(rt5 <- readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech"))
## read in JSON data
(rt6 <- readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts"))
## read in pdf data
# UDHR
(rt7 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                 docvarsfrom = "filenames",
                 docvarnames = c("document", "language")))
Encoding(rt7$text)
## read in Word data (.doc)
(rt8 <- readtext(paste0(DATA_DIR, "/word/*.doc")))
Encoding(rt8$text)
## read in Word data (.docx)
(rt9 <- readtext(paste0(DATA_DIR, "/word/*.docx")))
Encoding(rt9$text)
## use elements of path and filename as docvars
(rt10 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                  docvarsfrom = "filepaths", dvsep = "[/_.]"))
## End(Not run)