R: read a text file(s)

readtext {readtext}

R Documentation

read a text file(s)

Description

Read texts and (if any) associated document-level metadata from one or more source files. The texts come from the textual component of the files, and the document-level metadata ("docvars") come from either the file contents or the filenames.

Usage

readtext(
  file,
  ignore_missing_files = FALSE,
  text_field = NULL,
  docid_field = NULL,
  docvarsfrom = c("metadata", "filenames", "filepaths"),
  dvsep = "_",
  docvarnames = NULL,
  encoding = NULL,
  source = NULL,
  cache = TRUE,
  verbosity = readtext_options("verbosity"),
  ...
)

Arguments

`file`	the complete filename(s) to be read. This is designed to automagically handle a number of common scenarios, so the value can be a "glob"-type wildcard value. Currently available filetypes are:
  Single file formats:
    `txt`	plain text files.
    So-called structured text files describe both texts and metadata. For all structured text filetypes, the column, field, or node which contains the text must be specified with the `text_field` parameter, and all other fields are treated as docvars.
    `json`	data in some form of JavaScript Object Notation, consisting of the texts and optionally additional docvars. The supported formats are:
      a single JSON object per file;
      line-delimited JSON, with one object per line;
      line-delimited JSON, of the format produced from a Twitter stream. This type of file has special handling which simplifies the Twitter format into docvars.
      The correct format for each JSON file is automatically detected.
    `csv,tab,tsv`	comma- or tab-separated values.
    `html`	HTML documents, including specialized formats from known sources, such as Nexis-formatted HTML. See the `source` parameter below.
    `xml`	XML documents of the kind that can be read by `xml2::read_xml()` and navigated through `xml2::xml_find_all()`. For xml files, an additional argument `collapse` may be passed through `...` that names the character(s) to use in appending different text elements together.
    `pdf`	PDF formatted files, converted through pdftools.
    `odt`	Open Document Text formatted files.
    `doc, docx`	Microsoft Word formatted files.
    `rtf`	Rich Text Files.
  Reading multiple files and file types:
    In addition, `file` need not be a path to a single local file; it can also be a combination of any of the above types, such as:
    a wildcard value	any valid pathname with a wildcard ("glob") expression that can be expanded by the operating system. This may consist of multiple file types.
    a URL to a remote file	which is downloaded and then loaded.
    `zip,tar,tar.gz,tar.bz` archive file	which is unzipped. The contained files must be either at the top level or in a single directory.
    Archives, remote URLs and glob patterns can resolve to any of the other filetypes, so you could have, for example, a remote URL to a zip file which contained Twitter JSON files.
`ignore_missing_files`	if `FALSE`, an error is thrown when the `file` argument doesn't resolve to an existing file. This can happen in a number of ways, including passing a path to a file that does not exist, to an empty archive file, or to a glob pattern that matches no files.
`text_field`, `docid_field`	a variable (column) name or column number indicating where to find the texts that form the documents for the corpus and their identifiers. This must be specified for `.csv`, `.json`, and `.xls`/`.xlsx` files. For XML files, an XPath expression can be specified.
`docvarsfrom`	specifies where docvars are taken from. With `"filenames"`, the `readtext` inputs are filenames whose elements are document variables separated by a delimiter (`dvsep`); this allows easy assignment of docvars from filenames such as `1789-Washington.txt`, `1793-Washington.txt`, etc. (here with `dvsep = "-"`). With `"filepaths"`, the full path to the file is considered, not just the filename. Docvars can also come from metadata embedded in the text file header (`headers`).
`dvsep`	separator (a regular expression character string) used in filenames to delimit docvar elements if `docvarsfrom = "filenames"` or `docvarsfrom = "filepaths"` is used.
`docvarnames`	character vector of variable names for `docvars`, if `docvarsfrom` is specified. If this argument is not used, default docvar names will be used (`docvar1`, `docvar2`, ...).
`encoding`	vector: either the encoding of all files, or one encoding for each file.
`source`	used to specify specific formats of some input file types, such as JSON or HTML. Currently supported types are `"twitter"` for JSON and `"nexis"` for HTML.
`cache`	if `TRUE`, save remote file to a temporary folder. Only used when `file` is a URL.
`verbosity`	the level of messages to output:
  0: output errors only
  1: output errors and warnings (default)
  2: output a brief summary message
  3: output detailed file-related messages
`...`	additional arguments passed through to low-level file reading functions, such as `file()`, `fread()`, etc. Useful for specifying an input encoding option, which is specified in the same way as it would be given to `iconv()`. See the Encoding section of `file` for details.
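
To make the interaction of `docvarsfrom = "filenames"`, `dvsep`, and `docvarnames` more concrete, the base R sketch below splits filename stems on a separator and names the resulting columns. It is an illustration only and not readtext's internal implementation: `split_docvars` is a hypothetical helper, the filenames are made up (patterned on the manifesto files in the Examples section), and all resulting columns are left as character strings.

```r
# Hypothetical helper, NOT part of readtext: split filename stems on `dvsep`
# (a regular expression) and return one column per element.
split_docvars <- function(paths, dvsep = "_", docvarnames = NULL) {
  # Drop directories and the file extension, keeping only the stem
  stems <- tools::file_path_sans_ext(basename(paths))
  parts <- strsplit(stems, dvsep)
  # Assumes every filename yields the same number of elements
  out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  if (is.null(docvarnames)) {
    # Default names in the documented style: docvar1, docvar2, ...
    docvarnames <- paste0("docvar", seq_len(ncol(out)))
  }
  names(out) <- docvarnames
  out
}

# Made-up filenames for illustration
files <- c("EU_euro_2004_de_PSE.txt", "EU_euro_2004_en_PSE.txt")
dv <- split_docvars(
  files,
  docvarnames = c("unit", "context", "year", "language", "party")
)
dv
```

Because `dvsep` is a regular expression, a character class such as `"[/_.]"` splits on any of several characters at once, which is what the last example in the Examples section relies on.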

Value

a data.frame consisting of the columns `doc_id` and `text`, which contain a document identifier and the texts respectively, with any additional columns consisting of document-level variables either found in the file containing the texts, or created through the `readtext` call.
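
As a rough picture of that layout, the sketch below builds a stand-in data.frame by hand and separates the two fixed columns from the docvars. The rows are invented for illustration, and the object is a plain data.frame rather than something actually returned by `readtext()`.

```r
# Hand-built stand-in for a readtext() result; the rows are made up
rt <- data.frame(
  doc_id = c("1789-Washington.txt", "1793-Washington.txt"),
  text = c("First sample text.", "Second sample text."),
  year = c(1789L, 1793L),
  stringsAsFactors = FALSE
)

# doc_id and text are the documented fixed columns; the rest are docvars
docvar_cols <- setdiff(names(rt), c("doc_id", "text"))
docvar_cols

# The texts are an ordinary character column
nchar(rt$text)
```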

Examples

## Not run: 
## get the data directory
if (!interactive()) pkgload::load_all()
DATA_DIR <- system.file("extdata/", package = "readtext")

## read in some text data
# all UDHR files
(rt1 <- readtext(paste0(DATA_DIR, "/txt/UDHR/*")))

# manifestos with docvars from filenames
(rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                 docvarsfrom = "filenames", 
                 docvarnames = c("unit", "context", "year", "language", "party"),
                 encoding = "LATIN1"))
                 
# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"), 
                 docvarsfrom = "filepaths", docvarnames = "sentiment"))

## read in csv data
(rt4 <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv")))

## read in tab-separated data
(rt5 <- readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech"))

## read in JSON data
(rt6 <- readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts"))

## read in pdf data
# UDHR
(rt7 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"), 
                 docvarsfrom = "filenames", 
                 docvarnames = c("document", "language")))
Encoding(rt7$text)

## read in Word data (.doc)
(rt8 <- readtext(paste0(DATA_DIR, "/word/*.doc")))
Encoding(rt8$text)

## read in Word data (.docx)
(rt9 <- readtext(paste0(DATA_DIR, "/word/*.docx")))
Encoding(rt9$text)

## use elements of path and filename as docvars
(rt10 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"), 
                  docvarsfrom = "filepaths", dvsep = "[/_.]"))

## End(Not run)
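
The `ignore_missing_files` behaviour described under Arguments can be tried out with a glob pattern that matches nothing. The sketch below assumes the readtext package may or may not be installed, so the call is guarded; the directory name is made up, and the error message is printed rather than assumed.

```r
# A glob pattern in a fresh temporary directory that matches no files
empty_dir <- file.path(tempdir(), "readtext_empty_sketch")
dir.create(empty_dir, showWarnings = FALSE)
pattern <- file.path(empty_dir, "*.txt")
matched <- Sys.glob(pattern)  # nothing in the directory to match

# Only call readtext() when the package is available
if (requireNamespace("readtext", quietly = TRUE)) {
  # With the default ignore_missing_files = FALSE this is documented to
  # raise an error, so the call is wrapped in tryCatch() and the message
  # is captured instead of stopping the script.
  result <- tryCatch(
    readtext::readtext(pattern),
    error = function(e) conditionMessage(e)
  )
  print(result)
}
```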

[Package readtext version 0.91 Index]