R: Read In a PDF Document

readPDF {tm}

R Documentation

Read In a PDF Document

Description

Return a function that reads a Portable Document Format (PDF) document and extracts both its text and its metadata.

Usage

readPDF(engine = c("pdftools", "xpdf", "Rpoppler",
                   "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))

Arguments

`engine`	a character string for the preferred PDF extraction engine (see Details).
`control`	a list of control options for the engine with the named components `info` and `text` (see Details).

Details

Formally, this function is a function generator: it returns a function (which reads in a PDF document) with a well-defined signature, and the returned function can still access the arguments passed to the generator (e.g., the preferred PDF extraction engine and control options) via lexical scoping.
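
The generator pattern can be sketched in base R as follows. This is only an illustration of lexical scoping, not the implementation used by readPDF; the names make_reader and prefix are hypothetical.

```r
# Hypothetical generator: 'prefix' is captured by the returned closure.
make_reader <- function(prefix = "doc") {
  # The returned function has a fixed signature (elem, language, id),
  # yet it can still see 'prefix' through lexical scoping.
  function(elem, language, id) {
    list(label = paste(prefix, basename(elem$uri), sep = ":"),
         language = language)
  }
}

reader <- make_reader(prefix = "pdf")
res <- reader(elem = list(uri = "/tmp/example.pdf"),
              language = "en", id = "id1")
print(res$label)  # "pdf:example.pdf"
```

The caller configures the generator once and then hands the resulting function to any code that expects the fixed three-argument signature.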

Available PDF extraction engines are as follows.

"pdftools": (default) Poppler PDF rendering library as provided by the functions pdf_info and pdf_text in package pdftools.
"xpdf": command line pdfinfo and pdftotext executables which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.xpdfreader.com/) PDF viewer or by the Poppler (https://poppler.freedesktop.org/) PDF rendering library.
"Rpoppler": Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.
"ghostscript": Ghostscript using ‘pdf_info.ps’ and ‘ps2ascii.ps’.
"Rcampdf": Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.
"custom": custom user-provided extraction engine.

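Which engines are usable depends on the local installation. A minimal sketch for picking one is shown below; pick_engine is a hypothetical helper that is not part of tm, and the order of the checks is an assumption rather than a recommendation from the package.

```r
# Hypothetical helper, not part of tm: return the first engine
# that appears to be usable on this system.
pick_engine <- function() {
  if (requireNamespace("pdftools", quietly = TRUE)) return("pdftools")
  # "xpdf" needs both command line tools on the search path.
  if (all(nzchar(Sys.which(c("pdfinfo", "pdftotext"))))) return("xpdf")
  if (requireNamespace("Rpoppler", quietly = TRUE)) return("Rpoppler")
  # tools::find_gs_cmd() returns "" when no Ghostscript executable is found.
  if (nzchar(tools::find_gs_cmd())) return("ghostscript")
  stop("no PDF extraction engine found")
}

# Report the choice, or NA when nothing suitable is installed.
engine <- tryCatch(pick_engine(), error = function(e) NA_character_)
print(engine)
```

The result can then be passed on as readPDF(engine = engine).
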
Control parameters for engine "xpdf" are as follows.

info: a character vector specifying options passed to the pdfinfo executable.
text: a character vector specifying options passed to the pdftotext executable.
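
As an illustration, the sketch below passes the pdftotext option -layout, which asks the tool to keep the physical page layout. It assumes that tm is installed; the pdfinfo and pdftotext executables are only needed once the returned reader is actually applied to a file.

```r
if (requireNamespace("tm", quietly = TRUE)) {
  # No extra options for pdfinfo; "-layout" is handed to pdftotext.
  reader <- tm::readPDF(engine = "xpdf",
                        control = list(info = NULL, text = "-layout"))
  # 'reader' is an ordinary function and can be used directly or
  # supplied via readerControl when building a corpus.
  print(is.function(reader))
}
```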

Control parameters for engine "custom" are as follows.

info: a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).
text: a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector.
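
A minimal sketch of a custom engine that satisfies this contract is given below. The functions my_info and my_text are hypothetical: my_info does not parse the PDF and only derives placeholder metadata from the file system, and my_text assumes that package pdftools is installed when it is eventually called.

```r
# Hypothetical metadata extractor: returns the five required components,
# filled with placeholder values taken from the file system.
my_info <- function(path) {
  list(Author       = "unknown",
       CreationDate = as.POSIXlt(file.info(path)$mtime, tz = "GMT"),
       Subject      = "",
       Title        = basename(path),
       Creator      = "my_info")
}

# Hypothetical text extractor: delegates to pdftools (one string per page)
# and collapses runs of spaces.
my_text <- function(path) {
  pages <- pdftools::pdf_text(path)
  gsub(" +", " ", pages)
}

if (requireNamespace("tm", quietly = TRUE)) {
  reader <- tm::readPDF(engine = "custom",
                        control = list(info = my_info, text = my_text))
}
```

Wrapping an existing extractor this way is one option for adding post-processing without changing how the reader is called.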

Value

A function with the following formals:

elem: a named list with the component uri which must hold a valid file name.
language: a string giving the language.
id: Not used.

The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.
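
The sketch below inspects the returned function and the resulting document. It uses the "custom" engine with stub extractors (stub_info and stub_text are hypothetical) so that it runs without any PDF tooling; it assumes only that tm is installed.

```r
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # Hypothetical stubs standing in for real extraction functions.
  stub_info <- function(path) {
    list(Author = "A. Author",
         CreationDate = as.POSIXlt("2020-01-01", tz = "GMT"),
         Subject = "demo", Title = "Demo", Creator = "stub")
  }
  stub_text <- function(path) c("first page", "second page")

  reader <- readPDF(engine = "custom",
                    control = list(info = stub_info, text = stub_text))
  print(names(formals(reader)))  # "elem" "language" "id"

  doc <- reader(elem = list(uri = "demo.pdf"), language = "en", id = "id1")
  print(inherits(doc, "PlainTextDocument"))
  print(content(doc))            # the extracted text
  print(meta(doc, "author"))     # metadata supplied by stub_info
}
```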

Examples

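# Locate the vignette PDF shipped with tm and turn its path into a file URI.
uri <- paste0("file://",
              system.file(file.path("doc", "tm.pdf"), package = "tm"))
# Prefer pdftools when it is installed; otherwise fall back to Ghostscript.
engine <- if(nzchar(system.file(package = "pdftools"))) {
    "pdftools"
} else {
    "ghostscript"
}
# Build the reader, then apply it to a single element.
reader <- readPDF(engine)
pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
# Print the first element of the extracted text.
cat(content(pdf)[1])
# Use the same engine when constructing a corpus from the URI.
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = engine)))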

[Package tm version 0.7-13 Index]