| readPDF {tm} | R Documentation | 
Read In a PDF Document
Description
Return a function which reads in a Portable Document Format (PDF) document, extracting both its text and its metadata.
Usage
readPDF(engine = c("pdftools", "xpdf", "Rpoppler",
                   "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))
Arguments
| engine | a character string for the preferred PDF extraction engine (see Details). | 
| control | a list of control options for the engine with the named components info and text (see Details). | 
Details
Formally, this function is a function generator, i.e., it returns a function
(which reads in a PDF document) with a well-defined signature. The returned
function can still access the arguments passed to the generator (e.g., the
preferred PDF extraction engine and control options) via lexical scoping.
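The generator pattern described above can be illustrated with a minimal sketch in plain R. It does not reproduce the internals of readPDF; the names make_reader and convert, and the "upper"/"lower" engines, are hypothetical and exist only for this illustration.

```r
# Hypothetical generator; `make_reader`, `convert`, and the engine names
# are illustrative assumptions, not part of package tm.
make_reader <- function(engine = c("upper", "lower")) {
    engine <- match.arg(engine)   # resolved once, when the reader is created
    convert <- switch(engine, upper = toupper, lower = tolower)
    # The returned function has a fixed signature but still sees `engine`
    # and `convert` through lexical scoping.
    function(elem, language, id) {
        list(content = convert(elem$content),
             language = language,
             id = id,
             engine = engine)
    }
}

reader <- make_reader("upper")
doc <- reader(elem = list(content = "some text"), language = "en", id = "id1")
doc$content
```

The reader itself takes only elem, language, and id, so any code that calls it needs no knowledge of how the engine was configured.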
Available PDF extraction engines are as follows.
- "pdftools"
- (default) Poppler PDF rendering library as provided by the functions pdf_info and pdf_text in package pdftools.
- "xpdf"
- command line pdfinfo and pdftotext executables which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.xpdfreader.com/) PDF viewer or by the Poppler (https://poppler.freedesktop.org/) PDF rendering library.
- "Rpoppler"
- Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.
- "ghostscript"
- Ghostscript using ‘pdf_info.ps’ and ‘ps2ascii.ps’. 
- "Rcampdf"
- Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.
- "custom"
- custom user-provided extraction engine. 
Control parameters for engine "xpdf" are as follows.
- info
- a character vector specifying options passed to the pdfinfo executable.
- text
- a character vector specifying options passed to the pdftotext executable.
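As a hedged sketch of the "xpdf" control parameters above, the following builds a control list and passes it to readPDF. It assumes that the pdftotext executable on your system accepts the -layout option (which preserves the physical layout of the text); the variable name xpdf_control is illustrative. The reader is only created here, not applied, and the call is skipped if package tm is not installed.

```r
# Options for the command line tools: none for pdfinfo, and "-layout" for
# pdftotext (assumed to be supported by the installed executable).
xpdf_control <- list(info = NULL, text = "-layout")

if (requireNamespace("tm", quietly = TRUE)) {
    # Creating the reader does not run the executables; they are invoked
    # only when the reader is applied to a PDF file.
    xpdf_reader <- tm::readPDF(engine = "xpdf", control = xpdf_control)
}
```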
Control parameters for engine "custom" are as follows.
- info
- a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).
- text
- a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector. 
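The contract for the "custom" engine can be sketched as follows. This is an illustration under stated assumptions, not a recommended extraction method: custom_info and custom_text are hypothetical names, the metadata are placeholders derived from the file system rather than read from the PDF itself, and the text function assumes package pdftools is installed.

```r
# Hypothetical metadata extractor: returns the five required components.
# The values are stand-ins taken from the file system, not from the PDF.
custom_info <- function(file) {
    list(Author = NA_character_,
         CreationDate = as.POSIXlt(file.info(file)$mtime),
         Subject = NA_character_,
         Title = basename(file),
         Creator = "custom_info sketch")
}

# Hypothetical text extractor: returns a character vector (one element per
# page) with surrounding whitespace trimmed. Assumes pdftools is available.
custom_text <- function(file) {
    if (!requireNamespace("pdftools", quietly = TRUE))
        stop("this sketch assumes package pdftools is installed")
    trimws(pdftools::pdf_text(file))
}

if (requireNamespace("tm", quietly = TRUE)) {
    custom_reader <- tm::readPDF(engine = "custom",
                                 control = list(info = custom_info,
                                                text = custom_text))
}
```

Both functions take the file path as their first argument, so the resulting reader can be used wherever a reader created with one of the built-in engines would be.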
Value
A function with the following formals:
- elem
- a named list with the component uri which must hold a valid file name.
- language
- a string giving the language. 
- id
- Not used. 
The function returns a PlainTextDocument representing the text
and metadata extracted from elem$uri.
See Also
Reader for basic information on the reader infrastructure
employed by package tm.
Examples
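# Build a file URI for the tm.pdf file shipped in the doc directory of package tm.
uri <- paste0("file://",
              system.file(file.path("doc", "tm.pdf"), package = "tm"))
# Prefer the pdftools engine if it is installed; otherwise fall back to Ghostscript.
engine <- if(nzchar(system.file(package = "pdftools"))) {
    "pdftools"
} else {
    "ghostscript"
}
# Create the reader and apply it to a single element.
reader <- readPDF(engine)
pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
# Print the first element of the extracted content.
cat(content(pdf)[1])
# Use a reader when constructing a corpus from the same URI.
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))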