readPDF {tm} | R Documentation |
Read In a PDF Document
Description
Return a function which reads in a portable document format (PDF) document extracting both its text and its metadata.
Usage
readPDF(engine = c("pdftools", "xpdf", "Rpoppler",
"ghostscript", "Rcampdf", "custom"),
control = list(info = NULL, text = NULL))
Arguments
engine
a character string for the preferred PDF extraction engine (see Details).
control
a list of control options for the engine with the named components info and text (see Details).
Details
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature. The returned function can still access the arguments passed to readPDF (e.g., the preferred PDF extraction engine and the control options) via lexical scoping.
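The generator pattern described above can be sketched in a few lines of base R. The function make_reader below is hypothetical and exists only for illustration; it is not how tm implements readPDF. It only shows how a returned closure keeps access to the generator's arguments.

```r
# Hypothetical generator, for illustration only; not tm's implementation.
make_reader <- function(engine = "pdftools", control = list()) {
  # Force evaluation so both values are fixed when the reader is created.
  force(engine)
  force(control)
  # The returned closure has the reader signature (elem, language, id) and
  # still sees `engine` and `control` through its enclosing environment.
  function(elem, language, id) {
    list(engine = engine, control = control,
         uri = elem$uri, language = language)
  }
}

reader <- make_reader("xpdf", control = list(text = "-layout"))
names(formals(reader))
res <- reader(elem = list(uri = "example.pdf"), language = "en", id = "id1")
res$engine
```

The engine is chosen once, when the reader is created, so the same reader can be applied to many documents without passing the engine again.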
Available PDF extraction engines are as follows.
"pdftools"
(default) Poppler PDF rendering library as provided by the functions pdf_info and pdf_text in package pdftools.
"xpdf"
command line pdfinfo and pdftotext executables which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.xpdfreader.com/) PDF viewer or by the Poppler (https://poppler.freedesktop.org/) PDF rendering library.
"Rpoppler"
Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.
"ghostscript"
Ghostscript using ‘pdf_info.ps’ and ‘ps2ascii.ps’.
"Rcampdf"
Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.
"custom"
custom user-provided extraction engine.
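Which engine is usable depends on what is installed locally. The following is a minimal sketch of one way to pick an engine at run time. The helper pick_engine and its order of preference are assumptions made for illustration and are not part of tm. The Ghostscript check only looks for the executable, not for the PostScript files the engine relies on.

```r
# Hypothetical helper: returns the first engine that appears to be available,
# or NA if none is found. The order of preference is an arbitrary choice.
pick_engine <- function() {
  # Package-based engines: system.file() returns "" if a package is missing.
  if (nzchar(system.file(package = "pdftools"))) return("pdftools")
  # The "xpdf" engine needs both command line executables on the search path.
  if (all(nzchar(Sys.which(c("pdfinfo", "pdftotext"))))) return("xpdf")
  if (nzchar(system.file(package = "Rpoppler"))) return("Rpoppler")
  # tools::find_gs_cmd() returns "" when no Ghostscript executable is found.
  if (nzchar(tools::find_gs_cmd())) return("ghostscript")
  NA_character_
}

engine <- pick_engine()
```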
Control parameters for engine "xpdf" are as follows.
info
a character vector specifying options passed over to the pdfinfo executable.
text
a character vector specifying options passed over to the pdftotext executable.
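As an illustration of these control parameters, the sketch below asks pdftotext to keep the physical page layout by passing its -layout option. It assumes that package tm is installed and, for the commented call, that the pdfinfo and pdftotext executables are available. The file path in the comment is a placeholder.

```r
# Options for the command line tools: nothing extra for pdfinfo,
# "-layout" for pdftotext.
xpdf_control <- list(info = NULL, text = "-layout")

if (requireNamespace("tm", quietly = TRUE)) {
  # Creating the reader does not run the executables yet.
  reader <- tm::readPDF(engine = "xpdf", control = xpdf_control)
  # Applying it would run them (placeholder path):
  # doc <- reader(elem = list(uri = "path/to/file.pdf"),
  #               language = "en", id = "id1")
}
```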
Control parameters for engine "custom" are as follows.
info
a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).
text
a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector.
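A custom engine satisfying this contract might look like the sketch below. The functions custom_info and custom_text are hypothetical. The metadata values are placeholders derived from the file itself rather than read from the PDF, and the text extraction assumes that package pdftools is installed.

```r
# Hypothetical metadata extractor: returns the five required components.
# The values are placeholders, not metadata read from the PDF.
custom_info <- function(file, ...) {
  list(Author = "unknown",
       CreationDate = as.POSIXlt(file.mtime(file)),  # must be POSIXlt
       Subject = "",
       Title = basename(file),
       Creator = "custom_info")
}

# Hypothetical content extractor: returns a character vector, here one
# element per page as produced by pdftools::pdf_text().
custom_text <- function(file, ...) {
  if (!requireNamespace("pdftools", quietly = TRUE))
    stop("this sketch assumes that package pdftools is installed")
  pdftools::pdf_text(file)
}

custom_control <- list(info = custom_info, text = custom_text)

if (requireNamespace("tm", quietly = TRUE)) {
  reader <- tm::readPDF(engine = "custom", control = custom_control)
}
```

Both functions take the file path as first argument and accept further arguments via `...`, so they stay usable even if they are called with additional arguments.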
Value
A function with the following formals:
elem
a named list with the component uri which must hold a valid file name.
language
a string giving the language.
id
Not used.
The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.
See Also
Reader for basic information on the reader infrastructure employed by package tm.
Examples
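library(tm)
# URI of the PDF vignette shipped with package tm.
uri <- paste0("file://",
              system.file(file.path("doc", "tm.pdf"), package = "tm"))
# Prefer pdftools if it is installed, otherwise fall back to Ghostscript.
engine <- if(nzchar(system.file(package = "pdftools"))) {
    "pdftools"
} else {
    "ghostscript"
}
# Create the reader, then apply it to a single document.
reader <- readPDF(engine)
pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
# Print the first element of the extracted text.
cat(content(pdf)[1])
# Use a reader inside a corpus; this call requires Ghostscript.
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))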