extract_text {tabulapdf} | R Documentation |
extract_text
Description
Extract text from a PDF file
Usage
extract_text(
file,
pages = NULL,
area = NULL,
password = NULL,
encoding = NULL,
copy = FALSE
)
Arguments
file |
A character string specifying the path or URL to a PDF file. |
pages |
An optional integer vector specifying pages to extract from. |
area |
An optional list, of length equal to the number of pages specified, where each entry contains a four-element numeric vector of coordinates (top, left, bottom, right) delimiting the region to extract text from on the corresponding page. As a convenience, a list of length 1 can be used to extract the same area from all (specified) pages. |
password |
Optionally, a character string containing a user password to access a secured PDF. |
encoding |
Optionally, a character string specifying an encoding for the text, to be passed to the assignment method of |
copy |
Specifies whether the original local file(s) should be copied to
|
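As a minimal sketch of how the area argument is shaped, the helper below builds a length-1 list holding one (top, left, bottom, right) vector. The name make_area is hypothetical and not part of tabulapdf. The ordering checks assume coordinates are measured from the top-left corner of the page, which is consistent with the values used under Examples but is not stated on this page.

```r
# Hypothetical helper (not part of tabulapdf): build an `area` argument
# as a length-1 list containing c(top, left, bottom, right).
make_area <- function(top, left, bottom, right) {
  coords <- c(top, left, bottom, right)
  # Exactly four non-missing numeric coordinates are required.
  stopifnot(is.numeric(coords), length(coords) == 4, !anyNA(coords))
  # Assumption: origin at the top-left of the page, so bottom > top
  # and right > left.
  stopifnot(bottom > top, right > left)
  list(coords)
}

area <- make_area(209.4, 140.5, 304.2, 500.8)

# Usage, assuming tabulapdf and its Java dependency are installed:
# f <- system.file("examples", "fortytwo.pdf", package = "tabulapdf")
# tabulapdf::extract_text(f, pages = 1, area = area)
```

Because the list has length 1, the same region would be applied to every page selected with pages.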
Details
This function converts the contents of a PDF file into unstructured text: a single character string for the whole document, or one string per page when pages is specified.
Value
If pages = NULL (the default), a length 1 character vector, otherwise a vector of length length(pages).
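Since the result is an ordinary character vector, it can be post-processed with base R. The sketch below splits each element into trimmed, non-empty lines. The helper text_to_lines is hypothetical, the input strings are made-up stand-ins for real output, and the code assumes the extracted text separates lines with newline characters.

```r
# Hypothetical helper (not part of tabulapdf): split each extracted string
# into trimmed, non-empty lines. Returns a list with one character vector
# per input element.
text_to_lines <- function(txt) {
  stopifnot(is.character(txt))
  lapply(strsplit(txt, "\r?\n"), function(x) {
    x <- trimws(x)
    x[nzchar(x)]
  })
}

# Made-up stand-in for the result of extract_text(f, pages = 1:2)
pages_text <- c("Title\n\nfirst line \n", "second page\r\nlast line")
page_lines <- text_to_lines(pages_text)
```

With pages = NULL the input would have length 1, so the result would be a list with a single element.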
Author(s)
Thomas J. Leeper <thosjleeper@gmail.com>
See Also
extract_tables, extract_areas, split_pdf
Examples
# simple demo file
f <- system.file("examples", "fortytwo.pdf", package = "tabulapdf")
# extract all text
extract_text(f)
# extract all text from page 1 only
extract_text(f, pages = 1)
# extract text from selected area only
extract_text(f, area = list(c(209.4, 140.5, 304.2, 500.8)))