extract_text {tabulapdf}

R Documentation

extract_text

Description

Extract text from a file

Usage

extract_text(
  file,
  pages = NULL,
  area = NULL,
  password = NULL,
  encoding = NULL,
  copy = FALSE
)

Arguments

`file`	A character string specifying the path or URL to a PDF file.
`pages`	An optional integer vector specifying pages to extract from.
`area`	An optional list, of length equal to the number of pages specified, where each entry is a four-element numeric vector of coordinates (top, left, bottom, right) delimiting the region to extract text from on the corresponding page. As a convenience, a list of length 1 can be used to extract the same area from all (specified) pages.
`password`	Optionally, a character string containing a user password to access a secured PDF.
`encoding`	Optionally, a character string specifying an encoding for the text, to be passed to the assignment method of `Encoding`.
`copy`	Specifies whether the original local file(s) should be copied to `tempdir()` before processing. `FALSE` by default. The argument is ignored if `file` is a URL.
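
The relationship between `pages` and `area` can be sketched as follows. This is a minimal illustration, not output from a real document: the file name `report.pdf` and all coordinates are hypothetical placeholders, and the call only runs if both the package and the file are available.

```r
# Hypothetical input: replace with the path to your own multi-page PDF.
pdf_path <- "report.pdf"

# One rectangle per requested page, each given as c(top, left, bottom, right).
pages <- c(1, 2)
areas <- list(
  c(50, 40, 300, 550),   # region on page 1 (illustrative coordinates)
  c(100, 40, 400, 550)   # region on page 2 (illustrative coordinates)
)

# The documentation asks for one area per page (or a single shared area).
stopifnot(length(areas) == length(pages) || length(areas) == 1)

# Guarded call: skipped when tabulapdf or the placeholder file is missing.
if (requireNamespace("tabulapdf", quietly = TRUE) && file.exists(pdf_path)) {
  txt <- tabulapdf::extract_text(pdf_path, pages = pages, area = areas)
  print(txt)
}
```

To apply the same rectangle to every requested page, pass a list of length 1 instead, for example `area = list(c(50, 40, 300, 550))`.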

Details

This function converts the contents of a PDF file into a single unstructured character string.
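
Because the result is unstructured, some post-processing is usually needed. The sketch below uses base R only and a made-up stand-in string in place of real `extract_text()` output; it assumes the extracted text separates lines with newline characters.

```r
# Stand-in for a value returned by extract_text(); not real output.
txt <- "First line of text\n  Second line  \n\nThird line\n"

# Split on Unix or Windows line endings; strsplit() returns a list,
# so take its first (and only) element.
lines <- strsplit(txt, "\r?\n")[[1]]

# Trim surrounding whitespace and drop blank lines.
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

print(lines)
```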

Value

If `pages = NULL` (the default), a character vector of length 1; otherwise, a character vector of length `length(pages)`.
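
When specific pages are requested, it can help to label each element of the result. The following sketch uses a stand-in vector instead of real output and assumes (this is not stated above) that the elements come back in the same order as `pages`.

```r
pages <- c(2, 5)

# Stand-in for extract_text(file, pages = pages); not real output.
txt <- c("text from page two\n", "text from page five\n")

# One element per requested page, as described in the Value section.
stopifnot(length(txt) == length(pages))

# Assumption: element order matches the order of `pages`.
names(txt) <- paste0("page_", pages)

# Look up the text of a single page by its label.
page_five <- txt[["page_5"]]
print(page_five)
```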

Author(s)

Thomas J. Leeper <thosjleeper@gmail.com>

Examples

# simple demo file
f <- system.file("examples", "fortytwo.pdf", package = "tabulapdf")

# extract all text
extract_text(f)

# extract all text from page 1 only
extract_text(f, pages = 1)

# extract text from selected area only
extract_text(f, area = list(c(209.4, 140.5, 304.2, 500.8)))

[Package tabulapdf version 1.0.5-3 Index]