extract_text {tabulapdf}    R Documentation

extract_text

Description

Extract text from a PDF file

Usage

extract_text(
  file,
  pages = NULL,
  area = NULL,
  password = NULL,
  encoding = NULL,
  copy = FALSE
)

Arguments

file

A character string specifying the path or URL to a PDF file.

pages

An optional integer vector specifying pages to extract from.

area

An optional list, of length equal to the number of pages specified, where each entry is a four-element numeric vector of coordinates (top, left, bottom, right) delimiting the region to extract from on the corresponding page. As a convenience, a list of length 1 can be used to extract the same area from all (specified) pages.

password

Optionally, a character string containing a user password to access a secured PDF.

encoding

Optionally, a character string specifying an encoding for the text, to be passed to the assignment method of Encoding.

copy

Specifies whether the original local file(s) should be copied to tempdir() before processing. FALSE by default. The argument is ignored if file is a URL.
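
The encoding argument is described above as being passed to the assignment method of Encoding. As a rough illustration of what that base R mechanism does, independent of tabulapdf, the sketch below marks a byte string as Latin-1 and converts it to UTF-8. The sample string is an assumption chosen for illustration, not output from extract_text.

```r
# A string whose bytes are intended to be Latin-1: \xE7 is "c with cedilla".
x <- "fa\xE7ile"

# Declare the encoding; this only sets a mark, the bytes are unchanged.
Encoding(x) <- "latin1"

# Convert to UTF-8 so the text is portable across locales.
y <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(y)
```

Setting the mark does not transcode anything; it only tells R how to interpret the existing bytes, so an incorrect value can produce garbled output rather than an error.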

Details

This function converts the contents of a PDF file into a single unstructured character string.
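
Because the result is one unstructured string, some post-processing is usually needed before the text can be analysed. The following is a minimal sketch in base R, assuming that line breaks in the output are encoded as newline characters; the sample string is a hypothetical stand-in for real output, which depends on the PDF.

```r
# Hypothetical stand-in for the string returned by extract_text().
txt <- "First line of text\nSecond line\n\nThird line after a blank\n"

# Split on newlines (also tolerating a preceding carriage return),
# trim surrounding whitespace, and drop empty lines.
lines <- strsplit(txt, "\r?\n")[[1]]
lines <- trimws(lines)
lines <- lines[nzchar(lines)]
lines
```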

Value

If pages = NULL (the default), a character vector of length 1; otherwise, a character vector of length length(pages).
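
When several pages are requested, it can help to label each element of the result with its page number. The sketch below assumes that the elements are returned in the same order as pages; the txt vector is a hypothetical stand-in for a real call such as extract_text(f, pages = pages).

```r
pages <- c(1, 3)

# Hypothetical stand-in for extract_text(f, pages = pages).
txt <- c("Text of page one", "Text of page three")

# One element per requested page, as described above.
stopifnot(length(txt) == length(pages))

# Name the elements so each page can be looked up directly.
names(txt) <- paste0("page_", pages)
txt[["page_3"]]
```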

Author(s)

Thomas J. Leeper <thosjleeper@gmail.com>

See Also

extract_tables, extract_areas, split_pdf

Examples

# simple demo file
f <- system.file("examples", "fortytwo.pdf", package = "tabulapdf")

# extract all text
extract_text(f)

# extract all text from page 1 only
extract_text(f, pages = 1)

# extract text from selected area only
extract_text(f, area = list(c(209.4, 140.5, 304.2, 500.8)))
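
The area example above uses a list of length 1, which applies to every page. For per-page areas, the shape rules from the Arguments section can be checked before calling extract_text. The helper check_area below is hypothetical (it is not part of tabulapdf), and the second rectangle is made up for illustration.

```r
# Hypothetical helper: verify that `area` has the documented shape,
# i.e. a list of length 1 or length(pages), each entry numeric of length 4.
check_area <- function(area, pages) {
  stopifnot(is.list(area))
  n <- length(area)
  if (n != 1L && n != length(pages)) {
    stop("`area` must have length 1 or length(pages)")
  }
  ok <- vapply(area, function(a) is.numeric(a) && length(a) == 4L, logical(1))
  if (!all(ok)) {
    stop("each `area` entry must be a numeric vector of length 4")
  }
  invisible(TRUE)
}

pages <- c(1, 2)
area <- list(
  c(209.4, 140.5, 304.2, 500.8),  # top, left, bottom, right for page 1
  c(100, 50, 300, 550)            # made-up rectangle for page 2
)
check_area(area, pages)

# With a PDF of at least two pages in `f`, the call would then be:
# extract_text(f, pages = pages, area = area)
```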

[Package tabulapdf version 1.0.5-3 Index]