extract_pdf_text {inlpubs}    R Documentation

Extract Text from a PDF Document

Description

Extract text from any PDF document. Requires that the pdftools and tesseract packages are available.

Usage

extract_pdf_text(
  input,
  output = tempfile(fileext = ".txt"),
  dpi = 600,
  psm = 1
)

Arguments

input

'character' string. File path to PDF document.

output

'character' string. Location to write the text file.

dpi

'integer' number between 100 and 1200. Dots per inch (DPI). The resolution of an image, specifically the number of pixels per inch. For optimal optical character recognition (OCR) accuracy, 600 DPI (the default) is recommended.

psm

'integer' number between 0 and 13. Page Segmentation Mode (PSM). Describes the layout of the text to extract. PSM 1 (the default) automatically segments the page into separate text areas and also detects the orientation and script of the text, which makes it suitable for pages with two columns of text.
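To illustrate how dpi and psm might feed into an OCR pipeline, here is a minimal sketch built on the pdftools and tesseract packages. It is not the source of extract_pdf_text: the function name ocr_pdf_sketch, the temporary image file names, the English language setting, and the collapsing of line breaks to get one line per page are all assumptions made for illustration.

```r
# Hypothetical sketch, NOT the inlpubs implementation.
# Requires the pdftools and tesseract packages when the function is called.
ocr_pdf_sketch <- function(input,
                           output = tempfile(fileext = ".txt"),
                           dpi = 600,
                           psm = 1) {

  # Check arguments against the documented ranges before doing any work.
  stopifnot(
    is.character(input), length(input) == 1,
    is.numeric(dpi), length(dpi) == 1, dpi >= 100, dpi <= 1200,
    is.numeric(psm), length(psm) == 1, psm >= 0, psm <= 13
  )

  # Render every page to a PNG image at the requested resolution.
  n <- pdftools::pdf_info(input)$pages
  files <- file.path(tempdir(), sprintf("page_%03d.png", seq_len(n)))
  files <- pdftools::pdf_convert(
    input,
    format = "png",
    filenames = files,
    dpi = dpi,
    verbose = FALSE
  )
  on.exit(unlink(files), add = TRUE)

  # Pass the page segmentation mode to the OCR engine (English assumed).
  engine <- tesseract::tesseract(
    language = "eng",
    options = list(tessedit_pageseg_mode = psm)
  )

  # Recognize the text on each page image.
  text <- vapply(
    files,
    function(f) tesseract::ocr(f, engine = engine),
    character(1)
  )

  # Collapse line breaks so that each page occupies a single line.
  text <- trimws(gsub("[\r\n]+", " ", text))
  writeLines(text, output)

  output
}
```

Higher dpi values generally give the OCR engine more pixels per character, at the cost of larger images and longer run times. A call such as ocr_pdf_sketch("report.pdf", dpi = 300, psm = 1) would return the path to the text file, where "report.pdf" is a placeholder file name.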

Value

Returns the path to the text file. Each page from the PDF is transcribed as a separate line in the file.
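Because each page is written as one line, the returned file can be read back with readLines, giving one vector element per page. The sketch below uses a stand-in file with made-up text in place of real extract_pdf_text output, so it runs without pdftools or tesseract.

```r
# Stand-in for the file returned by extract_pdf_text():
# three lines, one per "page" (the text is invented for illustration).
path <- tempfile(fileext = ".txt")
writeLines(
  c("Text of page one.", "Text of page two.", "Text of page three."),
  path
)

# Read the file; each element of 'pages' holds the text of one page.
pages <- readLines(path)

# The number of elements equals the number of pages.
n_pages <- length(pages)

# Index the vector to get the text of a single page.
second_page <- pages[2]

# Remove the temporary file.
unlink(path)
```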

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

See Also

add_content function to add texts to the inlpubs-package corpus.

Examples

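## Not run: 
  # Path to the test PDF installed with the package
  input <- system.file("extdata", "test.pdf", package = "inlpubs")

  # Extract the text; returns the path to the output text file
  path <- extract_pdf_text(input)

  # Remove the output file
  unlink(path)

## End(Not run)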

[Package inlpubs version 1.1.3 Index]