R: Extract Text from a PDF Document

extract_pdf_text {inlpubs}

R Documentation

Extract Text from a PDF Document

Extract text from any PDF document. Requires that the pdftools and tesseract packages are available.

extract_pdf_text(
  input,
  output = tempfile(fileext = ".txt"),
  dpi = 600,
  psm = 1
)

`input`	'character' string. File path to PDF document.
`output`	'character' string. Location to write the text file.
`dpi`	'integer' number between 100 and 1200. Image resolution in dots per inch (DPI), that is, the number of pixels per inch. For optimal optical character recognition (OCR) accuracy, 600 DPI (the default) is recommended.
`psm`	'integer' number between 0 and 13. Page Segmentation Mode (PSM), which describes the layout of the text to extract. PSM 1 (the default) automatically segments the page into separate text areas and also detects the orientation and script of the text; use it when processing two columns of text.
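The documented ranges for `dpi` and `psm` can be checked before starting a long OCR run. The following is a minimal sketch in base R; `check_ocr_args` is a hypothetical helper and not part of inlpubs, and requiring whole numbers is an assumption based on the 'integer' type given above.

```r
# Hypothetical helper (not part of inlpubs): checks values against the
# ranges documented for the 'dpi' and 'psm' arguments.
check_ocr_args <- function(dpi = 600, psm = 1) {
  # TRUE only for a single, non-missing, whole-valued number
  is_whole <- function(x) {
    is.numeric(x) && length(x) == 1L && !is.na(x) && x == round(x)
  }
  if (!is_whole(dpi) || dpi < 100 || dpi > 1200) {
    stop("'dpi' must be a whole number between 100 and 1200", call. = FALSE)
  }
  if (!is_whole(psm) || psm < 0 || psm > 13) {
    stop("'psm' must be a whole number between 0 and 13", call. = FALSE)
  }
  list(dpi = as.integer(dpi), psm = as.integer(psm))
}

# Values chosen only for illustration
ocr_args <- check_ocr_args(dpi = 300, psm = 6)
```

The returned list could then be passed on, for example as `extract_pdf_text(input, dpi = ocr_args$dpi, psm = ocr_args$psm)`.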

Returns the path to the text file. Each page from the PDF is transcribed as a separate line in the file.
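Because each page is written as one line, the returned file can be read back into a character vector with one element per page. The sketch below uses a stand-in file with made-up contents so that it runs without the package installed; in practice `path` would be the value returned by `extract_pdf_text()`.

```r
# Stand-in for a file written by extract_pdf_text(): one line per page.
# The two lines of text are invented for illustration.
path <- tempfile(fileext = ".txt")
writeLines(c("Text of page one.", "Text of page two."), path)

# Read the file back; element i holds the text of page i.
pages <- readLines(path, warn = FALSE)
n_pages <- length(pages)  # number of pages transcribed
n_chars <- nchar(pages)   # characters recovered per page

# Remove the temporary file
unlink(path)
```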

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

## Not run: 
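  # Path to the example PDF shipped with the package
  input <- system.file("extdata", "test.pdf", package = "inlpubs")
  # Extract the text to a temporary file (the default output)
  path <- extract_pdf_text(input)

  # Remove the temporary text file
  unlink(path)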

## End(Not run)

[Package inlpubs version 1.1.3 Index]