extract_pdf_text {inlpubs} | R Documentation |
Extract Text from a PDF Document
Description
Extract text from any PDF document. Requires that the pdftools and tesseract packages are available.
Usage
extract_pdf_text(
input,
output = tempfile(fileext = ".txt"),
dpi = 600,
psm = 1
)
Arguments
input |
'character' string. File path to PDF document. |
output |
'character' string. Location to write the text file. |
dpi |
'integer' number between 100 and 1200. Dots per inch (DPI). The resolution of an image, specifically the number of pixels per inch. For optimal optical character recognition (OCR) accuracy, 600 DPI (the default) is recommended. |
psm |
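'integer' number. Page segmentation mode (PSM) for the Tesseract OCR engine. The default, 1, is Tesseract's automatic page segmentation with orientation and script detection (OSD). |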
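The following is a minimal sketch of how the dpi and psm arguments could map onto the pdftools and tesseract packages. It is not the implementation used by inlpubs. The function name ocr_pdf_sketch, the English language setting, and the collapsing of line breaks within a page are assumptions made for illustration.

```r
# Hypothetical helper (not part of inlpubs): render each PDF page to an image,
# run OCR on it, and write one line of text per page.
ocr_pdf_sketch <- function(input,
                           output = tempfile(fileext = ".txt"),
                           dpi = 600,
                           psm = 1) {
  stopifnot(file.exists(input), dpi >= 100, dpi <= 1200)

  # Render every page to a temporary PNG at the requested resolution.
  n <- pdftools::pdf_info(input)$pages
  images <- file.path(tempdir(), sprintf("page_%04d.png", seq_len(n)))
  pdftools::pdf_convert(input, format = "png", filenames = images,
                        dpi = dpi, verbose = FALSE)
  on.exit(unlink(images), add = TRUE)

  # Configure the OCR engine; the language "eng" is an assumption.
  engine <- tesseract::tesseract(
    language = "eng",
    options = list(tessedit_pageseg_mode = psm)
  )

  # Recognize each page, then collapse line breaks so a page fits on one line.
  pages <- vapply(images, function(img) tesseract::ocr(img, engine = engine),
                  character(1), USE.NAMES = FALSE)
  pages <- trimws(gsub("[\r\n]+", " ", pages))

  writeLines(pages, output)
  output
}
```

Defining the function does not require pdftools or tesseract to be installed; they are needed only when it is called. Higher dpi values produce larger images and generally slower recognition.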
Value
Returns the path to the text file. Each page from the PDF is transcribed as a separate line in the file.
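Because each page occupies one line, the file can be read back with readLines so that element i of the result holds the text of page i. The sketch below uses only base R and simulates a three-page result with made-up text rather than calling extract_pdf_text.

```r
# Simulate an output file with one line per page (the page text is made up).
path <- tempfile(fileext = ".txt")
writeLines(c("Text of page one.", "Text of page two.", "Text of page three."), path)

# Read the file back; each element corresponds to one page.
pages <- readLines(path, warn = FALSE)
n_pages <- length(pages)   # number of pages transcribed
page_two <- pages[2]       # text of the second page

# Clean up the temporary file.
unlink(path)
```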
Author(s)
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
See Also
add_content
function to add texts to the inlpubs-package corpus.
Examples
## Not run:
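# Locate the example PDF shipped with the package
input <- system.file("extdata", "test.pdf", package = "inlpubs")
# Extract the text and write it to a temporary file
path <- extract_pdf_text(input)
# Remove the temporary file
unlink(path)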
## End(Not run)