getPDF {inpdfr} | R Documentation |
Extract text from PDF files and return a word-occurrence data.frame.
Description
getPDF returns a word-occurrence data.frame from PDF files. It requires XPDF (http://www.foolabs.com/xpdf/download.html) to run, and uses the parallel package to perform parallel computation.
Usage
getPDF(
  myPDFs,
  minword = 1,
  maxword = 20,
  minFreqWord = 1,
  pathToPdftotext = ""
)
Arguments
myPDFs | A character vector containing PDF file names. |
minword | An integer specifying the minimum number of letters per word in the returned data.frame. |
maxword | An integer specifying the maximum number of letters per word in the returned data.frame. |
minFreqWord | An integer specifying the minimum word frequency in the returned data.frame. |
pathToPdftotext | A character string containing an alternative path to the XPDF pdftotext program. |
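The effect of the three filtering arguments can be illustrated on a toy word count. This is only a sketch in base R, not the package's implementation: the helper filterWords, the column names word and freq, and the assumption that the length and frequency bounds are inclusive are all hypothetical.

```r
# Toy text standing in for the content extracted from a PDF (illustrative only).
txt <- "the cat sat on the mat and the cat slept"

# Split into lower-case words and count occurrences.
words <- strsplit(tolower(txt), "[^a-z]+")[[1]]
words <- words[nzchar(words)]
counts <- as.data.frame(table(words), stringsAsFactors = FALSE)
names(counts) <- c("word", "freq")

# Hypothetical helper mirroring minword, maxword and minFreqWord.
# Bounds are assumed to be inclusive; getPDF may behave differently.
filterWords <- function(d, minword = 1, maxword = 20, minFreqWord = 1) {
  len <- nchar(d$word)
  d[len >= minword & len <= maxword & d$freq >= minFreqWord, , drop = FALSE]
}

# Keep words of at least 3 letters that occur at least twice.
filtered <- filterWords(counts, minword = 3, maxword = 20, minFreqWord = 2)
print(filtered)
```

With the default values (1, 20, 1), every word of this toy text is kept.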
Details
getPDF uses the XPDF pdftotext program to extract the content of PDF files into a TXT file. If pdftotext is not in the PATH, an alternative is to provide the full path of the program in the pathToPdftotext parameter.
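One way to decide whether pathToPdftotext is needed is to look for pdftotext on the PATH with base R's Sys.which. The sketch below is an assumption-laden example: the fallback location /opt/xpdf/bin/pdftotext is hypothetical and must be replaced with the actual install path, and the getPDF call is guarded because it needs inpdfr, XPDF and a real PDF file.

```r
# Look up pdftotext on the PATH; Sys.which() returns "" when it is not found.
found <- unname(Sys.which("pdftotext"))

# Hypothetical fallback location; replace with the actual install path.
fallback <- "/opt/xpdf/bin/pdftotext"

# Keep the default (empty string) when pdftotext is on the PATH,
# otherwise pass the full path of the program explicitly.
onPath <- !is.na(found) && nzchar(found)
pathArg <- if (onPath) "" else fallback

# Only attempt the extraction when the package and the file are available.
if (requireNamespace("inpdfr", quietly = TRUE) && file.exists("mypdf.pdf")) {
  res <- inpdfr::getPDF(myPDFs = "mypdf.pdf", pathToPdftotext = pathArg)
}
```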
Value
A list of lists, each with a word-occurrence data.frame and a file name.
Examples
## Not run:
getPDF(myPDFs = "mypdf.pdf")
## End(Not run)