| getPDF {inpdfr} | R Documentation |
Extract text from PDF files and return a word-occurrence data.frame.
Description
getPDF extracts the text of PDF files and returns a word-occurrence data.frame.
It requires XPDF (http://www.foolabs.com/xpdf/download.html)
and uses the parallel package for parallel computation.
Usage
getPDF(
myPDFs,
minword = 1,
maxword = 20,
minFreqWord = 1,
pathToPdftotext = ""
)
Arguments
myPDFs |
A character vector containing PDF file names. |
minword |
An integer specifying the minimum number of letters per word in the returned data.frame. |
maxword |
An integer specifying the maximum number of letters per word in the returned data.frame. |
minFreqWord |
An integer specifying the minimum word frequency in the returned data.frame. |
pathToPdftotext |
A character string giving an alternative path to the XPDF pdftotext program.
|
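The effect of the three filtering arguments can be illustrated on a toy table. This is only a sketch in base R, not the package's implementation: the column names word and freq are assumptions made for illustration, and the bounds are assumed to be inclusive.

```r
# Toy word-occurrence table; the column names are hypothetical.
words <- data.frame(
  word = c("a", "pdf", "extraction", "text"),
  freq = c(12L, 5L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Same names as the getPDF arguments, with example values.
minword <- 2L
maxword <- 8L
minFreqWord <- 2L

# Keep words whose length lies in [minword, maxword] (assumed inclusive)
# and whose frequency is at least minFreqWord.
n_letters <- nchar(words$word)
keep <- n_letters >= minword & n_letters <= maxword & words$freq >= minFreqWord
filtered <- words[keep, ]
print(filtered)
```

With these values, "a" is dropped for being too short and "extraction" for being too long, which leaves "pdf" and "text".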
Details
getPDF uses the XPDF pdftotext program to extract the
content of PDF files into a TXT file. If pdftotext is not in the
PATH, provide the full path to the program via
the pathToPdftotext argument.
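One way to decide whether pathToPdftotext is needed is to look up pdftotext on the PATH first. The sketch below uses base R's Sys.which; the fallback location is a hypothetical install path, not something the package provides, and the getPDF call is left commented out because it needs XPDF and a real PDF file.

```r
# Sys.which() returns "" when the executable is not found on the PATH.
pdftotext_on_path <- unname(Sys.which("pdftotext"))

# Hypothetical install location, used only if pdftotext is not on the PATH.
fallback_path <- "/opt/xpdf/bin/pdftotext"

# "" is the default value of pathToPdftotext, i.e. rely on the PATH.
path_arg <- if (nzchar(pdftotext_on_path)) "" else fallback_path

## Not run:
# res <- getPDF(myPDFs = "mypdf.pdf", pathToPdftotext = path_arg)
## End(Not run)
```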
Value
A list of lists, each with a word-occurrence data.frame and a file name.
Examples
## Not run:
getPDF(myPDFs = "mypdf.pdf")
## End(Not run)