PDE_pdfs2txt_searchandfilter {PDE} | R Documentation |
Extracting sentences from a PDF (Portable Document Format) file
Description
PDE_pdfs2txt_searchandfilter extracts sentences from a single PDF file
according to search and filter words and writes the output to the
corresponding folder.
Usage
PDE_pdfs2txt_searchandfilter(
pdfs,
out = ".",
filter.words = "",
regex.fw = TRUE,
ignore.case.fw = FALSE,
filter.word.times = "0.2%",
search.words,
search.word.categories = NULL,
regex.sw = TRUE,
ignore.case.sw = FALSE,
eval.abbrevs = TRUE,
out.table.format = ".csv (WINDOWS-1252)",
context = 0,
write.txt.doc.file = TRUE,
delete = TRUE,
cpy_mv = "nocpymv",
verbose = TRUE
)
Arguments
pdfs |
String. A list of paths to the PDF files to be analyzed. |
out |
String. Directory in which the analysis results are saved. Default: "."
|
filter.words |
List of strings. The list of filter words. If not
|
regex.fw |
Logical. If TRUE, filter words are interpreted as regular expressions
(see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf).
Default: TRUE |
ignore.case.fw |
Logical. Should capitalization be ignored when matching the filter
words? If FALSE, matching is case-sensitive. Default: FALSE |
filter.word.times |
Numeric or string. Can either be expressed as absolute number or percentage
of the total number of words (by adding the "
|
search.words |
List of strings. List of search words. |
search.word.categories |
List of strings. List of categories with the
same length as the list of search words. Accordingly, each search word can be
assigned to a category, of which the word counts will be summarized in the
|
regex.sw |
Logical. If TRUE, search words are interpreted as regular expressions
(see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf).
Default: TRUE |
ignore.case.sw |
Logical. Should capitalization be ignored when matching the search
words? If FALSE, matching is case-sensitive. Default: FALSE |
eval.abbrevs |
Logical. Should abbreviations for the search words be
automatically detected and then replaced with the search word + "$*"?
Default: TRUE |
out.table.format |
String. Output file format. Either comma separated
file |
context |
Numeric. Number of sentences extracted before and after the
sentence with the detected search word. If |
write.txt.doc.file |
Logical. If |
delete |
Logical. If |
cpy_mv |
String. Either "nocpymv", "cpy", or "mv". If filter words are used in the
analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the
/pdf/ subfolder of the output folder. Default: "nocpymv" |
verbose |
Logical. Indicates whether messages will be printed in the console. Default: TRUE |
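The two accepted forms of filter.word.times (an absolute number, or a percentage of the total word count given as a string) can be illustrated with a small base R sketch. The helper filter_threshold() is hypothetical and not part of PDE, and rounding to the nearest whole word is an assumption; the package may resolve percentages differently.

```r
# Hypothetical helper (not part of PDE): convert a filter.word.times-style
# value into an absolute minimum number of filter-word hits.
filter_threshold <- function(filter.word.times, total.words) {
  if (is.character(filter.word.times) && grepl("%$", filter.word.times)) {
    # Percentage form, e.g. "0.2%": strip the sign and scale by the word count
    pct <- as.numeric(sub("%$", "", filter.word.times))
    # Assumption: round to the nearest whole number of words
    return(round(total.words * pct / 100))
  }
  # Absolute form: use the number as given
  as.numeric(filter.word.times)
}

filter_threshold("0.2%", 5000)  # percentage of a 5000-word document
filter_threshold(3, 5000)       # absolute count, independent of document length
```

Under these assumptions, a percentage threshold scales with document length, whereas an absolute threshold demands the same number of hits from a short abstract and a long report.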
See Also
Examples
## Running a simple analysis with filter and search words to extract sentences
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2txt_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-0/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE)
}
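## The following base R sketch (it does not call PDE) illustrates how the
## regex and ignore.case settings used above differ. It assumes that matching
## behaves like grepl(); the example sentences are made up for illustration.

```r
sentences <- c("Methotrexate was given weekly.",
               "Patients on methotrexate improved.",
               "The Cohort was followed for a year.")

# Regular expression, case-sensitive
# (analogous to regex.sw = TRUE, ignore.case.sw = FALSE)
hits_regex <- grepl("(M|m)ethotrexate", sentences)

# The same hits with a plain pattern and case-insensitive matching
hits_nocase <- grepl("methotrexate", sentences, ignore.case = TRUE)

# Literal string, case-insensitive
# (analogous to regex.fw = FALSE, ignore.case.fw = TRUE).
# grepl() ignores ignore.case when fixed = TRUE, so the text is lowercased first.
hits_fixed <- grepl("cohort", tolower(sentences), fixed = TRUE)

hits_regex
hits_nocase
hits_fixed
```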
## Running an advanced analysis with filter and search words to
## extract sentences and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2txt_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-1/"),
    context = 1,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    filter.word.times = "0.2%",
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.txt.doc.file = TRUE,
    cpy_mv = "nocpymv",
    delete = TRUE)
}
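## The effect of the context argument (sentences extracted before and after
## each hit) can be sketched in base R. with_context() is a hypothetical
## helper, not a PDE function, and the sentences are made up; the package's
## own sentence splitting and output layout may differ.

```r
# Hypothetical helper (not part of PDE): return each matching sentence
# together with `context` sentences before and after it.
with_context <- function(sentences, pattern, context = 0) {
  hits <- grep(pattern, sentences)
  lapply(hits, function(i) {
    from <- max(1, i - context)                 # do not run past the start
    to <- min(length(sentences), i + context)   # do not run past the end
    paste(sentences[from:to], collapse = " ")
  })
}

sentences <- c("Patients were enrolled in 2015.",
               "Methotrexate was given weekly.",
               "Doses were adjusted monthly.")

ctx0 <- with_context(sentences, "(M|m)ethotrexate", context = 0)
ctx1 <- with_context(sentences, "(M|m)ethotrexate", context = 1)
ctx0
ctx1
```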