keyword_directory {pdfsearch} | R Documentation |
Wrapper for keyword search function
Description
Applies the keyword_search function to every PDF file in a directory, optionally searching subdirectories as well.
Usage
keyword_directory(directory, keyword, split_pdf = FALSE,
surround_lines = FALSE, ignore_case = FALSE, remove_hyphen = TRUE,
token_results = TRUE, convert_sentence = TRUE,
split_pattern = "\\p{WHITE_SPACE}{3,}", full_names = TRUE,
file_pattern = ".pdf", recursive = FALSE, max_search = NULL, ...)
Arguments
directory |
The directory to search for PDF files. |
keyword |
The keyword(s) to be used to search in the text. Multiple keywords can be specified with a character vector. |
split_pdf |
TRUE/FALSE indicating whether to split the pdf using white space. This would be most useful with multicolumn pdf files. The split_pdf function attempts to recreate the column layout of the text into a single column starting with the left column and proceeding to the right. |
surround_lines |
numeric/FALSE indicating whether the output should include the surrounding lines of text in addition to the matching line. Default is FALSE; otherwise, supply a number giving how many additional surrounding lines to extract. |
ignore_case |
TRUE/FALSE/vector of TRUE/FALSE, indicating whether the case of the keyword should be ignored. Default is FALSE, meaning that the keyword is matched case-sensitively. If a vector, it must be the same length as the keyword vector. |
remove_hyphen |
TRUE/FALSE indicating whether hyphenated words should be adjusted to combine onto a single line. Default is TRUE. |
token_results |
TRUE/FALSE indicating whether the results text returned
should be split into tokens. See the tokenizers package and
convert_tokens for more details. Default is TRUE. |
convert_sentence |
TRUE/FALSE indicating whether the individual lines of the PDF file should be collapsed into a single large paragraph before keyword searching. Default is TRUE. |
split_pattern |
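Regular expression pattern used to split multicolumn
PDF files when split_pdf = TRUE. The default, "\\p{WHITE_SPACE}{3,}", splits on three or more consecutive white-space characters. |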
full_names |
TRUE/FALSE indicating if the full file path should be used.
Default is TRUE, see list.files for more details. |
file_pattern |
An optional regular expression to select specific file
names. Only files that match the regular expression will be searched.
Defaults to all pdfs, i.e. ".pdf". |
recursive |
TRUE/FALSE indicating if subdirectories should be searched
as well.
Default is FALSE, see list.files for more details. |
max_search |
An optional number giving the maximum number of PDFs to search. Only the first max_search files found are searched. |
... |
token_function to pass to the convert_tokens function. |
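The interaction of file_pattern, recursive, full_names and max_search can be sketched with base R alone. This is a minimal illustration, assuming these arguments are handed to list.files and that max_search simply keeps the first n files; the package internals are not shown on this page, and the directory and file names below are hypothetical. It also shows that, because file_pattern is a regular expression, the default ".pdf" can match more than files ending in .pdf.

```r
# Hypothetical sandbox: a temporary directory with a few empty files.
root <- file.path(tempdir(), "pdf_demo")
dir.create(file.path(root, "sub"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(root, c("a.pdf", "b.pdf", "notes_pdf.txt", "sub/c.pdf"))))

# The "." in ".pdf" matches any character, so "notes_pdf.txt" is
# selected together with the real PDF files.
loose <- list.files(root, pattern = ".pdf", full.names = FALSE, recursive = FALSE)

# An anchored pattern keeps only names that end in ".pdf".
strict <- list.files(root, pattern = "\\.pdf$", full.names = FALSE, recursive = FALSE)

# recursive = TRUE also descends into "sub".
deep <- list.files(root, pattern = "\\.pdf$", full.names = FALSE, recursive = TRUE)

# Assumed max_search behaviour: keep only the first n files found.
max_search <- 2
first_n <- head(deep, max_search)

print(loose)
print(strict)
print(deep)
print(first_n)
```

If stray matches are a concern, passing file_pattern = "\\.pdf$" is a stricter alternative to the default.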
Value
A tibble data frame that contains the keyword, the location of the match, the matching line of text, and optionally the tokens associated with that line. The results for all input PDF files are combined (row bound) into a single tibble.
Examples
# find directory
directory <- system.file('pdf', package = 'pdfsearch')
# do search over two files
keyword_directory(directory,
                  keyword = c('repeated measures', 'measurement error'),
                  surround_lines = 1, full_names = TRUE)
# can also split pdfs
keyword_directory(directory,
                  keyword = c('repeated measures', 'measurement error'),
                  split_pdf = TRUE, remove_hyphen = FALSE,
                  surround_lines = 1, full_names = TRUE)
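A further sketch, using only arguments documented above: a per-keyword ignore_case vector combined with max_search to limit the run to a single file. It assumes the pdfsearch package is installed and that the returned tibble has a column named keyword, as the Value section suggests; the exact column names are not listed on this page, so names() is printed first to check.

```r
library(pdfsearch)

# Example PDFs shipped with the package
directory <- system.file('pdf', package = 'pdfsearch')

# One ignore_case value per keyword: the first keyword is matched
# case-insensitively, the second case-sensitively. max_search = 1
# restricts the search to the first PDF found.
results <- keyword_directory(directory,
                             keyword = c('repeated measures', 'measurement error'),
                             ignore_case = c(TRUE, FALSE),
                             max_search = 1)

# Inspect the available columns before relying on any of them.
names(results)

# Count matches per keyword (assumes a column called "keyword").
table(results$keyword)
```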