convert_tokens {pdfsearch} | R Documentation
Tokenize words from text or a PDF file.
Description
Converts the text of a PDF file, or text supplied directly, into tokens using the tokenizers package. By default the text is tokenized into words.
Usage
convert_tokens(x, path = FALSE, split_pdf = FALSE,
remove_hyphen = TRUE, token_function = NULL)
Arguments
x
The text of the PDF file. This can be supplied directly, or the pdftools package can be used to read the PDF from a file path. To use pdftools, the path argument must be set to TRUE.
path
TRUE/FALSE indicating whether x is a file path to a PDF to be converted to text. If TRUE, the pdftools package is used for the conversion. Default is FALSE.
split_pdf
TRUE/FALSE indicating whether to split the PDF using white space. This is most useful with multi-column PDF files. The split_pdf function attempts to recreate the column layout of the text as a single column, starting with the left column and proceeding to the right.
remove_hyphen
TRUE/FALSE indicating whether words hyphenated across line breaks should be rejoined into a single word. Default is TRUE.
token_function
A function from the tokenizers package. Default is the tokenize_words function; see the sketch after this list for an alternative.
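As an illustration of token_function, the default word tokenizer can be swapped for another function from the tokenizers package. A minimal sketch, assuming a sentence tokenizer such as tokenizers::tokenize_sentences follows the same calling convention as the default:

library(pdfsearch)

file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
## Swap the default tokenize_words for a sentence tokenizer so each
## element of the result is a sentence rather than a single word.
convert_tokens(file, path = TRUE,
               token_function = tokenizers::tokenize_sentences)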
Value
A list of character vectors containing the tokens. More detail can be found in the documentation of the tokenizers package.
Examples
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
convert_tokens(file, path = TRUE)
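The function also accepts text directly, with no PDF conversion, and can reflow multi-column PDFs before tokenizing. A minimal sketch; the input string below is invented for illustration:

## Tokenize text supplied directly (path = FALSE is the default).
convert_tokens('This is a sentence about tokenization.')

## For multi-column PDFs, split_pdf = TRUE attempts to recreate the
## column layout as a single column before tokenizing.
convert_tokens(file, path = TRUE, split_pdf = TRUE)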