convert_tokens {pdfsearch} | R Documentation
Tokenize words from text or a PDF file.
Description
Converts the text of a PDF file, or text supplied directly, into tokens using the tokenizers package. By default the text is tokenized into words.
Usage
convert_tokens(x, path = FALSE, split_pdf = FALSE,
remove_hyphen = TRUE, token_function = NULL)
Arguments
x
The text of the PDF file. This can be supplied directly, or the pdftools package can be used to read the PDF from a file path. To use pdftools, the path argument must be set to TRUE.
path
TRUE/FALSE indicating whether x is a file path to a PDF to be converted to text. If TRUE, the pdftools package is used for the conversion. Default is FALSE.
split_pdf
TRUE/FALSE indicating whether to split the PDF using white space. This is most useful with multi-column PDF files. The split_pdf function attempts to recreate the column layout of the text as a single column, starting with the left column and proceeding to the right.
remove_hyphen
TRUE/FALSE indicating whether words hyphenated across line breaks should be rejoined into a single word. Default is TRUE.
token_function
A function from the tokenizers package. Default is the tokenize_words function; see the sketch after this list for an alternative.
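As an illustration of token_function, the default word tokenizer can be swapped for another function from the tokenizers package. A minimal sketch, assuming a sentence tokenizer such as tokenizers::tokenize_sentences follows the same calling convention as the default:

library(pdfsearch)

file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
## Swap the default tokenize_words for a sentence tokenizer so each
## element of the result is a sentence rather than a single word.
convert_tokens(file, path = TRUE,
               token_function = tokenizers::tokenize_sentences)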
Value
A list of character vectors containing the tokens. More detail can be found in the documentation of the tokenizers package.
Examples
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
convert_tokens(file, path = TRUE)
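The function also accepts text directly, with no PDF conversion, and can reflow multi-column PDFs before tokenizing. A minimal sketch; the input string below is invented for illustration:

## Tokenize text supplied directly (path = FALSE is the default).
convert_tokens('This is a sentence about tokenization.')

## For multi-column PDFs, split_pdf = TRUE attempts to recreate the
## column layout as a single column before tokenizing.
convert_tokens(file, path = TRUE, split_pdf = TRUE)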