tidystopwords {tidystopwords} | R Documentation
Customisable Lists of Stop-Words in 110 Languages
Description
The idea behind this package is to give the user control over the stop-word selection.
Details
The core generate_stoplist function relies on multilingual_stopwords, a large
data frame derived from Version 2.1 of the Universal Dependencies Treebanks
(see References). We have included all languages whose corpora total more than
10,000 tokens, which is large enough to cover all common closed-class words,
such as prepositions, conjunctions, and auxiliary verbs.
The data is encoded in UTF-8.
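A minimal sketch of how one might start exploring the package. This page does not document the arguments of generate_stoplist, so the sketch inspects the function signature and the bundled data sets instead of assuming them. It assumes the package is installed from CRAN; the variable name datasets is only illustrative.

```r
# Assumes tidystopwords is installed, e.g. via install.packages("tidystopwords").
library(tidystopwords)

# Print the formal arguments of the core function rather than guessing them;
# argument names may differ between package versions.
print(args(generate_stoplist))

# List the data sets shipped with the package to confirm the exact name
# of the underlying data frame before working with it.
datasets <- data(package = "tidystopwords")$results[, c("Item", "Title"), drop = FALSE]
print(datasets)
```

Once the argument names are known, ?generate_stoplist gives the authoritative description of how to select languages and output formats.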
Author(s)
Silvie Cinková, Maciej Eder
References
The data set is based on the official release of Version 2.1 of Universal Dependencies.
https://universaldependencies.org
Nivre, Joakim; Agić, Željko; Ahrenberg, Lars; et al., 2017, Universal Dependencies 2.1, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-2515.
See Also
list_supported_languages, multilingual_stoplist