get_stoplist {text2map} R Documentation

Gets stoplist from precompiled lists

Description

Provides access to 8 precompiled stoplists, including the most commonly used stoplist from the Snowball stemming package ("snowball2014"), text2map's tiny stoplist ("tiny2020"), and a few historically important stoplists. This aims to be a transparent and well-documented collection of stoplists. Only English-language stoplists are included at the moment.

Usage

get_stoplist(source = "tiny2020", language = "en", tidy = FALSE)

Arguments

source

Character indicating source, default = "tiny2020"

language

Character (default = "en") indicating the language of the stopwords by ISO 639-1 code; currently only English is supported.

tidy

Logical (default = FALSE); if TRUE, returns a tibble

Details

There is no such thing as a stopword! But there are tons of precompiled lists of words that someone thinks we should remove from our texts (see, for example, https://github.com/igorbrigadir/stopwords). One of the first stoplists is from C.J. van Rijsbergen's "Information retrieval: theory and practice" (1979) and includes 250 words. text2map's very own stoplist, tiny2020, is a lean 34 words.

Below are stoplists available with get_stoplist:

The Snowball (2014) stoplist is likely the most commonly used; it is the default in the stopwords package, which is used by the quanteda, tidytext, and tokenizers packages. It is followed closely by the SMART (1993) stoplist, the default in the tm package. The word counts for SMART (1993) and ONIX (2000) are slightly different from those reported elsewhere because of duplicate words.
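As a minimal sketch of how a stoplist might be applied, the following filters a small token vector against tiny2020. The helper remove_stopwords and the sample tokens are hypothetical and not part of text2map. The sketch also assumes the stoplist entries are lowercase, and it skips the call to get_stoplist if the package is not installed.

```r
# Hypothetical helper (not part of text2map): drop tokens found in a stoplist.
# Assumes the stoplist entries are lowercase, so tokens are lowercased
# before matching; the original casing of kept tokens is preserved.
remove_stopwords <- function(tokens, stoplist) {
  tokens[!(tolower(tokens) %in% stoplist)]
}

# Made-up example tokens.
tokens <- c("The", "cat", "sat", "on", "the", "mat")

# Only call into text2map when it is available.
if (requireNamespace("text2map", quietly = TRUE)) {
  stoplist <- text2map::get_stoplist(source = "tiny2020", language = "en")
  print(remove_stopwords(tokens, stoplist))
}
```

Swapping source = "tiny2020" for "snowball2014" should remove more tokens, since that list is considerably longer.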

Value

Character vector of words to be stopped; if tidy = TRUE, a tibble is returned instead
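The two return types can be compared with a short sketch like the one below. It assumes text2map is installed (the block is skipped otherwise) and uses str() to inspect the results rather than assuming any particular column names in the tibble.

```r
# Skip everything if text2map is not installed.
if (requireNamespace("text2map", quietly = TRUE)) {
  # Default: a plain character vector.
  words <- text2map::get_stoplist(source = "snowball2014")

  # tidy = TRUE: the same stoplist as a tibble.
  tidy_words <- text2map::get_stoplist(source = "snowball2014", tidy = TRUE)

  # Inspect the structure of each result.
  str(words)
  str(tidy_words)
}
```

The character vector is convenient for filtering with %in%, while the tibble form may fit more naturally into a data-frame workflow.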

Author(s)

Dustin Stoltz


[Package text2map version 0.2.0 Index]