skipgram_identify {eHDPrep}    R Documentation

Identify Neighbouring Words (Skipgrams) in a free-text vector

Description

Identifies words which appear near each other in the free-text vector (x). Such groups of neighbouring words are referred to as "skipgrams". Supported languages for stop words and stemming are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, and swedish.

Usage

skipgram_identify(
  x,
  ids,
  num_of_words = 2,
  max_interrupt_words = 2,
  words_to_rm = NULL,
  lan = "english"
)

Arguments

x

Free-text character vector to query.

ids

Character vector containing an ID for each element of x.

num_of_words

Number of words to consider for each returned skipgram. Default = 2.

max_interrupt_words

Maximum number of intervening words allowed between the words of a skipgram. Default = 2.

words_to_rm

Character vector of words which should not be considered. Default = NULL.

lan

Language of x. Default = "english".
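
To make the role of max_interrupt_words concrete, here is a minimal base R sketch of two-word skipgrams. The helper skip_bigrams() is hypothetical and is not part of eHDPrep. It also omits the stop word removal and stemming that the Description mentions, and the "_" separator is an assumption made for readability.

```r
# Hypothetical helper for illustration only; not part of eHDPrep.
# Returns every ordered pair of words separated by at most `max_skip`
# intervening words. No stop word removal or stemming is applied here.
skip_bigrams <- function(text, max_skip = 2) {
  # Lower-case and split on anything that is not a letter
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  pairs <- character(0)
  if (n < 2) return(pairs)
  for (i in seq_len(n - 1)) {
    # Furthest partner word allowed for word i
    last <- min(n, i + 1 + max_skip)
    for (j in (i + 1):last) {
      pairs <- c(pairs, paste(words[i], words[j], sep = "_"))
    }
  }
  pairs
}

# With no interrupting words allowed, only adjacent pairs are returned
skip_bigrams("patient has severe chest pain", max_skip = 0)

# Allowing one interrupting word adds pairs such as "patient_severe"
skip_bigrams("patient has severe chest pain", max_skip = 1)
```

Raising max_skip only ever adds pairs, which is why larger values of max_interrupt_words can be expected to yield more candidate skipgrams.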

Value

Tibble containing skipgrams as variables and patient values as rows.

References

Guthrie D, Allison B, Liu W, Guthrie L, Wilks Y (2006). “A Closer Look at Skip-gram Modelling.” In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). European Language Resources Association (ELRA).

Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, Matsuo A (2018). “quanteda: An R package for the quantitative analysis of textual data.” Journal of Open Source Software, 3(30), 774. doi:10.21105/joss.00774 <https://doi.org/10.21105/joss.00774>, <https://quanteda.io>.

Feinerer I, Hornik K (2020). tm: Text Mining Package. R package version 0.7-8, <https://CRAN.R-project.org/package=tm>.

Feinerer I, Hornik K, Meyer D (2008). “Text Mining Infrastructure in R.” Journal of Statistical Software, 25(5), 1-54. <https://www.jstatsoft.org/v25/i05/>.

See Also

Principal underlying function: tokens_ngrams()

Other free text functions: extract_freetext(), skipgram_append(), skipgram_freq()

Examples

data(example_data)
skipgram_identify(x = example_data$free_text,
                  ids = example_data$patient_id,
                  max_interrupt_words = 5)
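
The remaining arguments from the Usage section can be combined in the same way. The sketch below is a hedged variation on the example above: the word passed to words_to_rm is purely illustrative and may not occur in example_data, and the call is guarded so that the code still runs when eHDPrep is not installed.

```r
# Sketch only: runs the call when eHDPrep is installed, otherwise yields NULL.
res <- if (requireNamespace("eHDPrep", quietly = TRUE)) {
  # Load the packaged example data into the current environment
  data("example_data", package = "eHDPrep", envir = environment())
  eHDPrep::skipgram_identify(
    x = example_data$free_text,
    ids = example_data$patient_id,
    max_interrupt_words = 1,      # allow at most one interrupting word
    words_to_rm = c("patient"),   # illustrative word to ignore (assumption)
    lan = "english"
  )
} else {
  NULL
}
```

According to the Value section, res should be a tibble with one variable per skipgram when the package is available.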

[Package eHDPrep version 1.3.3 Index]