construct_ngrams {biogram}

R Documentation

Construct and filter n-grams

Description

Builds and selects important n-grams stepwise.

Usage

construct_ngrams(
  target,
  seq,
  u,
  n_max,
  conf_level = 0.95,
  gap = TRUE,
  use_heuristics = TRUE
)

Arguments

`target`	`integer` vector with target information (e.g. class labels).
`seq`	a vector or matrix describing sequence(s).
`u`	`integer`, `numeric` or `character` vector of all possible unigrams.
`n_max`	maximum size of the constructed n-grams.
`conf_level`	confidence level.
`gap`	`logical`, if `TRUE` gaps are used. See Details.
`use_heuristics`	if `FALSE`, all n-grams are tested, which may slow down computations significantly.

Details

construct_ngrams starts by extracting unigrams from the sequences, pasting them together in all combinations and choosing the significant features among them (with p-value below conf_level). The chosen n-grams are then extended, step by step, up to the size specified by n_max by pasting unigrams at both ends.
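
The extension step described above can be illustrated with a minimal base R sketch. It is not the package's implementation: extend_ngrams is a hypothetical helper, n-grams are represented as plain strings rather than in biogram's internal notation, and the significance test between steps is left out.

```r
# Hypothetical helper (not part of biogram): extend each selected n-gram
# by one unigram on the left and one unigram on the right.
extend_ngrams <- function(ngrams, u) {
  left  <- as.vector(outer(u, ngrams, paste0))  # unigram pasted in front
  right <- as.vector(outer(ngrams, u, paste0))  # unigram pasted at the end
  unique(c(left, right))                        # drop duplicated candidates
}

u <- c("1", "2", "3")
selected <- c("12", "31")   # assume these bigrams passed the selection step
candidates <- extend_ngrams(selected, u)
candidates
```

In a stepwise procedure of this kind, only the candidates that pass the selection step again would be extended further, which keeps the number of tested n-grams far below the full set of combinations.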

The gap parameter determines whether construct_ngrams performs the feature selection on exact n-grams (gap equal to FALSE) or on all features at Hamming distance 1 from the n-gram (gap equal to TRUE).
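
To make the notion of Hamming distance concrete, the following base R sketch computes it for a few plain-string n-grams. The hamming function and the example strings are assumptions made for illustration only; biogram encodes gapped n-grams in its own notation, and how it treats the original n-gram itself is not stated here.

```r
# Hypothetical helper: number of positions at which two equal-length
# strings differ.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

ngram <- "123"
candidates <- c("123", "113", "321", "124", "223")
d <- vapply(candidates, hamming, integer(1), b = ngram, USE.NAMES = FALSE)

exact      <- candidates[d == 0]  # what an exact comparison would keep
neighbours <- candidates[d == 1]  # features differing at exactly one position
```

Here "321" is excluded from both sets because it differs from "123" at two positions.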

Value

a vector of n-grams.

Examples

# to make the example faster, we run construct_ngrams() on the 
# subset of data
deg_seqs <- degenerate(human_cleave[c(1L:100, 801L:900), 1L:9],
                       list(`1` = c(1, 6, 8, 10, 11, 18),
                            `2` = c(2, 13, 14, 16, 17),
                            `3` = c(5, 19, 20),
                            `4` = c(7, 9, 12, 15),
                            `5` = c(3, 4)))
bigrams <- construct_ngrams(human_cleave[c(1L:100, 801L:900), "tar"], deg_seqs, 1L:5, 2)

[Package biogram version 1.6.3 Index]