R: Count n-grams in sequences

count_ngrams {biogram}

R Documentation

Count n-grams in sequences

Description

Counts all n-grams or position-specific n-grams present in the input sequence(s).

Usage

count_ngrams(seq, n, u, d = 0, pos = FALSE, scale = FALSE, threshold = 0)

Arguments

`seq`	a vector or matrix describing sequence(s).
`n`	`integer` size of n-gram.
`u`	`integer`, `numeric` or `character` vector of all possible unigrams.
`d`	`integer` vector of distances between elements of n-gram (0 means consecutive elements). See Details.
`pos`	`logical`; if `TRUE`, position-specific n-grams are counted.
`scale`	`logical`; if `TRUE`, output data is normalized. May be applied only to the counts of n-grams without position information. See `Details`.
`threshold`	`integer`; if not equal to 0, data is binarized into two groups (greater than or equal to threshold vs. smaller than threshold).
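
To make the `threshold` argument concrete, here is a minimal base R sketch of the binarization it describes. It assumes the two groups are coded as 1 and 0, which the argument description does not state. `binarize` and `counts` are hypothetical names used for illustration and are not part of biogram.

```r
# Hypothetical helper (not part of biogram): binarize counts against a threshold.
# Assumption: counts >= threshold are coded 1, all other counts 0.
binarize <- function(counts, threshold = 0) {
  if (threshold == 0) return(counts)  # threshold = 0 leaves the data untouched
  (counts >= threshold) * 1L
}

# Toy count matrix: 2 sequences (rows) by 3 bigrams (columns).
counts <- matrix(c(0, 1, 2, 3, 0, 5), nrow = 2,
                 dimnames = list(NULL, c("1.1_0", "1.2_0", "2.1_0")))
binarize(counts, threshold = 2)
```

The comparison keeps the matrix shape and column names, so the binarized result can be used wherever the raw counts were.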

Details

A distance vector should always have length n - 1. For example, when n = 3, d = c(1,2) means A_A__A. For n = 4, d = c(2,0,1) means A__AA_A. If d has length 1, it is recycled to length n - 1.
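
The mapping from `n` and `d` to a gap pattern can be sketched in base R as follows. `gap_pattern` is a hypothetical helper written for this illustration and is not part of biogram; it only draws the pattern and does not count anything.

```r
# Hypothetical helper (not part of biogram): draw the gap pattern implied by n and d.
gap_pattern <- function(n, d = 0) {
  # Recycle a length-1 d to length n - 1, as described above.
  if (length(d) == 1) d <- rep(d, n - 1)
  stopifnot(length(d) == n - 1)
  # Each of the first n - 1 elements is followed by d[i] gap characters.
  gaps <- vapply(d, function(k) strrep("_", k), character(1))
  # Interleave elements and gaps, then append the last element.
  paste0(c(rbind(rep("A", n - 1), gaps), "A"), collapse = "")
}

gap_pattern(3, c(1, 2))     # "A_A__A"
gap_pattern(4, c(2, 0, 1))  # "A__AA_A"
gap_pattern(3)              # "AAA"
```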

n-gram names follow a fixed convention: they have three parts for position-specific n-grams and two parts otherwise. The parts are separated by _, and the . symbol separates elements within a part. The general naming scheme is POSITION_NGRAM_DISTANCE.

The optional POSITION part indicates the position of the n-gram in the sequence(s) and is present only if pos = TRUE. This part is always a single integer.

The NGRAM part is the sequence of elements in the n-gram. For example, 4.2.2 indicates the n-gram 422 (e.g. TCC).

The DISTANCE part is the vector of distance(s) between elements. For example, 0.0 indicates zero distances (a continuous n-gram), while 1.2 represents the distances for the n-gram A_A__A.

Examples of n-gram names:

46_4.4.4_0.1 : trigram 44_4 at position 46
12_2.1_2 : bigram 2__1 at position 12
8_1.1.1_0.0 : continuous trigram 111 at position 8
1.1.1_0.0 : continuous trigram 111 without position information
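
The naming convention can be taken apart again with plain string splitting. The sketch below is a hypothetical helper, not a biogram function, and it assumes that the unigram symbols themselves contain neither _ nor . characters.

```r
# Hypothetical helper (not part of biogram): split an n-gram name into its parts.
# Assumption: unigram symbols contain neither "_" nor ".".
parse_ngram_name <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  # Three parts mean a leading POSITION; two parts mean no position information.
  has_pos <- length(parts) == 3
  split_dots <- function(x) strsplit(x, ".", fixed = TRUE)[[1]]
  list(
    position = if (has_pos) as.integer(parts[1]) else NA_integer_,
    ngram    = split_dots(parts[length(parts) - 1]),
    distance = as.integer(split_dots(parts[length(parts)]))
  )
}

parse_ngram_name("46_4.4.4_0.1")  # position 46, elements 4 4 4, distances 0 1
parse_ngram_name("1.1.1_0.0")     # no position, elements 1 1 1, distances 0 0
```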

Value

A simple_triplet_matrix (from the slam package) in which columns represent n-grams and rows represent sequences. See Details for specifics of the naming convention.

Note

By default, the n-gram counts are stored in a memory-saving format. To convert the result to a 'classical' matrix, use the as.matrix function. See Examples for further information.
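
As a rough illustration of why the triplet format saves memory, the base R sketch below stores only the non-zero cells of a small matrix and then expands them into a dense one. The list `trip` is a toy stand-in built by hand for this example; it only mimics the idea of triplet storage and is not how the conversion is implemented in any package.

```r
# Toy triplet storage: row indices (i), column indices (j) and values (v)
# of the non-zero cells, plus the matrix dimensions.
trip <- list(i = c(1L, 2L, 2L), j = c(1L, 1L, 3L), v = c(2, 1, 4),
             nrow = 2L, ncol = 3L)

# Dense equivalent: start from zeros and fill in the stored cells.
dense <- matrix(0, nrow = trip$nrow, ncol = trip$ncol)
dense[cbind(trip$i, trip$j)] <- trip$v
dense
```

Because most n-grams never occur in a given sequence, most cells are zero, and storing only the non-zero triplets is much smaller than the full matrix.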

Examples

# load the package that provides count_ngrams
library(biogram)
# count trigrams without position information for nucleotides
count_ngrams(sample(1L:4, 50, replace = TRUE), 3, 1L:4, pos = FALSE)
# count position-specific trigrams from multiple nucleotide sequences
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
ngrams <- count_ngrams(seqs, 3, 1L:4, pos = TRUE)
# output results of the n-gram counting to screen
as.matrix(ngrams)
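
For intuition about what is being counted, the following base R sketch counts n-grams without position information in a single sequence. `count_observed` is a hypothetical helper, not the package's implementation: it reports only the n-grams that actually occur, and it makes no claim about which columns count_ngrams itself returns.

```r
# Hypothetical helper (not part of biogram): count gapped n-grams in one sequence.
count_observed <- function(x, n, d = 0) {
  if (length(d) == 1) d <- rep(d, n - 1)
  stopifnot(length(d) == n - 1)
  # Offsets of the n elements relative to the start position.
  offsets <- c(0, cumsum(d + 1))
  starts <- seq_len(max(0, length(x) - max(offsets)))
  # A sequence shorter than the pattern contains no n-grams.
  if (length(starts) == 0) return(setNames(integer(0), character(0)))
  grams <- vapply(starts,
                  function(s) paste(x[s + offsets], collapse = "."),
                  character(1))
  # Append the DISTANCE part so names follow the NGRAM_DISTANCE scheme.
  counts <- table(paste0(grams, "_", paste(d, collapse = ".")))
  setNames(as.vector(counts), names(counts))
}

x <- c(1, 2, 1, 2, 1)
count_observed(x, 2)         # contiguous bigrams: 1.2_0 and 2.1_0, twice each
count_observed(x, 2, d = 1)  # bigrams with one gap: 1.1_1 twice, 2.2_1 once
```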

[Package biogram version 1.6.3 Index]