create_ifl {keyperm} | R Documentation |
Create an Indexed Frequency List
Description
The keyperm package stores frequency lists in a special data structure called an indexed frequency list. Currently, such a list can be created from a term-document matrix (tdm) object as implemented in the tm package.
Indexed frequency lists are essentially frequency lists stored in a three-column format,
similar to the simple triplet matrix used internally by tm to store term-document matrices.
The first column stores the number of document i, the second the number of term j, and the third the
frequency with which term j occurs in document i. Zero occurrences are omitted.
All columns contain integers, and the frequency list is sorted by document.
The object returned is of class indexed_frequency_list. In addition to the actual frequency
list, it contains an index for fast access as well as the pre-computed total number of tokens per
document and total occurrences per term.
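The three-column layout described above can be illustrated in base R. This is only a sketch of the idea, not the package's implementation: the toy matrix, the column names doc, term and freq, and the doc_start index are assumptions made for illustration.

```r
# Toy term-document matrix: rows are terms, columns are documents
# (the same orientation as a tdm in tm). All values are made up.
m <- matrix(
  c(2L, 0L, 1L,
    0L, 3L, 0L,
    1L, 1L, 0L),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("apple", "pear", "plum"), c("d1", "d2", "d3"))
)

# Positions of all non-zero cells: column "row" is the term index,
# column "col" is the document index.
nz <- which(m > 0L, arr.ind = TRUE)

# Three-column frequency list; the column names are hypothetical.
triplets <- data.frame(
  doc  = as.integer(nz[, "col"]),
  term = as.integer(nz[, "row"]),
  freq = m[nz]
)

# Sort by document (and by term within each document).
triplets <- triplets[order(triplets$doc, triplets$term), ]
rownames(triplets) <- NULL

# One possible index for fast access (an assumption, not the package's
# actual index): the first row belonging to each document.
doc_start <- match(seq_len(ncol(m)), triplets$doc)

# Pre-computed totals.
tokens_per_doc       <- colSums(m)  # total tokens in each document
occurrences_per_term <- rowSums(m)  # total occurrences of each term

print(triplets)
```

Because the rows are sorted by document, all entries for one document are contiguous, which is what makes a per-document start index sufficient for fast lookup in this sketch.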
Usage
create_ifl(
tdm,
subset_terms = 1:dim(tdm)[1],
subset_docs = 1:dim(tdm)[2],
corpus
)
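A call following the signature above might look like the sketch below. The example texts and the choice of the first two documents as corpus A are invented for illustration. The code runs only if both tm and keyperm are installed, and it assumes that create_ifl is exported by keyperm.

```r
# Four tiny example documents (made up for illustration).
texts <- c(
  "apples and pears and plums",
  "pears are sweet",
  "plums are sour",
  "apples are crisp"
)

# Run only when both packages are available.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("keyperm", quietly = TRUE)) {
  # Build a term-document matrix with tm.
  tdm <- tm::TermDocumentMatrix(tm::VCorpus(tm::VectorSource(texts)))

  # Treat documents 1 and 2 as corpus A; subset_terms and subset_docs
  # keep their defaults, so all terms and documents are considered.
  ifl <- keyperm::create_ifl(tdm, corpus = 1:2)

  # The help page states that the result has this class.
  print(inherits(ifl, "indexed_frequency_list"))
}
```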
Arguments
tdm |
a term-document matrix (tdm) from the tm package. Currently, this is the only supported input, but others may be added in later versions. |
subset_terms |
vector of terms to be considered. Can be integer (indices) or boolean. Terms not included are still counted in the total number of tokens per document. |
subset_docs |
vector of documents to be considered. Can be integer (indices) or boolean. Excluded documents do not contribute to the total number of occurrences of a term. |
corpus |
vector indicating which documents belong to corpus A (first corpus). Can be integer (indices) or boolean. Currently, only comparisons of two corpora are supported. |
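The effect of subset_terms and subset_docs on the pre-computed totals can be sketched in base R. This illustrates the semantics described in the table under stated assumptions and is not the package's code: the toy matrix and the helper as_index are hypothetical.

```r
# Toy term-document matrix: rows are terms, columns are documents.
m <- matrix(
  c(2L, 0L, 1L,
    0L, 3L, 0L,
    1L, 1L, 0L),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("apple", "pear", "plum"), c("d1", "d2", "d3"))
)

# Hypothetical helper: accept either a boolean or an integer selection
# and return integer indices.
as_index <- function(x) if (is.logical(x)) which(x) else as.integer(x)

subset_terms <- as_index(c(TRUE, FALSE, TRUE))  # keep "apple" and "plum"
subset_docs  <- as_index(c(1, 2))               # keep "d1" and "d2"

# Tokens per document use ALL terms, including the excluded "pear".
tokens_per_doc <- colSums(m[, subset_docs, drop = FALSE])

# Occurrences per term use only the retained documents, so "d3"
# does not contribute.
occurrences_per_term <- rowSums(m[subset_terms, subset_docs, drop = FALSE])

print(tokens_per_doc)
print(occurrences_per_term)
```

In this toy case, d2 still counts four tokens although three of them come from the excluded term pear, while apple counts two occurrences rather than three because d3 is excluded.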
Value
A list with class indexed_frequency_list containing the following components: