batch_compute |
Compute batches |
big_tokenize_transform |
String tokenization and transformation for big data sets |
bytes_converter |
bytes converter of a text file ( KB, MB or GB ) |
cluster_frequency |
Frequencies of an existing cluster object |
cosine_distance |
cosine distance of two character strings (each string consists of more than one word) |
COS_TEXT |
Cosine similarity for text documents |
Count_Rows |
Number of rows of a file |
dense_2sparse |
convert a dense matrix to a sparse matrix |
dice_distance |
dice similarity of words using n-grams |
dims_of_word_vecs |
dimensions of a word vectors file |
Doc2Vec |
Conversion of text documents to word-vector-representation features ( Doc2Vec ) |
JACCARD_DICE |
Jaccard or Dice similarity for text documents |
levenshtein_distance |
levenshtein distance of two words |
load_sparse_binary |
load a sparse matrix in binary format |
matrix_sparsity |
sparsity percentage of a sparse matrix |
read_characters |
read a specific number of characters from a text file |
read_rows |
read a specific number of rows from a text file |
save_sparse_binary |
save a sparse matrix in binary format |
select_predictors |
Exclude highly correlated predictors |
sparse_Means |
rowMeans and colMeans for a sparse matrix |
sparse_Sums |
rowSums and colSums for a sparse matrix |
sparse_term_matrix |
Term matrices and statistics ( document-term-matrix, term-document-matrix) |
TEXT_DOC_DISSIM |
Dissimilarity calculation of text documents |
text_file_parser |
text file parser |
text_intersect |
intersection of words or letters in tokenized text |
tokenize_transform_text |
String tokenization and transformation ( character string or path to a file ) |
tokenize_transform_vec_docs |
String tokenization and transformation ( vector of documents ) |
token_stats |
token statistics |
utf_locale |
utf-locale for the available languages |
vocabulary_parser |
returns the vocabulary counts for small or medium files ( xml and other formats ) |
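
The three string-distance entries in the index (cosine_distance, dice_distance, levenshtein_distance) name standard metrics. As a rough illustration of what such metrics compute, here is a minimal Python sketch based on the textbook definitions. It is not the package's implementation: the function signatures, the default n-gram size `n = 2`, whitespace word splitting, and the "1 minus similarity" convention are all assumptions made for this example, and the package's own conventions may differ.

```python
from collections import Counter
from math import sqrt


def cosine_distance(a: str, b: str) -> float:
    """1 - cosine similarity of the word-count vectors of two strings.

    Assumption: words are separated by whitespace.
    """
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    # An empty string has no direction; treat it as maximally distant.
    return 1.0 - dot / norm if norm else 1.0


def dice_distance(w1: str, w2: str, n: int = 2) -> float:
    """1 - Dice coefficient over the sets of character n-grams of two words."""
    g1 = {w1[i:i + n] for i in range(len(w1) - n + 1)}
    g2 = {w2[i:i + n] for i in range(len(w2) - n + 1)}
    if not g1 and not g2:
        return 0.0  # both words shorter than n: nothing to compare
    return 1.0 - 2 * len(g1 & g2) / (len(g1) + len(g2))


def levenshtein_distance(w1: str, w2: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(w2) + 1))  # distances from the empty prefix of w1
    for i, c1 in enumerate(w1, 1):
        cur = [i]
        for j, c2 in enumerate(w2, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution or match
        prev = cur
    return prev[-1]
```

For example, under these assumptions `levenshtein_distance("kitten", "sitting")` gives 3, and `dice_distance("night", "nacht")` gives 0.75 because the two words share only the bigram "ht" out of four bigrams each.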