batch_compute | Compute batches |
big_tokenize_transform | String tokenization and transformation for big data sets |
bytes_converter | size of a text file converted to KB, MB or GB |
cluster_frequency | Frequencies of an existing cluster object |
cosine_distance | cosine distance of two character strings (each string consists of more than one word) |
COS_TEXT | Cosine similarity for text documents |
Count_Rows | Number of rows of a file |
dense_2sparse | convert a dense matrix to a sparse matrix |
dice_distance | dice similarity of words using n-grams |
dims_of_word_vecs | dimensions of a word vectors file |
Doc2Vec | Conversion of text documents to word-vector-representation features ( Doc2Vec ) |
JACCARD_DICE | Jaccard or Dice similarity for text documents |
levenshtein_distance | levenshtein distance of two words |
load_sparse_binary | load a sparse matrix in binary format |
matrix_sparsity | sparsity percentage of a sparse matrix |
read_characters | read a specific number of characters from a text file |
read_rows | read a specific number of rows from a text file |
save_sparse_binary | save a sparse matrix in binary format |
select_predictors | Exclude highly correlated predictors |
sparse_Means | rowMeans and colMeans for a sparse matrix |
sparse_Sums | rowSums and colSums for a sparse matrix |
sparse_term_matrix | Term matrices and statistics ( document-term-matrix, term-document-matrix ) |
TEXT_DOC_DISSIM | Dissimilarity calculation of text documents |
text_file_parser | text file parser |
text_intersect | intersection of words or letters in tokenized text |
tokenize_transform_text | String tokenization and transformation ( character string or path to a file ) |
tokenize_transform_vec_docs | String tokenization and transformation ( vector of documents ) |
token_stats | token statistics |
utf_locale | utf-locale for the available languages |
vocabulary_parser | returns the vocabulary counts for small or medium files ( xml and other formats ) |