Doc2Vec {textTinyR} | R Documentation |
Conversion of text documents to word-vector-representation features ( Doc2Vec )
Description
Conversion of text documents to word-vector-representation features ( Doc2Vec )
Usage
# utl <- Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL,
# print_every_rows = 10000, verbose = FALSE,
# copy_data = FALSE)
Details
The pre_processed_wv method should be called after the Doc2Vec class has been initialized with the copy_data parameter set to TRUE. It allows the pre-processed word-vectors to be inspected.
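As a minimal sketch of the inspection step described above, the snippet below initializes the class with copy_data = TRUE and then calls pre_processed_wv(). The token list and the example file path are the ones used in the Examples section; the variable names init_copy and res_wv are only illustrative, and str() is used because the exact structure of the returned object is not described on this page.

```r
library(textTinyR)

# tokenized documents and word-vector file, as in the Examples section
tok_text = list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))
PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")

# copy_data = TRUE so that the pre-processed word-vectors can be inspected afterwards
init_copy = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH, copy_data = TRUE)

# retrieve the pre-processed word-vectors and look at their structure
res_wv = init_copy$pre_processed_wv()
str(res_wv)
```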
The global_term_weights method belongs to the sparse_term_matrix R6 class of the textTinyR package. To obtain valid global term weights, build the term matrix with the sparse_term_matrix class, setting the tf_idf parameter to FALSE and the normalize parameter to NULL. In the Doc2Vec class, if method is "idf" then the global_term_weights parameter must not be NULL.
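The following is a hedged sketch of that workflow, not a definitive recipe. It assumes that sparse_term_matrix$new() accepts a character vector of documents through vector_data, and that Term_Matrix() accepts the to_lower and split_string arguments in addition to the tf_idf and normalize parameters named above. The toy corpus docs and the variable names are illustrative only. The documents are written so that their terms match the tokens passed to Doc2Vec.

```r
library(textTinyR)

# hypothetical toy corpus: one element per document
docs = c("the result of", "doc2vec are vector features")

# build a document-term matrix from the raw documents
stm = sparse_term_matrix$new(vector_data = docs, file_data = NULL,
                             document_term_matrix = TRUE)

# tf_idf = FALSE and normalize = NULL, as required for global_term_weights
tm = stm$Term_Matrix(to_lower = TRUE, split_string = TRUE,
                     tf_idf = FALSE, normalize = NULL,
                     threads = 1, verbose = FALSE)

# global term weights of the corpus
gl_term_w = stm$global_term_weights()
str(gl_term_w)

# pass the weights to the 'idf' method of Doc2Vec
tok_text = list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))
PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")
init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)
out_idf = init$doc2vec_methods(method = "idf", global_term_weights = gl_term_w, threads = 1)
```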
Explanation of the various methods :
- sum_sqrt
For a single sublist of the token list: the word-vectors of all words in the sublist are accumulated (summed element-wise) into one vector whose length equals the word-vector dimension (INITIAL_WORD_VECTOR). A scalar is then computed from INITIAL_WORD_VECTOR: each element is raised to the power of 2.0, the squared elements are summed, and the square root of that sum is taken. Finally, INITIAL_WORD_VECTOR is divided by this scalar.
- min_max_norm
For a single sublist of the token list: the word-vector of each word in the sublist is first min-max normalized, and the normalized vectors are then accumulated into a single vector whose length equals that of the initial word-vector.
- idf
For a single sublist of the token list: the word-vector of each term in the sublist is multiplied by the corresponding idf value from the global term weights.
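The three methods above can be illustrated in base R for a single sublist of tokens. This is only a sketch of the arithmetic as described in the text, not the package's internal implementation. The word-vector values and idf weights are made up, min-max normalization is assumed to be applied to each word-vector separately, and the final summation in the idf case is an assumption made by analogy with the other two methods.

```r
# hypothetical 3-dimensional word-vectors, one row per word
wv <- rbind(
  the    = c(1, 2, 2),
  result = c(0, 1, 3),
  of     = c(2, 0, 1)
)
tokens <- c("the", "result", "of")   # one sublist of the token list
vecs <- wv[tokens, , drop = FALSE]

# sum_sqrt: accumulate the word-vectors, then divide by sqrt(sum(x^2))
acc <- colSums(vecs)                      # INITIAL_WORD_VECTOR
sum_sqrt_vec <- acc / sqrt(sum(acc^2.0))

# min_max_norm: min-max normalize each word-vector, then accumulate
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max_vec <- colSums(t(apply(vecs, 1, min_max)))

# idf: multiply each word-vector by the idf weight of its term
idf <- c(the = 0.1, result = 1.5, of = 0.2)   # hypothetical weights
weighted <- vecs * idf[tokens]                # row i is scaled by idf[i]
idf_vec <- colSums(weighted)                  # summation is an assumption

print(sum_sqrt_vec)
print(min_max_vec)
print(idf_vec)
```

With sum_sqrt the resulting vector always has unit Euclidean length, which makes documents of different lengths comparable.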
For each method, the output might differ slightly depending on whether the copy_data parameter is TRUE or FALSE.
Value
a matrix
Methods
Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE)
--------------
doc2vec_methods(method = "sum_sqrt", global_term_weights = NULL, threads = 1)
--------------
pre_processed_wv()
Methods
Public methods
Method new()
Usage
Doc2Vec$new( token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE )
Arguments
token_list
either NULL or a list of tokenized text documents
word_vector_FILE
a valid path to a text file, where the word-vectors are saved
print_every_rows
a numeric value greater than 1 specifying the print interval. Frequent output in the R session can slow down the function, especially for big files.
verbose
either TRUE or FALSE. If TRUE then information will be printed out in the R session.
copy_data
either TRUE or FALSE. If FALSE then a pointer will be created and no copy of the initial data takes place (memory efficient, especially for big datasets). This is an alternative way to pre-process the data.
Method doc2vec_methods()
Usage
Doc2Vec$doc2vec_methods( method = "sum_sqrt", global_term_weights = NULL, threads = 1 )
Arguments
method
a character string specifying the method to use. One of sum_sqrt, min_max_norm or idf. See the details section for more information.
global_term_weights
either NULL or the output of the global_term_weights method of the textTinyR package. See the details section for more information.
threads
a numeric value specifying the number of cores to run in parallel
Method pre_processed_wv()
Usage
Doc2Vec$pre_processed_wv()
Method clone()
The objects of this class are cloneable with this method.
Usage
Doc2Vec$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
library(textTinyR)
#---------------------------------
# tokenized text in form of a list
#---------------------------------
tok_text = list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))
#-------------------------
# path to the word vectors
#-------------------------
PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")
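#----------------------------------------------------------
# initialize the class with the tokens and the word vectors
#----------------------------------------------------------
init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)
#-----------------------------------------------------
# compute the document features (returned as a matrix)
#-----------------------------------------------------
out = init$doc2vec_methods(method = "sum_sqrt")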