Doc2Vec {textTinyR}

R Documentation

Conversion of text documents to word-vector-representation features ( Doc2Vec )

Description

Conversion of text documents to word-vector-representation features ( Doc2Vec )

Usage

# utl <- Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL,
#                    print_every_rows = 10000, verbose = FALSE,
#                    copy_data = FALSE)

Details

The pre_processed_wv method should be called after the Doc2Vec class has been initialized with the copy_data parameter set to TRUE, in order to inspect the pre-processed word-vectors.

The global_term_weights method is part of the sparse_term_matrix R6 class of the textTinyR package. The correct global_term_weights can be obtained by using the sparse_term_matrix class with the tf_idf parameter set to FALSE and the normalize parameter set to NULL. In the Doc2Vec class, if method equals "idf" then the global_term_weights parameter must not be NULL.
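The workflow described above might look like the following minimal sketch. The two toy documents are hypothetical, and the argument names (vector_data, file_data, document_term_matrix, split_string) are assumptions that should be checked against the sparse_term_matrix documentation of the installed textTinyR version.

```r
# Guarded so the snippet is a no-op when textTinyR is not installed.
if (requireNamespace("textTinyR", quietly = TRUE)) {

  # Hypothetical toy corpus: one character string per document.
  docs <- c("the result of", "doc2vec are vector features")

  stm <- textTinyR::sparse_term_matrix$new(vector_data = docs,
                                           file_data = NULL,
                                           document_term_matrix = TRUE)

  # tf_idf = FALSE and normalize = NULL, as the details above require.
  tm <- stm$Term_Matrix(split_string = TRUE, tf_idf = FALSE, normalize = NULL)

  # Global term weights for the terms of the term matrix.
  gl_term_w <- stm$global_term_weights()
  str(gl_term_w)

  # The result would then be passed to an initialized Doc2Vec object, e.g.
  # init$doc2vec_methods(method = "idf", global_term_weights = gl_term_w)
}
```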

Explanation of the various methods :

sum_sqrt: Considering a single sublist of the token list: the word-vectors of the words in the sublist are accumulated into one vector whose length equals the length of a word-vector (INITIAL_WORD_VECTOR). A scalar is then computed from INITIAL_WORD_VECTOR: its elements are raised to the power of 2.0, the results are summed, and the square root of the sum is taken. Finally, INITIAL_WORD_VECTOR is divided by this scalar.
min_max_norm: Considering a single sublist of the token list: the word-vector of each word in the sublist is first min-max normalized, and the normalized vectors are then accumulated into one vector whose length equals the length of the initial word-vector.
idf: Considering a single sublist of the token list: the word-vector of each term in the sublist is multiplied by the corresponding idf value from the global term weights.
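The three methods can be illustrated with a minimal base-R sketch on made-up data. It is not the package's implementation: the word_vecs matrix and the idf_weights values are hypothetical, and two details are assumptions not stated above, namely that min-max normalization is applied to each word-vector separately and that the idf-weighted vectors are summed into one document vector.

```r
# Toy word-vector lookup: one row per word, three dimensions (made-up values).
word_vecs <- rbind(
  the    = c(1, 2, 2),
  result = c(0, 4, 1),
  of     = c(2, 0, 3)
)
tokens <- c("the", "result", "of")   # a single sublist of the token list

# sum_sqrt: accumulate the word-vectors, then divide by sqrt(sum(v^2)).
initial_word_vector <- colSums(word_vecs[tokens, , drop = FALSE])
sum_sqrt <- initial_word_vector / sqrt(sum(initial_word_vector^2))

# min_max_norm: rescale each word-vector to [0, 1], then accumulate.
# (Assumption: the normalization is done per word-vector.)
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
normalized <- t(apply(word_vecs[tokens, , drop = FALSE], 1, min_max))
min_max_norm <- colSums(normalized)

# idf: multiply each word-vector by its idf weight (hypothetical values here).
# (Assumption: the weighted vectors are then summed.)
idf_weights <- c(the = 0.1, result = 1.5, of = 0.2)
weighted <- word_vecs[tokens, , drop = FALSE] * idf_weights[tokens]
idf <- colSums(weighted)

print(sum_sqrt)
print(min_max_norm)
print(idf)
```

In this sketch the sum_sqrt result has unit Euclidean length, which is the effect of dividing by the square root of the summed squares.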

The output of each method might differ slightly depending on whether the copy_data parameter is set to TRUE or FALSE.

Value

a matrix

Methods

Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE)
--------------
doc2vec_methods(method = "sum_sqrt", global_term_weights = NULL, threads = 1)
--------------
pre_processed_wv()

Methods

Method `new()`

Usage

Doc2Vec$new(
  token_list = NULL,
  word_vector_FILE = NULL,
  print_every_rows = 10000,
  verbose = FALSE,
  copy_data = FALSE
)

Arguments

token_list: either NULL or a list of tokenized text documents
word_vector_FILE: a valid path to a text file, where the word-vectors are saved
print_every_rows: a numeric value greater than 1 specifying the print intervals. Frequent output in the R session can slow down the function, especially for big files.
verbose: either TRUE or FALSE. If TRUE then information will be printed out in the R session.
copy_data: either TRUE or FALSE. If FALSE then a pointer will be created and no copy of the initial data takes place (memory efficient especially for big datasets). This is an alternative way to pre-process the data.

Method `doc2vec_methods()`

Usage

Doc2Vec$doc2vec_methods(
  method = "sum_sqrt",
  global_term_weights = NULL,
  threads = 1
)

Arguments

method: a character string specifying the method to use. One of sum_sqrt, min_max_norm or idf. See the details section for more information.
global_term_weights: either NULL or the output of the global_term_weights method of the textTinyR package. See the details section for more information.
threads: a numeric value specifying the number of cores to run in parallel

Method `pre_processed_wv()`

Usage

Doc2Vec$pre_processed_wv()

Method `clone()`

The objects of this class are cloneable with this method.

Usage

Doc2Vec$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples


library(textTinyR)

#---------------------------------
# tokenized text in form of a list
#---------------------------------

tok_text = list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))

#-------------------------
# path to the word vectors
#-------------------------

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")


init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)


out = init$doc2vec_methods(method = "sum_sqrt")

[Package textTinyR version 1.1.8 Index]
