skip_gram_sample_with_text_vocab {tfaddons} | R Documentation |
Skip gram sample with text vocab
Description
Skip-gram sampling with a text vocabulary file.
Usage
skip_gram_sample_with_text_vocab(
input_tensor,
vocab_freq_file,
vocab_token_index = 0,
vocab_token_dtype = tf$string,
vocab_freq_index = 1,
vocab_freq_dtype = tf$float64,
vocab_delimiter = ",",
vocab_min_count = NULL,
vocab_subsampling = NULL,
corpus_size = NULL,
min_skips = 1,
max_skips = 5,
start = 0,
limit = -1,
emit_self_as_target = FALSE,
batch_size = NULL,
batch_capacity = NULL,
seed = NULL,
name = NULL
)
Arguments
input_tensor |
A rank-1 'Tensor' from which to generate skip-gram candidates. |
vocab_freq_file |
'string' specifying full file path to the text vocab file. |
vocab_token_index |
'int' specifying which column in the text vocab file contains the tokens. |
vocab_token_dtype |
'DType' specifying the format of the tokens in the text vocab file. |
vocab_freq_index |
'int' specifying which column in the text vocab file contains the frequency counts of the tokens. |
vocab_freq_dtype |
'DType' specifying the format of the frequency counts in the text vocab file. |
vocab_delimiter |
'string' specifying the delimiter used in the text vocab file. |
vocab_min_count |
(Optional) 'int', 'float', or scalar 'Tensor' specifying the minimum frequency threshold (from 'vocab_freq_file') for a token to be kept in 'input_tensor'. Its type should match 'vocab_freq_dtype'. |
vocab_subsampling |
(Optional) 'float' specifying frequency proportion threshold for tokens from 'input_tensor'. Tokens that occur more frequently will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details. |
corpus_size |
(Optional) 'int', 'float', or scalar 'Tensor' specifying the total number of tokens in the corpus (e.g., the sum of all frequency counts in 'vocab_freq_file'). Used with 'vocab_subsampling' to down-sample frequently occurring tokens. If this is specified, 'vocab_freq_file' and 'vocab_subsampling' must also be specified. If 'corpus_size' is needed but not supplied, it is calculated from 'vocab_freq_file'. Supplying your own value is useful if you have already removed infrequent tokens (frequency < 'vocab_min_count') from the vocabulary file to save memory in the internal token lookup table; otherwise the unused tokens' variables waste memory. A user-supplied 'corpus_size' must be greater than or equal to the sum of all frequency counts in 'vocab_freq_file'. |
min_skips |
'int' or scalar 'Tensor' specifying the minimum window size to randomly use for each token. Must be >= 0 and <= 'max_skips'. If 'min_skips' and 'max_skips' are both 0, the only label emitted is the token itself. |
max_skips |
'int' or scalar 'Tensor' specifying the maximum window size to randomly use for each token. Must be >= 0. |
start |
'int' or scalar 'Tensor' specifying the position in 'input_tensor' from which to start generating skip-gram candidates. |
limit |
'int' or scalar 'Tensor' specifying the maximum number of elements in 'input_tensor' to use in generating skip-gram candidates. -1 means to use the rest of the 'Tensor' after 'start'. |
emit_self_as_target |
'bool' or scalar 'Tensor' specifying whether to emit each token as a label for itself. |
batch_size |
(Optional) 'int' specifying batch size of returned 'Tensors'. |
batch_capacity |
(Optional) 'int' specifying batch capacity for the queue used for batching returned 'Tensors'. Only has an effect if 'batch_size' > 0. Defaults to 100 * 'batch_size' if not specified. |
seed |
(Optional) 'int' used to create a random seed for window size and subsampling. See 'set_random_seed' for behavior. |
name |
(Optional) A 'string' name or a name scope for the operations. |
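The interaction between 'vocab_subsampling' and 'corpus_size' can be sketched in base R. The block below evaluates the discard probability from Eq. 5 of the paper cited under 'vocab_subsampling', P(w) = 1 - sqrt(t / f(w)), where f(w) is a token's relative frequency and t is the threshold. The helper name 'discard_prob' is hypothetical, and the formula is the one from the paper; the library's internal computation may differ in detail.

```r
# discard_prob() is a hypothetical helper for illustration; it is not part
# of tfaddons. It evaluates Eq. 5 of Mikolov et al. (arXiv:1310.4546):
#   P(discard w) = 1 - sqrt(threshold / f(w))
# where f(w) = count(w) / corpus_size.
discard_prob <- function(counts, corpus_size, threshold) {
  rel_freq <- counts / corpus_size
  # Tokens rarer than the threshold give a negative value; clamp those to 0
  # so they are never discarded.
  pmax(1 - sqrt(threshold / rel_freq), 0)
}

# Frequency counts taken from the example vocabulary file in Details.
counts <- c(bonjour = 42, hello = 777, hola = 99)
corpus_size <- sum(counts)

# An unrealistically large threshold, chosen only because the toy corpus
# is tiny; real corpora would use values such as 1e-3 or 1e-5.
p <- discard_prob(counts, corpus_size, threshold = 0.1)
print(round(p, 3))
```

With these toy counts, the most frequent token ('hello') gets the highest discard probability, and a token whose relative frequency is below the threshold ('bonjour') is always kept.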
Details
Wrapper around 'skip_gram_sample()' for use with a text vocabulary file. The vocabulary file is expected to be a plain-text file, with lines of 'vocab_delimiter'-separated columns. The 'vocab_token_index' column should contain the vocabulary term, while the 'vocab_freq_index' column should contain the number of times that term occurs in the corpus. For example, with a text vocabulary file of:

    bonjour,fr,42
    hello,en,777
    hola,es,99

you should set 'vocab_delimiter=","', 'vocab_token_index=0', and 'vocab_freq_index=2'.

See 'skip_gram_sample()' documentation for more details about the skip-gram sampling process.
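To check how the column indices map onto a vocabulary file before passing it to the function, the file can be inspected with base R. This is a minimal sketch that only uses 'read.table()'; the temporary file, the variable names, and the threshold of 50 are illustrative assumptions. Note that the function's column indices are 0-based, while R's are 1-based.

```r
# Write the three-line example vocabulary to a temporary file.
vocab_path <- tempfile(fileext = ".csv")
writeLines(c("bonjour,fr,42", "hello,en,777", "hola,es,99"), vocab_path)

# Column indices as they would be passed to the function (0-based).
vocab_token_index <- 0
vocab_freq_index <- 2

vocab <- read.table(vocab_path, sep = ",", header = FALSE,
                    stringsAsFactors = FALSE)

# R indexes columns from 1, hence the "+ 1".
tokens <- vocab[[vocab_token_index + 1]]
freqs <- vocab[[vocab_freq_index + 1]]

# Total number of tokens in the corpus, i.e. a valid 'corpus_size'.
corpus_size <- sum(freqs)

# Tokens that would survive a (hypothetical) vocab_min_count of 50.
kept <- tokens[freqs >= 50]

print(corpus_size)
print(kept)
```

Computing 'corpus_size' this way before filtering the file is one way to obtain a value that satisfies the requirement that it be at least the sum of the counts remaining in 'vocab_freq_file'.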
Value
A 'list' containing (token, label) 'Tensors'. Each output 'Tensor' is of rank-1 and has the same type as 'input_tensor'. The 'Tensors' will be of length 'batch_size'; if 'batch_size' is not specified, they will be of random length, though they will be in sync with each other as long as they are evaluated together.
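The shape of the returned (token, label) pairs, and why their length is random when 'batch_size' is not set, can be illustrated without TensorFlow. The base-R function below is a hypothetical sketch, not the library's implementation: it assumes that a window size is drawn uniformly from 'min_skips' to 'max_skips' for each token and that every neighbour inside that window becomes a label. It omits 'start', 'limit', vocabulary filtering, and subsampling.

```r
# skip_gram_pairs() is a hypothetical illustration, not part of tfaddons.
skip_gram_pairs <- function(tokens, min_skips = 1, max_skips = 5,
                            emit_self_as_target = FALSE) {
  stopifnot(min_skips >= 0, min_skips <= max_skips)
  n <- length(tokens)
  out_tokens <- character(0)
  out_labels <- character(0)
  for (i in seq_len(n)) {
    # Draw this token's window size uniformly from min_skips..max_skips.
    window <- min_skips + sample.int(max_skips - min_skips + 1, 1) - 1
    lo <- max(1, i - window)
    hi <- min(n, i + window)
    for (j in lo:hi) {
      # Skip the token itself unless it should be emitted as its own label.
      if (j == i && !emit_self_as_target) next
      out_tokens <- c(out_tokens, tokens[i])
      out_labels <- c(out_labels, tokens[j])
    }
  }
  list(tokens = out_tokens, labels = out_labels)
}

set.seed(1)
pairs <- skip_gram_pairs(c("the", "quick", "brown", "fox"),
                         min_skips = 1, max_skips = 2)
print(data.frame(token = pairs$tokens, label = pairs$labels))
```

Because the window size is random, the number of pairs varies between runs, but the two output vectors always have the same length and line up element by element.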
Raises
ValueError: If 'vocab_token_index' or 'vocab_freq_index' is less than 0 or exceeds the number of columns in 'vocab_freq_file'; if 'vocab_token_index' and 'vocab_freq_index' are both set to the same column; or if any token in 'vocab_freq_file' has a negative frequency.
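The same conditions can be checked up front in R, before the vocabulary file is handed to the function. The helper below is a hypothetical sketch, not part of tfaddons; it assumes 0-based column indices, so an index equal to the number of columns is treated as out of range.

```r
# check_vocab_columns() is a hypothetical pre-flight check that mirrors the
# error conditions listed above; it is not part of tfaddons.
check_vocab_columns <- function(token_index, freq_index, n_columns, freqs) {
  for (idx in c(token_index, freq_index)) {
    # Assumption: indices are 0-based, so valid values are 0..n_columns - 1.
    if (idx < 0 || idx >= n_columns) {
      stop("column index out of range: ", idx)
    }
  }
  if (token_index == freq_index) {
    stop("token and frequency columns must differ")
  }
  if (any(freqs < 0)) {
    stop("frequencies must be non-negative")
  }
  invisible(TRUE)
}

# Passes silently for the three-column example file from Details.
check_vocab_columns(token_index = 0, freq_index = 2,
                    n_columns = 3, freqs = c(42, 777, 99))
```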