skip_gram_sample_with_text_vocab {tfaddons}    R Documentation

Skip gram sample with text vocab

Description

Skip-gram sampling with a text vocabulary file.

Usage

skip_gram_sample_with_text_vocab(
  input_tensor,
  vocab_freq_file,
  vocab_token_index = 0,
  vocab_token_dtype = tf$string,
  vocab_freq_index = 1,
  vocab_freq_dtype = tf$float64,
  vocab_delimiter = ",",
  vocab_min_count = NULL,
  vocab_subsampling = NULL,
  corpus_size = NULL,
  min_skips = 1,
  max_skips = 5,
  start = 0,
  limit = -1,
  emit_self_as_target = FALSE,
  batch_size = NULL,
  batch_capacity = NULL,
  seed = NULL,
  name = NULL
)

Arguments

input_tensor

A rank-1 'Tensor' from which to generate skip-gram candidates.

vocab_freq_file

'string' specifying the full path to the text vocab file.

vocab_token_index

'int' specifying the zero-based index of the column in the text vocab file that contains the tokens.

vocab_token_dtype

'DType' specifying the format of the tokens in the text vocab file.

vocab_freq_index

'int' specifying the zero-based index of the column in the text vocab file that contains the frequency counts of the tokens.

vocab_freq_dtype

'DType' specifying the format of the frequency counts in the text vocab file.

vocab_delimiter

'string' specifying the delimiter used in the text vocab file.

vocab_min_count

(Optional) 'int', 'float', or scalar 'Tensor' specifying the minimum frequency (as listed in 'vocab_freq_file') that a token must have to be kept in 'input_tensor'. Its type should match 'vocab_freq_dtype'.

vocab_subsampling

(Optional) 'float' specifying frequency proportion threshold for tokens from 'input_tensor'. Tokens that occur more frequently will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details.
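
As a rough illustration of the down-sampling idea, the sketch below evaluates the discard probability from Eq. 5 of the cited paper, P(w) = 1 - sqrt(t / f(w)), in base R. The helper name discard_prob and the frequency counts are hypothetical, and the computation performed inside the function may differ in detail from the paper's formula.

```r
# Sketch only: Eq. 5 of the cited paper, not the package's internal code.
# discard_prob() is a hypothetical helper introduced for illustration.
discard_prob <- function(freq, corpus_size, threshold) {
  f <- freq / corpus_size              # relative frequency of each token
  pmax(0, 1 - sqrt(threshold / f))     # clamp at 0: rare tokens are never dropped
}

counts <- c(bonjour = 42, hello = 777, hola = 99)  # made-up frequency counts
p <- discard_prob(counts, corpus_size = sum(counts), threshold = 1e-3)
print(round(p, 3))
```

The more frequent a token is relative to the threshold, the closer its discard probability gets to 1; tokens whose relative frequency is at or below the threshold are always kept.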

corpus_size

(Optional) 'int', 'float', or scalar 'Tensor' specifying the total number of tokens in the corpus (e.g., the sum of all frequency counts in 'vocab_freq_file'). Used with 'vocab_subsampling' to down-sample frequently occurring tokens. If this is specified, 'vocab_freq_file' and 'vocab_subsampling' must also be specified. If 'corpus_size' is needed but not supplied, it is calculated from 'vocab_freq_file'. Supplying your own value is useful when you have already removed infrequent tokens (frequency < 'vocab_min_count') from the vocabulary file to save memory in the internal token lookup table, since entries for tokens that are never used would otherwise waste memory. In that case the counts remaining in the file no longer add up to the full corpus size. A user-supplied 'corpus_size' must be greater than or equal to the sum of all frequency counts in 'vocab_freq_file'.
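
To make the relationship between 'vocab_min_count', 'corpus_size', and the vocabulary file concrete, here is a small base-R sketch. It does not call tfaddons; the vocabulary rows, the column names, and the min_count value are made up for illustration.

```r
# Sketch only: base R, independent of tfaddons. The vocabulary rows are made up.
vocab <- read.csv(
  text = "bonjour,fr,42\nhello,en,777\nhola,es,99",
  header = FALSE,
  col.names = c("token", "lang", "freq"),
  stringsAsFactors = FALSE
)

# Total token count over the full vocabulary: a valid 'corpus_size'.
corpus_size <- sum(vocab$freq)

# Pruning tokens below a minimum count shrinks the file but not the corpus.
min_count <- 50
pruned <- vocab[vocab$freq >= min_count, ]

# The pruned counts sum to less than the true corpus size, which is why a
# user-supplied 'corpus_size' must be >= the sum of the file's counts.
stopifnot(corpus_size >= sum(pruned$freq))
```

If the pruned rows were written out as the vocabulary file, passing the original total as 'corpus_size' would keep the subsampling frequencies relative to the whole corpus.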

min_skips

'int' or scalar 'Tensor' specifying the minimum window size to randomly use for each token. Must be >= 0 and <= 'max_skips'. If 'min_skips' and 'max_skips' are both 0, the window contains only the token itself, so the only possible label is the token itself, and it is emitted only when 'emit_self_as_target' is TRUE.

max_skips

'int' or scalar 'Tensor' specifying the maximum window size to randomly use for each token. Must be >= 0.
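
The role of 'min_skips' and 'max_skips' can be sketched in plain R. The code below imitates the idea of drawing a window size per token and pairing the token with its neighbours. It is not the TensorFlow op: skip_gram_pairs is a hypothetical helper, and the uniform draw and symmetric window are assumptions made for illustration.

```r
# Sketch only: a plain-R imitation of skip-gram candidate generation, not the
# TensorFlow op. skip_gram_pairs() is a hypothetical helper.
skip_gram_pairs <- function(tokens, min_skips = 1, max_skips = 5,
                            emit_self_as_target = FALSE) {
  stopifnot(min_skips >= 0, min_skips <= max_skips)
  n <- length(tokens)
  out <- vector("list", n)
  for (i in seq_len(n)) {
    # Draw this token's window size uniformly from min_skips..max_skips.
    width <- min_skips + sample.int(max_skips - min_skips + 1, 1) - 1
    # Neighbours within the window, clipped to the ends of the input.
    idx <- seq(max(1, i - width), min(n, i + width))
    if (!emit_self_as_target) idx <- setdiff(idx, i)
    out[[i]] <- data.frame(token = rep(tokens[i], length(idx)),
                           label = tokens[idx],
                           stringsAsFactors = FALSE)
  }
  do.call(rbind, out)
}

set.seed(1)
pairs <- skip_gram_pairs(c("the", "quick", "brown", "fox"), 1, 2)
print(pairs)
```

Because the window size is random, the number of (token, label) pairs varies from run to run unless 'min_skips' equals 'max_skips'.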

start

'int' or scalar 'Tensor' specifying the position in 'input_tensor' from which to start generating skip-gram candidates.

limit

'int' or scalar 'Tensor' specifying the maximum number of elements in 'input_tensor' to use in generating skip-gram candidates. -1 means to use the rest of the 'Tensor' after 'start'.
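
The effect of 'start' and 'limit' can be pictured as taking a slice of the input before any candidates are generated. The base-R sketch below assumes that 'start' is a zero-based position (suggested by its default of 0); select_range is a hypothetical helper, not part of tfaddons.

```r
# Sketch only: how a zero-based 'start' and a 'limit' of -1 select a slice.
# select_range() is a hypothetical helper; R vectors are one-based.
select_range <- function(x, start = 0, limit = -1) {
  first <- start + 1                   # convert zero-based start to an R index
  last <- if (limit < 0) length(x) else min(length(x), start + limit)
  if (first > last) x[0] else x[first:last]
}

tokens <- c("the", "quick", "brown", "fox", "jumps")
print(select_range(tokens))                        # limit = -1: everything
print(select_range(tokens, start = 1, limit = 3))  # skip one, keep three
```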

emit_self_as_target

'bool' or scalar 'Tensor' specifying whether to emit each token as a label for itself.

batch_size

(Optional) 'int' specifying batch size of returned 'Tensors'.

batch_capacity

(Optional) 'int' specifying batch capacity for the queue used for batching returned 'Tensors'. Only has an effect if 'batch_size' > 0. Defaults to 100 * 'batch_size' if not specified.

seed

(Optional) 'int' used to create a random seed for window size and subsampling. See 'set_random_seed' for behavior.

name

(Optional) A 'string' name or a name scope for the operations.

Details

Wrapper around 'skip_gram_sample()' for use with a text vocabulary file. The vocabulary file is expected to be a plain-text file, with lines of 'vocab_delimiter'-separated columns. The 'vocab_token_index' column should contain the vocabulary term, while the 'vocab_freq_index' column should contain the number of times that term occurs in the corpus.

For example, with a text vocabulary file of:

    bonjour,fr,42
    hello,en,777
    hola,es,99

you should set 'vocab_delimiter=","', 'vocab_token_index=0', and 'vocab_freq_index=2'.

See the 'skip_gram_sample()' documentation for more details about the skip-gram sampling process.
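
The example vocabulary file above can be reproduced and checked in base R before it is handed to the function. In the sketch below the temporary file path and the input tokens are made up. The commented-out call at the end only mirrors the Usage signature; it is untested here and assumes that tfaddons and a working TensorFlow installation are available.

```r
# Sketch only. The file contents come from the example above; the temporary
# path and the input tokens are made up for illustration.
vocab_path <- tempfile(fileext = ".csv")
writeLines(c("bonjour,fr,42", "hello,en,777", "hola,es,99"), vocab_path)

# Column indices are zero-based for the function but one-based in R.
vocab_token_index <- 0
vocab_freq_index <- 2
vocab <- read.csv(vocab_path, header = FALSE, stringsAsFactors = FALSE)
tokens <- vocab[[vocab_token_index + 1]]
freqs <- vocab[[vocab_freq_index + 1]]
print(data.frame(token = tokens, freq = freqs))

## Not run: requires tfaddons and a working TensorFlow installation.
# library(tensorflow)
# library(tfaddons)
# result <- skip_gram_sample_with_text_vocab(
#   input_tensor = tf$constant(c("hello", "hola", "bonjour", "hello")),
#   vocab_freq_file = vocab_path,
#   vocab_token_index = 0,
#   vocab_freq_index = 2,
#   vocab_delimiter = ","
# )
## End(Not run)
```

Reading the file back this way is a quick check that the chosen indices point at the token and frequency columns, and that they are not the same column.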

Value

A 'list' containing (token, label) 'Tensors'. Each output 'Tensor' is of rank-1 and has the same type as 'input_tensor'. The 'Tensors' will be of length 'batch_size'; if 'batch_size' is not specified, they will be of random length, though they will be in sync with each other as long as they are evaluated together.

Raises

ValueError: If 'vocab_token_index' or 'vocab_freq_index' is less than 0 or exceeds the number of columns in 'vocab_freq_file', if 'vocab_token_index' and 'vocab_freq_index' are both set to the same column, or if any token in 'vocab_freq_file' has a negative frequency.


[Package tfaddons version 0.10.0 Index]