attention_bahdanau_monotonic {tfaddons}    R Documentation

Bahdanau Monotonic Attention

Description

Monotonic attention mechanism with Bahdanau-style energy function.

Usage

attention_bahdanau_monotonic(
  object,
  units,
  memory = NULL,
  memory_sequence_length = NULL,
  normalize = FALSE,
  sigmoid_noise = 0,
  sigmoid_noise_seed = NULL,
  score_bias_init = 0,
  mode = "parallel",
  kernel_initializer = "glorot_uniform",
  dtype = NULL,
  name = "BahdanauMonotonicAttention",
  ...
)

Arguments

object

Model or layer object

units

The depth of the query mechanism.

memory

The memory to query; usually the output of an RNN encoder. This tensor should be shaped [batch_size, max_time, ...].

memory_sequence_length

(optional) Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths.

normalize

Logical. Whether to normalize the energy term.

sigmoid_noise

Standard deviation of pre-sigmoid noise. See the docstring for '_monotonic_probability_fn' for more information.

sigmoid_noise_seed

(optional) Random seed for pre-sigmoid noise.

score_bias_init

Initial value for score bias scalar. It's recommended to initialize this to a negative value when the length of the memory is large.

mode

How to compute the attention distribution. Must be one of 'recursive', 'parallel', or 'hard'. See the docstring for tfa.seq2seq.monotonic_attention for more information.

kernel_initializer

(optional) The name of the initializer for the attention kernel.

dtype

The data type for the query and memory layers of the attention mechanism.

name

Name to use when creating ops.

...

A list that contains other common arguments for layer creation.
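
How score_bias_init, sigmoid_noise, sigmoid_noise_seed, and mode interact can be sketched in base R. This is an illustration only, not the tfaddons implementation: choose_probability() is a hypothetical helper, and the sketch assumes that the bias is added to the score first, that Gaussian noise scaled by sigmoid_noise is added next, and that 'hard' mode thresholds the result at zero while the other modes apply a sigmoid.

```r
# Hypothetical helper (not part of tfaddons): turns raw attention scores into
# per-position "choose" probabilities under the assumptions stated above.
choose_probability <- function(score, score_bias = 0, sigmoid_noise = 0,
                               seed = NULL, mode = "parallel") {
  # The scalar bias is applied after the score function, before the sigmoid.
  score <- score + score_bias
  if (sigmoid_noise > 0) {
    # Optional pre-sigmoid noise; seeding makes the draw reproducible.
    if (!is.null(seed)) set.seed(seed)
    score <- score + sigmoid_noise * rnorm(length(score))
  }
  if (mode == "hard") {
    as.numeric(score > 0)      # 0/1 decisions
  } else {
    1 / (1 + exp(-score))      # smooth probabilities
  }
}

scores <- c(-1, 0, 2)
p_soft <- choose_probability(scores)                   # plain sigmoid
p_bias <- choose_probability(scores, score_bias = -2)  # negative bias lowers every value
p_hard <- choose_probability(scores, mode = "hard")    # 0 0 1
```

In this sketch a negative bias lowers every probability, so the mechanism is less likely to stop at early memory positions.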

Details

This type of attention enforces a monotonic constraint on the attention distributions; that is, once the model attends to a given point in the memory, it can't attend to any prior points at subsequent output timesteps. It achieves this by using the _monotonic_probability_fn instead of softmax to construct its attention distributions. Since the attention scores are passed through a sigmoid, a learnable scalar bias parameter is applied after the score function and before the sigmoid. Otherwise, it is equivalent to BahdanauAttention. This approach is proposed in:

Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, Douglas Eck, "Online and Linear-Time Attention by Enforcing Monotonic Alignments." ICML 2017. https://arxiv.org/abs/1704.00784
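
The monotonic constraint can be illustrated with a minimal base R sketch of the recurrence from the cited paper, for a single sequence and one output timestep. monotonic_attention_step() is a hypothetical helper written for this page, not a tfaddons function; it assumes p_choose already holds the sigmoid outputs for each memory position and that previous_attention is the attention distribution from the preceding output timestep.

```r
# Hypothetical helper (not part of tfaddons): one output timestep of the
# monotonic attention recurrence for a single sequence.
monotonic_attention_step <- function(p_choose, previous_attention) {
  n <- length(p_choose)
  q <- numeric(n)
  carry <- 0
  for (j in seq_len(n)) {
    # Probability of not having stopped at position j - 1 (1 for j = 1).
    keep_going <- if (j == 1) 1 else 1 - p_choose[j - 1]
    # Mass that reaches position j: carried over from j - 1, plus mass that
    # was already at j on the previous output timestep.
    carry <- keep_going * carry + previous_attention[j]
    q[j] <- carry
  }
  p_choose * q
}

p_choose <- rep(0.5, 4)

# Start with all attention on the first memory position.
step1 <- monotonic_attention_step(p_choose, c(1, 0, 0, 0))
# 0.5 0.25 0.125 0.0625

# If the previous step attended to position 3, earlier positions get no mass.
late <- monotonic_attention_step(p_choose, c(0, 0, 1, 0))
# 0 0 0.5 0.25
```

With every probability at 0.5, the attention mass decays geometrically along the memory, which is one way to see why a negative score_bias_init is recommended for long memories. Unlike softmax, the resulting weights need not sum to 1.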

Value

None


[Package tfaddons version 0.10.0 Index]