attention_bahdanau_monotonic {tfaddons} | R Documentation |
Bahdanau Monotonic Attention
Description
Monotonic attention mechanism with Bahdanau-style energy function.
Usage
attention_bahdanau_monotonic(
object,
units,
memory = NULL,
memory_sequence_length = NULL,
normalize = FALSE,
sigmoid_noise = 0,
sigmoid_noise_seed = NULL,
score_bias_init = 0,
mode = "parallel",
kernel_initializer = "glorot_uniform",
dtype = NULL,
name = "BahdanauMonotonicAttention",
...
)
Arguments
object |
Model or layer object |
units |
The depth of the query mechanism. |
memory |
The memory to query; usually the output of an RNN encoder. This tensor should be shaped [batch_size, max_time, ...]. |
memory_sequence_length |
(optional) Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths. |
normalize |
Logical. Whether to normalize the energy term. |
sigmoid_noise |
Standard deviation of pre-sigmoid noise. See the docstring for '_monotonic_probability_fn' for more information. |
sigmoid_noise_seed |
(optional) Random seed for pre-sigmoid noise. |
score_bias_init |
Initial value for score bias scalar. It's recommended to initialize this to a negative value when the length of the memory is large. |
mode |
How to compute the attention distribution. Must be one of 'recursive', 'parallel', or 'hard'. See the docstring for tfa.seq2seq.monotonic_attention for more information. |
kernel_initializer |
(optional) The name of the initializer for the attention kernel. |
dtype |
The data type for the query and memory layers of the attention mechanism. |
name |
Name to use when creating ops. |
... |
A list that contains other common arguments for layer creation. |
Details
This type of attention enforces a monotonic constraint on the attention distributions; that is, once the model attends to a given point in the memory, it can't attend to any prior points at subsequent output timesteps. It achieves this by using the _monotonic_probability_fn instead of softmax to construct its attention distributions. Since the attention scores are passed through a sigmoid, a learnable scalar bias parameter is applied after the score function and before the sigmoid. Otherwise, it is equivalent to BahdanauAttention. This approach is proposed in:
Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, Douglas Eck, "Online and Linear-Time Attention by Enforcing Monotonic Alignments." ICML 2017. https://arxiv.org/abs/1704.00784
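The monotonic constraint described above can be illustrated with a minimal base R sketch of one decoder step, following the recursive formulation in the cited paper. This is not the tfaddons implementation: the function names monotonic_step and sigmoid and the argument score_bias are hypothetical, the sketch handles a single sequence rather than a batch, and it omits the pre-sigmoid noise.

```r
# Illustrative sketch only; names here are hypothetical, not part of tfaddons.
sigmoid <- function(x) 1 / (1 + exp(-x))

# One decoder step of monotonic attention (recursive form).
#   scores:             energies for each memory position (numeric vector)
#   previous_attention: attention distribution from the previous output step
#   score_bias:         scalar bias added to the scores before the sigmoid
monotonic_step <- function(scores, previous_attention, score_bias = 0) {
  # Probability of "choosing" each memory position, via sigmoid, not softmax.
  p_choose <- sigmoid(scores + score_bias)
  n <- length(p_choose)

  # q[j]: probability that position j is reached at this step, given that
  # attention can only start from where it was at the previous step and
  # move forward past positions that were not chosen.
  q <- numeric(n)
  q[1] <- previous_attention[1]
  for (j in seq_len(n)[-1]) {
    q[j] <- (1 - p_choose[j - 1]) * q[j - 1] + previous_attention[j]
  }

  p_choose * q
}

# Start with all attention on the first memory position.
monotonic_step(c(0, 0, 0), c(1, 0, 0))
# 0.5 0.25 0.125

# If the previous step attended only to position 2, position 1 gets zero weight.
monotonic_step(c(2, -1, 0.5), c(0, 1, 0))
```

Note that the resulting weights need not sum to 1; the remaining mass corresponds to not attending to any memory position. A negative score_bias lowers every choosing probability, which is consistent with the recommendation for score_bias_init when the memory is long.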
Value
None