nn_multihead_attention {torch}    R Documentation

Multi-head attention

Description

Allows the model to jointly attend to information from different representation subspaces, as described in the paper "Attention Is All You Need".

Usage

nn_multihead_attention(
  embed_dim,
  num_heads,
  dropout = 0,
  bias = TRUE,
  add_bias_kv = FALSE,
  add_zero_attn = FALSE,
  kdim = NULL,
  vdim = NULL,
  batch_first = FALSE
)

Arguments

embed_dim

total dimension of the model.

num_heads

number of parallel attention heads. Note that embed_dim is split across num_heads, i.e. each head has dimension embed_dim %/% num_heads.

dropout

dropout probability applied to attn_output_weights. Default: 0.

bias

whether to add bias as a module parameter. Default: TRUE.

add_bias_kv

whether to add bias to the key and value sequences at dim=0. Default: FALSE.

add_zero_attn

whether to add a new batch of zeros to the key and value sequences at dim=1. Default: FALSE.

kdim

total number of features in key. Default: NULL.

vdim

total number of features in value. Default: NULL. Note: if kdim and vdim are NULL, they will be set to embed_dim such that query, key, and value have the same number of features.

batch_first

if TRUE, the input and output tensors are (N, S, E) instead of (S, N, E), where N is the batch size, S is the sequence length, and E is the embedding dimension. Default: FALSE.
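The interaction between embed_dim, num_heads and batch_first can be illustrated with a small sketch. The sizes below (batch of 3, query length 5, key/value length 7) are arbitrary assumptions chosen for illustration, and the query length is written L here to distinguish it from the key/value length S. The output is expected to follow the query layout, but it is worth checking the shapes on your own installation.

```r
library(torch)

embed_dim <- 8
num_heads <- 2

# Each head works on embed_dim %/% num_heads features, so embed_dim
# should be divisible by num_heads.
stopifnot(embed_dim %% num_heads == 0)
head_dim <- embed_dim %/% num_heads

# Illustrative sizes (assumptions): N = 3, L = 5, S = 7.
attn <- nn_multihead_attention(embed_dim, num_heads, batch_first = TRUE)
query <- torch_randn(3, 5, embed_dim)  # (N, L, E)
key   <- torch_randn(3, 7, embed_dim)  # (N, S, E)
value <- torch_randn(3, 7, embed_dim)  # (N, S, E)

out <- attn(query, key, value)
out_shape <- out[[1]]$shape      # shape of attn_output
weights_shape <- out[[2]]$shape  # shape of attn_output_weights
print(out_shape)
print(weights_shape)
```

With batch_first = FALSE (the default), the same tensors would instead be laid out with the sequence dimension first.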

Details

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O, \quad \mbox{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
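The formula can be traced step by step in base R for a single unbatched sequence. This is a minimal sketch for intuition only, not the implementation used by the module: it assumes Attention is the scaled dot-product attention from the referenced paper, omits biases, masks and dropout, and uses randomly initialised weights. The helper names (softmax_rows, attention, multihead, rand_matrix) are hypothetical.

```r
# Row-wise softmax; the row maximum is subtracted for numerical stability.
softmax_rows <- function(x) {
  e <- exp(x - apply(x, 1, max))
  e / rowSums(e)
}

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
attention <- function(Q, K, V) {
  scores <- Q %*% t(K) / sqrt(ncol(K))
  softmax_rows(scores) %*% V
}

# head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); heads are concatenated
# column-wise and projected with W^O.
multihead <- function(Q, K, V, W_q, W_k, W_v, W_o) {
  heads <- lapply(seq_along(W_q), function(i) {
    attention(Q %*% W_q[[i]], K %*% W_k[[i]], V %*% W_v[[i]])
  })
  do.call(cbind, heads) %*% W_o
}

set.seed(1)
embed_dim <- 8
num_heads <- 2
head_dim <- embed_dim %/% num_heads

rand_matrix <- function(nr, nc) matrix(rnorm(nr * nc), nr, nc)

# One projection matrix per head for queries, keys and values.
W_q <- replicate(num_heads, rand_matrix(embed_dim, head_dim), simplify = FALSE)
W_k <- replicate(num_heads, rand_matrix(embed_dim, head_dim), simplify = FALSE)
W_v <- replicate(num_heads, rand_matrix(embed_dim, head_dim), simplify = FALSE)
W_o <- rand_matrix(embed_dim, embed_dim)

Q <- rand_matrix(5, embed_dim)  # 5 query positions
K <- rand_matrix(7, embed_dim)  # 7 key positions
V <- rand_matrix(7, embed_dim)  # 7 value positions

result <- multihead(Q, K, V, W_q, W_k, W_v, W_o)
print(dim(result))  # one row per query position, embed_dim columns
```

Each head sees only head_dim columns after projection, which is why embed_dim is split across num_heads.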

Shape

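Inputs:

query: (L, N, E), where L is the target sequence length, N is the batch size, and E is the embedding dimension; (N, L, E) if batch_first = TRUE.

key: (S, N, E), where S is the source sequence length; (N, S, E) if batch_first = TRUE.

value: (S, N, E); (N, S, E) if batch_first = TRUE.

Outputs:

attn_output: (L, N, E); (N, L, E) if batch_first = TRUE.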

Examples

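if (torch_is_installed()) {
  embed_dim <- 8
  num_heads <- 2

  # Default layout (batch_first = FALSE): (sequence, batch, embedding).
  query <- torch_randn(5, 3, embed_dim)
  key <- torch_randn(7, 3, embed_dim)
  value <- torch_randn(7, 3, embed_dim)

  multihead_attn <- nn_multihead_attention(embed_dim, num_heads)

  # The module returns a list: the attention output and the attention weights.
  out <- multihead_attn(query, key, value)
  attn_output <- out[[1]]
  attn_output_weights <- out[[2]]
}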

[Package torch version 0.12.0 Index]