modelzoo.common.pytorch.layers.MultiheadAttention

import path: modelzoo.common.pytorch.layers.MultiheadAttention

MultiheadAttention (embed_dim, num_heads, dropout=0.0, bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, batch_first=False, device=None, dtype=None)

  • embed_dim – Total dimension of the model.

  • num_heads – Number of parallel attention heads. Note that embed_dim will be split across num_heads (i.e. each head will have dimension embed_dim // num_heads).

  • dropout – Dropout probability on attn_output_weights. Default: 0.0 (no dropout).

  • batch_first – If True, the input and output tensors are provided as (batch, seq, feature); otherwise as (seq, batch, feature). Default: True. Only batch_first=True is currently supported.

  • bias – If specified, adds bias to the input and output projection layers. Default: True. Replaced by use_projection_bias and use_ffn_bias.

  • add_bias_kv – If specified, adds bias to the key and value sequences at dim=0. Default: False.

  • add_zero_attn – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False.

  • kdim – Total number of features for keys. Default: None (uses kdim=embed_dim).

  • vdim – Total number of features for values. Default: None (uses vdim=embed_dim).

  • use_projection_bias – Whether to use bias in the key, query, and value projections.

  • use_ffn_bias – Whether to use bias in the output projection.

  • attention_initializer – Projection kernel initializer. Defaults to xavier_uniform.

  • output_layer_initializer – If not None, use this initializer for the output transform layer. Defaults to None.

  • attention_type – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.

  • device – The device to use for the model's parameters.
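
As a rough illustration of how num_heads and attention_type relate, the following plain-Python sketch computes the per-head dimension and scores a single query vector against a single key vector. It is not the modelzoo implementation: head_dim and attention_score are hypothetical helper names, the divisibility check is an assumption, and dividing by the square root of the per-head dimension is the standard scaled dot-product definition, assumed here to be what scaled_dot_product refers to.

```python
import math


def head_dim(embed_dim: int, num_heads: int) -> int:
    """Return the per-head dimension, embed_dim // num_heads."""
    # Assumption: embed_dim must divide evenly across the heads.
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads


def attention_score(q, k, attention_type="scaled_dot_product"):
    """Score one query vector against one key vector within a single head."""
    dot = sum(a * b for a, b in zip(q, k))
    if attention_type == "dot_product":
        return dot
    if attention_type == "scaled_dot_product":
        # Standard scaling by the square root of the per-head dimension.
        return dot / math.sqrt(len(q))
    raise ValueError(f"unsupported attention_type: {attention_type}")


print(head_dim(512, 8))  # 64
q, k = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]
print(attention_score(q, k, "dot_product"))  # 10.0
print(attention_score(q, k))  # 5.0
```

With embed_dim=512 and num_heads=8, each head works on 64 features. The two attention types differ only in the scaling factor applied to the raw dot product.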

forward (query, key, value, key_padding_mask=None, need_weights=True, attn_mask=None, average_attn_weights=True, position_bias=None, rotary_position_embedding_helper=None)

  • q (Tensor): Queries, shape [batch_size, seq_length, embed_dim].

  • k (Tensor): Keys, shape [batch_size, seq_length, embed_dim].

  • v (Tensor): Values, shape [batch_size, seq_length, embed_dim].

  • attn_mask (Tensor): Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch_size, query_length, seq_length].

  • key_padding_mask (Tensor): If specified, a mask of shape (batch_size, seq_length) indicating which elements within key to ignore for the purpose of attention (i.e. treat as “padding”). Defaults to None.

  • need_weights (bool): If True, returns attn_output_weights in addition to attn_outputs. Default: False.

  • average_attn_weights (bool): If True, indicates that the returned attn_weights should be averaged across heads. Otherwise, attn_weights are provided separately per head. Note that this flag only has an effect when need_weights=True. Default: True (i.e. average weights across heads).

  • position_bias (Tensor): Tensor containing position bias to apply in attention.

  • rotary_position_embedding_helper (RotaryPositionEmbeddingHelper): Helper to create rotary embeddings according to the paper “RoFormer: Enhanced Transformer with Rotary Position Embedding”.
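
The effect of key_padding_mask and average_attn_weights can be sketched in plain Python for a single query position. This is a minimal sketch under stated assumptions, not the layer's actual code: the helper names are hypothetical, a True entry in key_padding_mask is assumed to mark a key to ignore, and scores use the standard scaled dot product.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, key_padding_mask=None):
    """Weights of one query over all keys, for a single head."""
    if key_padding_mask is not None and all(key_padding_mask):
        raise ValueError("at least one key must be unmasked")
    scale = math.sqrt(len(query))
    scores = []
    for i, key in enumerate(keys):
        if key_padding_mask is not None and key_padding_mask[i]:
            # Assumption: True marks padding, so the key gets zero weight.
            scores.append(float("-inf"))
        else:
            scores.append(sum(a * b for a, b in zip(query, key)) / scale)
    return softmax(scores)


def average_over_heads(per_head_weights):
    """Element-wise mean of the per-head weight lists."""
    num_heads = len(per_head_weights)
    return [sum(col) / num_heads for col in zip(*per_head_weights)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
head_0 = attention_weights(query, keys, key_padding_mask=[False, False, True])
print(head_0)  # [0.5, 0.5, 0.0]

head_1 = [1.0, 0.0, 0.0]  # hypothetical weights from a second head
print(average_over_heads([head_0, head_1]))  # [0.75, 0.25, 0.0]
```

The masked third key receives zero weight even though its raw score would be the largest. Averaging collapses the per-head weights into one list per query position.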