.. _pytorch-ops-torch.nn.multihead-attention:

modelzoo.common.pytorch.layers.MultiheadAttention
=================================================

import path: ``modelzoo.common.pytorch.layers.MultiheadAttention``

**MultiheadAttention** (embed_dim, num_heads, dropout=0.0, bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, batch_first=False, device=None, dtype=None)

- **embed_dim** – Total dimension of the model.
- **num_heads** – Number of parallel attention heads. Note that ``embed_dim`` is split across ``num_heads`` (i.e., each head has dimension ``embed_dim // num_heads``).
- **dropout** – Dropout probability on ``attn_output_weights``. Default: ``0.0`` (no dropout).
- **batch_first** – If ``True``, the input and output tensors are provided as (batch, seq, feature); if ``False``, as (seq, batch, feature). Default: ``True``. Only ``batch_first=True`` is currently supported.
- **bias** – If specified, adds bias to the input/output projection layers. Default: ``True``. Replaced by ``use_projection_bias`` and ``use_ffn_bias``.
- **add_bias_kv** – If specified, adds bias to the key and value sequences at ``dim=0``. Default: ``False``.
- **add_zero_attn** – If specified, adds a new batch of zeros to the key and value sequences at ``dim=1``. Default: ``False``.
- **kdim** – Total number of features for keys. Default: ``None`` (uses ``kdim=embed_dim``).
- **vdim** – Total number of features for values. Default: ``None`` (uses ``vdim=embed_dim``).
- **use_projection_bias** – Whether to use bias in the key, query, and value projections.
- **use_ffn_bias** – Whether to use bias in the output projection.
- **attention_initializer** – Projection kernel initializer. Defaults to ``xavier_uniform``.
- **output_layer_initializer** – If not ``None``, use this initializer for the output transform layer. Defaults to ``None``.
- **attention_type** – The attention variant to execute. Currently accepts ``dot_product`` and ``scaled_dot_product``. Defaults to ``scaled_dot_product``.
- **device** – The device to use for the model's parameters.

**forward** (query, key, value, key_padding_mask=None, need_weights=True, attn_mask=None, average_attn_weights=True, position_bias=None, rotary_position_embedding_helper=None)

- **query** (Tensor): Queries, shape ``[batch_size, seq_length, embed_dim]``.
- **key** (Tensor): Keys, shape ``[batch_size, seq_length, embed_dim]``.
- **value** (Tensor): Values, shape ``[batch_size, seq_length, embed_dim]``.
- **attn_mask** (Tensor): Attention mask. Can be 2D of shape ``[batch_size, seq_length]`` or 3D of shape ``[batch_size, query_length, seq_length]``.
- **key_padding_mask** (Tensor): If specified, a mask of shape ``[batch_size, seq_length]`` indicating which elements within ``key`` to ignore for the purpose of attention (i.e., treat as "padding"). Defaults to ``None``.
- **need_weights** (bool): If specified, returns ``attn_output_weights`` in addition to ``attn_outputs``. Default: ``False``.
- **average_attn_weights** (bool): If ``True``, the returned ``attn_weights`` are averaged across heads. Otherwise, ``attn_weights`` are provided separately per head. This flag only has an effect when ``need_weights=True``. Default: ``True`` (i.e., average weights across heads).
- **position_bias** (Tensor): Tensor containing the position bias to apply in attention.
- **rotary_position_embedding_helper** (RotaryPositionEmbeddingHelper): Helper to create rotary embeddings according to the paper `RoFormer: Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/abs/2104.09864>`_.

A minimal usage sketch follows the parameter lists above.
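
The sketch below illustrates the documented constructor and ``forward`` signatures, assuming the layer behaves like a standard ``torch.nn.Module`` (i.e., calling the module dispatches to ``forward``) and that inputs follow the supported ``batch_first=True`` layout. The tensor sizes, the random inputs, and the all-zeros ``attn_mask`` are illustrative only.

.. code-block:: python

    import torch
    from modelzoo.common.pytorch.layers import MultiheadAttention

    batch_size, seq_length, embed_dim, num_heads = 2, 16, 64, 8

    # Each head receives embed_dim // num_heads = 8 features.
    attn = MultiheadAttention(embed_dim, num_heads, dropout=0.0)

    # Inputs are (batch, seq, feature), matching batch_first=True.
    query = torch.rand(batch_size, seq_length, embed_dim)
    key = torch.rand(batch_size, seq_length, embed_dim)
    value = torch.rand(batch_size, seq_length, embed_dim)

    # Optional 2D attention mask of shape [batch_size, seq_length];
    # zeros here leave all positions attendable (illustrative only).
    attn_mask = torch.zeros(batch_size, seq_length)

    output = attn(query, key, value, attn_mask=attn_mask)
    print(output.shape)  # expected: [batch_size, seq_length, embed_dim]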