.. _pytorch-ops-torch.nn.transformer-decoder-layer:

modelzoo.common.pytorch.layers.TransformerDecoderLayer
=======================================================

import path: ``modelzoo.common.pytorch.layers.TransformerDecoderLayer``

**TransformerDecoderLayer** (d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="gelu", layer_norm_eps=1e-05, batch_first=True, norm_first=False, device=None, add_cross_attention=True, attention_dropout_rate=None, attention_type="scaled_dot_product", use_projection_bias_in_attention=False, use_ffn_bias_in_attention=False, use_ffn_bias=False, attention_initializer="xavier_uniform", ffn_initializer="xavier_uniform"):

- **d_model**: the number of expected features in the input (required).
- **nhead**: the number of heads in the multihead attention models (required).
- **dim_feedforward**: the dimension of the feedforward network model (default=2048).
- **dropout**: the dropout value (default=0.1).
- **activation**: the activation function of the intermediate layer; can be a string (``"relu"`` or ``"gelu"``) or a unary callable. Default: ``"gelu"``.
- **layer_norm_eps**: the eps value in layer normalization components (``default=1e-5``).
- **batch_first**: if ``True``, the input and output tensors are provided as (``batch``, ``seq``, ``feature``); otherwise as (``seq``, ``batch``, ``feature``). Default: ``True``. Only ``batch_first=True`` is currently supported.
- **norm_first**: if ``True``, layer norm is applied prior to the attention and feedforward operations, respectively; otherwise it is applied after. Default: ``False`` (after).
- **attention_dropout_rate**: attention dropout rate. If ``None``, defaults to ``dropout``.
- **use_projection_bias_in_attention**: add bias to the Q, K, V projections in the attention layer. Defaults to ``False``.
- **device**: the device to use for the model's parameters.
- **add_cross_attention**: if ``True``, adds a cross-attention layer between encoder and decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to ``False``). Default: ``True``.
- **attention_type**: should be one of [``scaled_dot_product``, ``dot_product``].
- **use_ffn_bias_in_attention**: add bias in the concluding FFN in the attention layer. Defaults to ``False``.
- **use_ffn_bias**: add bias in all dense layers of the decoder's FFN sublayer. Defaults to ``False``.
- **attention_initializer**: attention layer initializer. Defaults to ``xavier_uniform``.
- **ffn_initializer**: FFN layer initializer. Defaults to ``xavier_uniform``.

**forward** (tgt=None, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, rotary_position_embedding_helper=None):

- **tgt**: the sequence to the decoder layer (required). Shape ``[batch_size, tgt_seq_length, embed_dim]``.
- **memory**: the sequence from the last layer of the encoder (required). Shape ``[batch_size, memory_length, embed_dim]``.
- **tgt_mask**: the mask for the ``tgt`` sequence (optional). Shape ``[tgt_seq_length, tgt_seq_length]``.
- **memory_mask**: the mask for the memory sequence (optional). Shape ``[memory_length, src_seq_length]``.
- **tgt_key_padding_mask**: the mask for the ``tgt`` keys per batch (optional). Shape ``[batch_size, tgt_seq_length]``.
- **memory_key_padding_mask**: the mask for the memory keys per batch (optional). Shape ``[batch_size, memory_length]``.
- **rotary_position_embedding_helper** (``RotaryPositionEmbeddingHelper``): helper to create rotary embeddings according to the paper `RoFormer: Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/abs/2104.09864>`_ (optional).
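
The following is a minimal usage sketch, not taken from the ModelZoo source: it constructs the layer with a few of the arguments documented above and runs one forward pass with shapes matching the parameter descriptions (``batch_first=True``). The additive-float causal mask convention shown for ``tgt_mask`` is an assumption.

.. code-block:: python

    import torch
    from modelzoo.common.pytorch.layers import TransformerDecoderLayer

    batch_size, tgt_seq_length, memory_length, embed_dim = 2, 16, 32, 512

    decoder_layer = TransformerDecoderLayer(
        d_model=embed_dim,
        nhead=8,
        dim_feedforward=2048,
        dropout=0.1,
        activation="gelu",
        norm_first=True,
        add_cross_attention=True,  # set to False for GPT-style, decoder-only models
    )

    tgt = torch.rand(batch_size, tgt_seq_length, embed_dim)    # decoder input
    memory = torch.rand(batch_size, memory_length, embed_dim)  # encoder output

    # Additive causal mask (assumed convention): -inf above the diagonal so that
    # position i can only attend to positions <= i.
    tgt_mask = torch.triu(
        torch.full((tgt_seq_length, tgt_seq_length), float("-inf")), diagonal=1
    )

    out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)
    # out shape: [batch_size, tgt_seq_length, embed_dim]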