.. _pytorch-ops-torch.nn.transformer-decoder-layer:

modelzoo.common.pytorch.layers.TransformerDecoderLayer
=======================================================

import path: ``modelzoo.common.pytorch.layers.TransformerDecoderLayer``

**TransformerDecoderLayer** (d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="gelu", layer_norm_eps=1e-05, batch_first=True, norm_first=False, device=None, add_cross_attention=True, attention_dropout_rate=None, attention_type="scaled_dot_product", use_projection_bias_in_attention=False, use_ffn_bias_in_attention=False, use_ffn_bias=False, attention_initializer="xavier_uniform", ffn_initializer="xavier_uniform"):

- **d_model**: the number of expected features in the input (required).
- **nhead**: the number of heads in the multihead attention models (required).
- **dim_feedforward**: the dimension of the feedforward network model (default=2048).
- **dropout**: the dropout value (default=0.1).
- **activation**: the activation function of the intermediate layer; can be a string (``"relu"`` or ``"gelu"``) or a unary callable. Default: ``"gelu"``.
- **layer_norm_eps**: the eps value in layer normalization components (``default=1e-5``).
- **batch_first**: if ``True``, the input and output tensors are provided as (``batch``, ``seq``, ``feature``); otherwise as (``seq``, ``batch``, ``feature``). Default: ``True``. Only ``batch_first=True`` is currently supported.
- **norm_first**: if ``True``, layer norm is applied prior to the attention and feedforward operations, respectively; otherwise it is applied after. Default: ``False`` (after).
- **attention_dropout_rate**: attention dropout rate. If ``None``, defaults to ``dropout``.
- **use_projection_bias_in_attention**: add bias to the Q, K, V projections in the attention layer. Defaults to ``False``.
- **device**: the device to use for the model's parameters.
- **add_cross_attention**: if ``True``, adds a cross-attention layer between encoder and decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to ``False``). Default: ``True``.
- **attention_type**: should be one of [``scaled_dot_product``, ``dot_product``].
- **use_ffn_bias_in_attention**: add bias in the concluding FFN in the attention layer. Defaults to ``False``.
- **use_ffn_bias**: add bias in all dense layers of the decoder's FFN sublayer. Defaults to ``False``.
- **attention_initializer**: attention layer initializer. Defaults to ``xavier_uniform``.
- **ffn_initializer**: FFN layer initializer. Defaults to ``xavier_uniform``.

**forward** (tgt=None, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, rotary_position_embedding_helper=None):

- **tgt**: the sequence to the decoder layer (required). Shape ``[batch_size, tgt_seq_length, embed_dim]``.
- **memory**: the sequence from the last layer of the encoder (required). Shape ``[batch_size, memory_length, embed_dim]``.
- **tgt_mask**: the mask for the ``tgt`` sequence (optional). Shape ``[tgt_seq_length, tgt_seq_length]``.
- **memory_mask**: the mask for the memory sequence (optional). Shape ``[memory_length, src_seq_length]``.
- **tgt_key_padding_mask**: the mask for the ``tgt`` keys per batch (optional). Shape ``[batch_size, tgt_seq_length]``.
- **memory_key_padding_mask**: the mask for the memory keys per batch (optional). Shape ``[batch_size, memory_length]``.
- **rotary_position_embedding_helper** (``RotaryPositionEmbeddingHelper``): helper to create rotary embeddings according to the paper `RoFormer: Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/abs/2104.09864>`_ (optional).
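
The following is a minimal usage sketch, not taken from the ModelZoo source: it constructs the layer with a few of the arguments documented above and runs one forward pass with shapes matching the parameter descriptions (``batch_first=True``). The additive-float causal mask convention shown for ``tgt_mask`` is an assumption.

.. code-block:: python

    import torch
    from modelzoo.common.pytorch.layers import TransformerDecoderLayer

    batch_size, tgt_seq_length, memory_length, embed_dim = 2, 16, 32, 512

    decoder_layer = TransformerDecoderLayer(
        d_model=embed_dim,
        nhead=8,
        dim_feedforward=2048,
        dropout=0.1,
        activation="gelu",
        norm_first=True,
        add_cross_attention=True,  # set to False for GPT-style, decoder-only models
    )

    tgt = torch.rand(batch_size, tgt_seq_length, embed_dim)    # decoder input
    memory = torch.rand(batch_size, memory_length, embed_dim)  # encoder output

    # Additive causal mask (assumed convention): -inf above the diagonal so that
    # position i can only attend to positions <= i.
    tgt_mask = torch.triu(
        torch.full((tgt_seq_length, tgt_seq_length), float("-inf")), diagonal=1
    )

    out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)
    # out shape: [batch_size, tgt_seq_length, embed_dim]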