modelzoo.common.pytorch.layers.TransformerDecoderLayer
import path: modelzoo.common.pytorch.layers.TransformerDecoderLayer
TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="gelu", layer_norm_eps=1e-05, batch_first=True, norm_first=False, device=None, add_cross_attention=True, attention_dropout_rate=None, attention_type="scaled_dot_product", use_projection_bias_in_attention=False, use_ffn_bias_in_attention=False, use_ffn_bias=False, attention_initializer="xavier_uniform", ffn_initializer="xavier_uniform"):
d_model: the number of expected features in the input (required).
nhead: the number of heads in the multihead attention models (required).
dim_feedforward: the dimension of the feedforward network model (default=2048).
dropout: the dropout value (default=0.1).
activation: the activation function of the intermediate layer; can be a string ("relu" or "gelu") or a unary callable. Default: "gelu".
layer_norm_eps: the eps value in layer normalization components (default=1e-5).
batch_first: If True, the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature). Only batch_first=True is currently supported.
norm_first: If True, layer norm is done prior to the attention and feedforward operations, respectively; otherwise it is done after. Default: False (after).
attention_dropout_rate: Attention dropout rate. If None, defaults to dropout.
use_projection_bias_in_attention: Add bias to the Q, K, V projections in the Attention layer. Defaults to False.
device: The device to use for model parameters.
add_cross_attention: If True, adds a cross-attention layer between encoder and decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to False).
attention_type: Should be one of ["scaled_dot_product", "dot_product"].
use_ffn_bias_in_attention: Add bias in the concluding FFN in the Attention layer. Defaults to False.
use_ffn_bias: Add bias in all dense layers of the decoder's FFN sublayer.
attention_initializer: Attention layer initializer. Defaults to "xavier_uniform".
ffn_initializer: FFN layer initializer. Defaults to "xavier_uniform".
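A minimal construction sketch, assuming the modelzoo package is importable via the path shown above; the numeric sizes are illustrative, and the keyword arguments mirror the signature and defaults documented in this parameter list.

```python
from modelzoo.common.pytorch.layers import TransformerDecoderLayer

# Illustrative sizes; only batch_first=True is supported.
decoder_layer = TransformerDecoderLayer(
    d_model=512,                 # expected feature size of the input
    nhead=8,                     # number of attention heads
    dim_feedforward=2048,        # FFN hidden dimension
    dropout=0.1,
    activation="gelu",
    batch_first=True,
    norm_first=False,
    add_cross_attention=True,    # set False for GPT-style decoder-only models
    attention_type="scaled_dot_product",
    attention_initializer="xavier_uniform",
    ffn_initializer="xavier_uniform",
)
```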
forward(tgt=None, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, rotary_position_embedding_helper=None):
tgt: the sequence to the decoder layer (required). Shape [batch_size, tgt_seq_length, embed_dim].
memory: the sequence from the last layer of the encoder (required). Shape [batch_size, memory_length, embed_dim].
tgt_mask: the mask for the tgt sequence (optional). Shape [tgt_seq_length, tgt_seq_length].
memory_mask: the mask for the memory sequence (optional). Shape [memory_length, src_seq_length].
tgt_key_padding_mask: the mask for the tgt keys per batch (optional). Shape [batch_size, tgt_seq_length].
memory_key_padding_mask: the mask for the memory keys per batch (optional). Shape [batch_size, memory_length].
rotary_position_embedding_helper (RotaryPositionEmbeddingHelper): Helper to create rotary embeddings according to the paper RoFormer: Enhanced Transformer with Rotary Position Embedding.
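A usage sketch of the forward call, assuming the decoder_layer constructed in the example above and that the mask conventions match standard PyTorch (an additive float tgt_mask and a boolean key-padding mask where True marks positions to ignore). Tensor shapes follow the batch-first descriptions in this list.

```python
import torch

batch_size, tgt_seq_length, memory_length, embed_dim = 2, 16, 32, 512

tgt = torch.rand(batch_size, tgt_seq_length, embed_dim)    # decoder input
memory = torch.rand(batch_size, memory_length, embed_dim)  # encoder output

# Causal (upper-triangular) additive mask so each target position only
# attends to itself and earlier target positions.
tgt_mask = torch.triu(
    torch.full((tgt_seq_length, tgt_seq_length), float("-inf")), diagonal=1
)

# No padded memory positions in this toy example.
memory_key_padding_mask = torch.zeros(batch_size, memory_length, dtype=torch.bool)

out = decoder_layer(
    tgt=tgt,
    memory=memory,
    tgt_mask=tgt_mask,
    memory_key_padding_mask=memory_key_padding_mask,
)
# out has shape [batch_size, tgt_seq_length, embed_dim]
```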