common.pytorch.layers package#


common.pytorch.layers.AlibiPositionEmbeddingLayer module#

class common.pytorch.layers.AlibiPositionEmbeddingLayer.AlibiPositionEmbeddingLayer#

Bases: torch.nn.Module

Alibi Position Embedding Layer, with alibi bias as in paper:

  • num_heads (int) – Number of attention heads.

  • slopes (Tensor) – Slope values to use for alibi heads. Shape: [num_heads, 1]. Defaults to None.

  • alibi_trainable_slopes (bool) – Whether the alibi slopes are trainable parameters.

  • bidirectional_relative_attention (bool) – Whether alibi attention is bidirectional.

  • slopes_initializer (str) – Initializer for the alibi slopes if they are trainable. Defaults to xavier_uniform.

  • alibi_implementation – Variant name for the alibi implementation. Currently accepts embedding and expand. Defaults to embedding.


Relative position bias, to be used in attention masking

Return type

position_bias (Tensor)

__init__(num_heads, slopes=None, alibi_trainable_slopes=False, bidirectional_relative_attention=False, slopes_initializer='xavier_uniform', alibi_implementation='embedding')#
forward(seq_length, key_length, past_kv=None)#
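
As a rough illustration of what an ALiBi position bias looks like, the sketch below builds per-head slopes and a bias of shape [num_heads, seq_length, key_length] in plain Python. It assumes the geometric slope schedule commonly used for ALiBi with a power-of-two head count, and a penalty proportional to the absolute query-key distance. The names alibi_slopes and alibi_bias are hypothetical helpers, not part of this module, and the layer's own embedding and expand implementations may differ.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (head + 1) for head in range(num_heads)]


def alibi_bias(num_heads, seq_length, key_length):
    """Return bias[head][query][key] = -slope * |query_pos - key_pos|."""
    # Assumption: queries sit at the end of the key sequence, which matches
    # decoding with cached keys (key_length >= seq_length).
    offset = key_length - seq_length
    return [
        [
            [-slope * abs(query + offset - key) for key in range(key_length)]
            for query in range(seq_length)
        ]
        for slope in alibi_slopes(num_heads)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(num_heads=8, seq_length=2, key_length=3)
```

The bias is added to the attention logits before the softmax, so more distant keys receive a larger penalty, and heads with smaller slopes attend further back.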

common.pytorch.layers.AttentionHelper module#

common.pytorch.layers.AttentionHelper.get_attention_module(attn_type: str, extra_params)#

This function retrieves the attention module according to attn_type and checks if provided extra_params is correctly related to the attention module

common.pytorch.layers.AttentionLayer module#

class common.pytorch.layers.AttentionLayer.MultiheadAttention#

Bases: torch.nn.Module

Multi-head attention layer. Adapted from:

  • embed_dim (int) – Number of input units in each projection output

  • num_heads (int) – Number of attention heads.

  • inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.

  • dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.

  • batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature), otherwise the format will be (seq, batch, feature). Default: True (batch, seq, feature).

  • add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.

  • add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False

  • kdim (int) – Number of output units in key projection

  • vdim (int) – Number of output units in value projection.

  • use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.

  • use_ffn_bias (bool) – Whether to use bias in the output projection.

  • attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.

  • attention_q_initializer – Query projection kernel initializer. If not specified, the query will be initialized via attention_initializer

  • output_layer_initializer (str or initializer) – If not None, use this initializer for the output transform layer. Defaults to None.

  • bias_initializer (str) – Bias initializer. Defaults to zeros.

  • attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.

  • softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.

  • attention_kernel (str | None) –

    Kernel to use. Defaults to None - compiler selects the kernel. See accepted values below.

    default - Default implementation. optimized_beta - Optimized implementation. Beta feature, support is limited.

__init__(embed_dim, num_heads, inner_dim=None, dropout=0.0, batch_first=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, use_projection_bias=None, use_ffn_bias=False, attention_initializer='xavier_uniform', attention_q_initializer=None, output_layer_initializer=None, bias_initializer='zeros', attention_type='scaled_dot_product', softmax_dtype_fp32=True, attention_kernel: Optional[str] = None, device=None)#
apply_attention_bias(logits, attention_bias)#
apply_position_bias(logits, position_bias)#
apply_rotary_position_embedding(vector, rotary_position_embedding_helper, real_seq_length, offset_length)#
calculate_attention_logits(q, k)#
calculate_attention_output(attention_scores, v)#
combine_masks(attn_mask_reshaped, key_padding_mask_reshaped)#
construct_key_vector(k, attn_mask=None, key_padding_mask=None)#
construct_present_kv(cache_present_kv, k, v)#
construct_query_vector(q, attn_mask=None, key_padding_mask=None)#
construct_value_vector(v, attn_mask=None, key_padding_mask=None)#
forward(q, k, v, attn_mask=None, key_padding_mask=None, need_weights=False, average_attn_weights=True, past_kv=None, cache_present_kv=False, past_kv_self_attn=True, position_bias=None, rotary_position_embedding_helper=None)#

Applies the attention mechanism to queries q, keys k and values v.

  • q (Tensor) – Queries, shape [batch_size, seq_length, embed_dim].

  • k (Tensor) – Keys, shape [batch_size, seq_length, embed_dim].

  • v (Tensor) – Values, shape [batch_size, seq_length, embed_dim].

  • attn_mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch, query_length, seq_length].

  • key_padding_mask (Tensor) – If specified, a mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e. treat as “padding”). Defaults to None.

  • need_weights (bool) – If specified, returns attn_output_weights in addition to attn_outputs. Default: False.

  • average_attn_weights (bool) – If true, indicates that the returned attn_weights should be averaged across heads. Otherwise, attn_weights are provided separately per head. Note that this flag only has an effect when need_weights=True. Default: True (i.e. average weights across heads)

  • past_kv (tuple(tensor, tensor)) – Past keys and values. Tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. The 0th and 1st tensor contain the past keys and values, respectively. Defaults to None.

  • cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.

  • past_kv_self_attn (bool) – Specifies whether the past keys and values should be used for self-attention (True) or cross-attention (False). Ignored if past_kv is not provided. Default: True.

  • position_bias (Tensor) – Tensor containing position bias to apply in attention.

  • rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.


If cache_present_kv is False, no entry for present keys and values is provided.
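
To make the difference between the dot_product and scaled_dot_product attention types concrete, here is a minimal single-head sketch in plain Python. It is not this layer's implementation: the function names are hypothetical, projections, dropout and multiple heads are omitted, and it assumes that a True entry in key_padding_mask means "ignore this key", following the PyTorch convention.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v, key_padding_mask=None, scaled=True):
    """q: [tgt_len][d], k: [src_len][d], v: [src_len][d_v] as nested lists."""
    head_dim = len(q[0])
    # scaled_dot_product divides the logits by sqrt(head_dim); dot_product does not.
    scale = 1.0 / math.sqrt(head_dim) if scaled else 1.0
    outputs, weights = [], []
    for q_row in q:
        logits = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        if key_padding_mask is not None:
            # Masked keys get -inf so their softmax weight is exactly zero.
            logits = [-math.inf if m else x for x, m in zip(logits, key_padding_mask)]
        probs = softmax(logits)
        weights.append(probs)
        outputs.append(
            [sum(p * v_row[c] for p, v_row in zip(probs, v)) for c in range(len(v[0]))]
        )
    return outputs, weights


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out_masked, w_masked = single_head_attention(q, k, v, key_padding_mask=[False, True])
out_plain, w_plain = single_head_attention(q, k, v)
```

The sketch assumes at least one key per query stays unmasked; a fully masked row would produce NaN weights.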

get_sequence_length(past_kv, real_seq_length)#
process_attention_mask(attn_mask, past_kv, q)#
process_key_padding_mask(key_padding_mask, attn_mask, past_kv, q)#
process_past_kv(past_kv, past_kv_self_attn, k, v)#
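
The past_kv and cache_present_kv arguments support autoregressive decoding, where each step only computes keys and values for the new token. The sketch below shows the general idea for the self-attention case using nested lists shaped [batch][num_heads][seq_length][head_dim]. The helper extend_kv_cache is hypothetical and only illustrates concatenation along the sequence axis; it is not a description of what process_past_kv or construct_present_kv do internally.

```python
def extend_kv_cache(past_kv, k, v):
    """Append new keys/values to the cached ones along the sequence axis."""
    if past_kv is None:
        return (k, v)
    past_k, past_v = past_kv
    # For each batch entry and head, list concatenation joins the
    # [seq_length][head_dim] rows of the cache with the new rows.
    new_k = [[ph + nh for ph, nh in zip(pb, nb)] for pb, nb in zip(past_k, k)]
    new_v = [[ph + nh for ph, nh in zip(pb, nb)] for pb, nb in zip(past_v, v)]
    return (new_k, new_v)


# Two decoding steps with batch_size=1, num_heads=1, head_dim=2.
step1_k, step1_v = [[[[1.0, 2.0]]]], [[[[0.1, 0.2]]]]
step2_k, step2_v = [[[[3.0, 4.0]]]], [[[[0.3, 0.4]]]]
cache = extend_kv_cache(None, step1_k, step1_v)
cache = extend_kv_cache(cache, step2_k, step2_v)
```

After the second step the cached key sequence has length 2, so the new query can attend to both positions without recomputing the first key.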

common.pytorch.layers.BCELoss module#

class common.pytorch.layers.BCELoss.BCELoss#

Bases: torch.nn.BCELoss

common.pytorch.layers.BCEWithLogitsLoss module#

class common.pytorch.layers.BCEWithLogitsLoss.BCEWithLogitsLoss#

Bases: torch.nn.BCEWithLogitsLoss

common.pytorch.layers.BiaslessLayerNorm module#

class common.pytorch.layers.BiaslessLayerNorm.BiaslessLayerNorm#

Bases: torch.nn.Module

Construct a layernorm module in the T5 style: no bias and no subtraction of mean.

__init__(hidden_size, eps=1e-06, device=None)#

Construct a layernorm module in the T5 style: no bias and no subtraction of mean.
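
A layer norm with no bias and no mean subtraction amounts to scaling each vector by the inverse of its root mean square. The following plain-Python sketch shows that computation for one hidden vector. The function name and the explicit weight list are assumptions made for illustration; the module itself operates on tensors and holds a learned weight of size hidden_size.

```python
import math


def biasless_layer_norm(x, weight, eps=1e-6):
    """Scale x by 1/sqrt(mean(x^2) + eps), then by a per-feature weight."""
    mean_square = sum(value * value for value in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    # No mean is subtracted and no bias is added.
    return [w * value * inv_rms for value, w in zip(x, weight)]


normalized = biasless_layer_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
constant = biasless_layer_norm([1.0, 1.0], weight=[1.0, 1.0])
```

Because the mean is not removed, a constant input such as [1.0, 1.0] stays close to [1.0, 1.0] instead of collapsing to zeros as it would under a standard layer norm.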


common.pytorch.layers.CosineEmbeddingLoss module#

class common.pytorch.layers.CosineEmbeddingLoss.CosineEmbeddingLoss#

Bases: torch.nn.CosineEmbeddingLoss

common.pytorch.layers.CrossEntropyLoss module#

class common.pytorch.layers.CrossEntropyLoss.CrossEntropyLoss#

Bases: torch.nn.CrossEntropyLoss

common.pytorch.layers.EmbeddingLayer module#

class common.pytorch.layers.EmbeddingLayer.EmbeddingLayer#

Bases: torch.nn.Module

Creates token and, optionally, position and segment embeddings.

  • vocab_size (int) – Size of input vocabulary.

  • embedding_size (int) – Dimension of the embedding space.

  • pad_token_id (Optional[int]) – If specified, the entries at pad_token_id do not contribute to the gradient; therefore, the embedding vector at pad_token_id is not updated during training.

  • segment_embedding_size (int) – Dimension of the embedding space for segment embeddings. Useful when factorized embeddings are used for tokens and so the size of the embedding space for segments differs from that for tokens. Defaults to the same value as embedding_size.

  • embeddings_initializer (Optional[str,Callable]) – Token embeddings initializer. Defaults to ‘uniform’.

  • max_position_embeddings (int) – Maximum sequence length the model is trained with.

  • position_embedding_type (str) – One of ‘learned’, ‘fixed’ or ‘rotary’. Defaults to ‘learned’. For ‘rotary’ embeddings, position embeddings are not created at the bottom of the model but are computed with the key and query embeddings by RotaryPositionEmbeddingHelper.

  • min_timescale (Optional[int]) – The scale of the shortest sinusoid. Defaults to 1.0. Only needs to be specified when position_embedding_type is fixed.

  • max_timescale (Optional[int]) – The scale of the longest sinusoid. Defaults to 1.0e4. Only needs to be specified when position_embedding_type is fixed.

  • position_embeddings_initializer (Optional[str,Callable]) – Position embeddings initializer. Defaults to “uniform”.

  • num_segments (Optional[int]) – Number of segments for the segment embedding layer. Defaults to None, in which case the segment embedding layer is not created.

  • segment_embeddings_initializer (Optional[str,Callable]) – Segment embeddings initializer. Defaults to “uniform”.

__init__(vocab_size, embedding_size, pad_token_id=None, positional_embedding_size=None, segment_embedding_size=None, embeddings_initializer='uniform', max_position_embeddings=None, position_embedding_type='learned', min_timescale=1.0, max_timescale=10000.0, position_embeddings_initializer='uniform', num_segments=None, segment_embeddings_initializer='uniform', device=None)#
compute_positional_embeddings(input_ids, past_length=0, dtype=None)#
create_fix_pos_embedding(seq_len, embed_len, min_timescale, max_timescale)#

adapted from: /1843c72d1d5faf4c085bb198b5dde0908f4081d0/tensor2tensor/layers /

forward(input_ids, segment_ids=None, past_length=0)#
position_embedding_helper(num_heads=None, relative_attention_bias=None, num_relative_attention_buckets=32, bidirectional=False, initializer='xavier_uniform', rotary_dim=None)#
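
For position_embedding_type ‘fixed’, the min_timescale and max_timescale parameters bound a geometric sequence of sinusoid wavelengths. The sketch below is one plausible plain-Python rendering of such a table, loosely following the tensor2tensor timing-signal recipe. The function name, the sine-half/cosine-half layout and the restriction to an even embed_len are assumptions made for this example, and create_fix_pos_embedding may differ in these details.

```python
import math


def fixed_position_embedding(seq_len, embed_len, min_timescale=1.0, max_timescale=1.0e4):
    """Return a [seq_len][embed_len] table of sinusoidal position embeddings."""
    if embed_len % 2:
        raise ValueError("this sketch only handles an even embed_len")
    num_timescales = embed_len // 2
    # Timescales grow geometrically from min_timescale to max_timescale.
    log_increment = math.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    timescales = [min_timescale * math.exp(i * log_increment) for i in range(num_timescales)]
    table = []
    for position in range(seq_len):
        scaled = [position / t for t in timescales]
        # Assumed layout: all sines first, then all cosines.
        table.append([math.sin(s) for s in scaled] + [math.cos(s) for s in scaled])
    return table


embeddings = fixed_position_embedding(seq_len=4, embed_len=6)
```

Position 0 maps to zeros in the sine half and ones in the cosine half, and the slowest sinusoid completes one period roughly every 2π · max_timescale positions.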

common.pytorch.layers.FeedForwardNetwork module#

class common.pytorch.layers.FeedForwardNetwork.FeedForwardNetwork#

Bases: torch.nn.Module

A feed forward network that consists of a stack of fully connected layers arranged as [LinearLayer -> Activation -> Dropout] block repeated len(layers_units) times.

  • input_unit (int) – integer for number of in_features of input.

  • layers_units (int) – List of units for each layer.

  • layers_activation (str) – List of activation types (str) for each layer.

  • layers_dropout_rates (float) – List of dropout rates (float) for each layer.

  • use_bias (bool) – If True, use bias throughout all layers.

  • kernel_initializer – Kernel initializer. Defaults to “xavier_uniform”.

  • bias_initializer – Bias initializer. Defaults to “zeros”.

  • output_layer_initializer – If not None, initialize the last projection layer with this initializer. Defaults to None.

Initialize the FFN object instance.

__init__(input_unit, layers_units, layers_activation=None, layers_dropout_rates=None, use_bias=False, kernel_initializer='xavier_uniform', bias_initializer='zeros', output_layer_initializer=None, device=None)#

Initialize the FFN object instance.
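
The list-valued arguments are easiest to read side by side: entry i of layers_units, layers_activation and layers_dropout_rates describes block i, and each block's input width is the previous block's output width. The hypothetical helper below (not part of this module) expands the arguments into one record per [LinearLayer -> Activation -> Dropout] block, under the assumption that omitted lists mean "no activation" and "no dropout".

```python
def plan_ffn(input_unit, layers_units, layers_activation=None, layers_dropout_rates=None):
    """Expand FFN arguments into one dict per block, chaining feature sizes."""
    depth = len(layers_units)
    activations = layers_activation or [None] * depth
    dropouts = layers_dropout_rates or [None] * depth
    if not (len(activations) == len(dropouts) == depth):
        raise ValueError("all per-layer lists must have len(layers_units) entries")
    # The first block consumes input_unit features; later blocks consume
    # the output of the block before them.
    in_features = [input_unit] + list(layers_units[:-1])
    return [
        {"in_features": i, "out_features": o, "activation": a, "dropout": d}
        for i, o, a, d in zip(in_features, layers_units, activations, dropouts)
    ]


plan = plan_ffn(768, [3072, 768], ["gelu", None], [0.1, 0.0])
```

Each record lines up with the in_features, out_features, activation and dropout arguments of SingleFeedForwardLayer.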

class common.pytorch.layers.FeedForwardNetwork.SingleFeedForwardLayer#

Bases: torch.nn.Module

Initialize Single FFN layer instance.

__init__(in_features, out_features, use_bias=False, activation=None, dropout=None, device=None)#

common.pytorch.layers.GPTJDecoderLayer module#

common.pytorch.layers.GaussianNLLLoss module#

class common.pytorch.layers.GaussianNLLLoss.GaussianNLLLoss#

Bases: torch.nn.Module

__init__(*args: Any, **kwargs: Any) None#

common.pytorch.layers.HingeEmbeddingLoss module#

class common.pytorch.layers.HingeEmbeddingLoss.HingeEmbeddingLoss#

Bases: torch.nn.HingeEmbeddingLoss

common.pytorch.layers.HuberLoss module#

class common.pytorch.layers.HuberLoss.HuberLoss#

Bases: torch.nn.Module

__init__(*args: Any, **kwargs: Any) None#

common.pytorch.layers.KLDivLoss module#

class common.pytorch.layers.KLDivLoss.KLDivLoss#

Bases: torch.nn.KLDivLoss

common.pytorch.layers.L1Loss module#

class common.pytorch.layers.L1Loss.L1Loss#

Bases: torch.nn.L1Loss

common.pytorch.layers.MSELoss module#

class common.pytorch.layers.MSELoss.MSELoss#

Bases: torch.nn.MSELoss

common.pytorch.layers.MarginRankingLoss module#

class common.pytorch.layers.MarginRankingLoss.MarginRankingLoss#

Bases: torch.nn.MarginRankingLoss

common.pytorch.layers.MultiLabelSoftMarginLoss module#

class common.pytorch.layers.MultiLabelSoftMarginLoss.MultiLabelSoftMarginLoss#

Bases: torch.nn.MultiLabelSoftMarginLoss

common.pytorch.layers.MultiMarginLoss module#

class common.pytorch.layers.MultiMarginLoss.MultiMarginLoss#

Bases: torch.nn.Module

__init__(*args: Any, **kwargs: Any) None#

common.pytorch.layers.NLLLoss module#

class common.pytorch.layers.NLLLoss.NLLLoss#

Bases: torch.nn.NLLLoss

common.pytorch.layers.PoissonNLLLoss module#

class common.pytorch.layers.PoissonNLLLoss.PoissonNLLLoss#

Bases: torch.nn.PoissonNLLLoss

common.pytorch.layers.RelativePositionEmbeddingLayer module#

class common.pytorch.layers.RelativePositionEmbeddingLayer.RelativePositionEmbeddingLayer#

Bases: torch.nn.Module

Relative Position Embedding Layer

  • relative_attention_bias (Tensor) – Tensor with relative attention weights. Shape: [num_relative_attention_buckets, num_heads]. Defaults to None.

  • num_relative_attention_buckets (int) – Number of buckets used to calculate relative position bias. Default: 32

  • max_relative_positions (int) – The maximum relative distance used when calculating relative position buckets. See relative_position_bucket docs for more details. Default: 128

  • bidirectional_relative_attention (bool) – Whether attention is bidirectional.

  • allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DEBERTA). Default: False.

  • relative_attn_bias_initializer (str) – Relative attention bias initializer. Defaults to xavier_uniform.


Relative position bias, to be used in attention masking

Return type

position_bias (Tensor)

__init__(num_heads, relative_attention_bias=None, num_relative_attention_buckets=32, max_relative_positions=128, bidirectional_relative_attention=False, allow_negative_buckets=False, relative_attn_bias_initializer='xavier_uniform')#
static compute_raw_relative_positions(query_length, key_length, device=None)#
compute_relative_positions(query_length, key_length)#
forward(seq_length, key_length, past_kv=None)#
static relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128, allow_negative_buckets=False)#

Translate relative position to a bucket number for relative attention. The relative position is defined as memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to position. If bidirectional_relative_attention = False, then positive relative positions are invalid. We use smaller buckets for small absolute relative positions and larger buckets for larger absolute relative positions. All relative positions >= max_distance map to the same bucket. All relative positions <= -max_distance map to the same bucket. This should allow for more graceful generalization to longer sequences than the model has been trained on.

  • relative_position (Tensor) – Tensor with relative positions.

  • bidirectional (bool) – Whether attention is bidirectional.

  • num_buckets (int) – Number of buckets for relative positions.

  • max_distance (int) – Used in order to calculate relative position buckets.

  • allow_negative_buckets – If enabled, position buckets will be both positive and negative (as required by certain models like DEBERTA). Default: False.


a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_relative_attention_buckets).
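
The bucketing rule described above can be sketched for a single integer relative position as follows. This follows the widely used T5-style scheme (exact buckets for small distances, logarithmically spaced buckets up to max_distance) and is an approximation written for illustration: the name t5_style_bucket is hypothetical, the real method works on tensors, and allow_negative_buckets is not modelled here.

```python
import math


def t5_style_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
    """Map one relative position (memory_position - query_position) to a bucket."""
    bucket = 0
    if bidirectional:
        # Half of the buckets are reserved for positive relative positions.
        num_buckets //= 2
        if relative_position > 0:
            bucket += num_buckets
        distance = abs(relative_position)
    else:
        # Positive relative positions are invalid and collapse to distance 0.
        distance = max(-relative_position, 0)
    max_exact = num_buckets // 2
    if distance < max_exact:
        # Small distances each get their own bucket.
        return bucket + distance
    # Larger distances share logarithmically wider buckets up to max_distance.
    log_ratio = math.log(distance / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (num_buckets - max_exact))
    return bucket + min(large, num_buckets - 1)


causal = [t5_style_bucket(p, bidirectional=False) for p in (0, -5, 3, -16, -128, -1000)]
both = [t5_style_bucket(p, bidirectional=True) for p in (-1, 1, -1000, 1000)]
```

With the default 32 buckets and unidirectional attention, distances 0 to 15 map to their own bucket, and everything at or beyond max_distance shares the last bucket.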


common.pytorch.layers.SmoothL1Loss module#

class common.pytorch.layers.SmoothL1Loss.SmoothL1Loss#

Bases: torch.nn.Module

__init__(*args: Any, **kwargs: Any) None#

common.pytorch.layers.Transformer module#

common.pytorch.layers.TransformerDecoder module#

common.pytorch.layers.TransformerDecoderLayer module#

common.pytorch.layers.TransformerEncoder module#

common.pytorch.layers.TransformerEncoderLayer module#

common.pytorch.layers.TripletMarginLoss module#

class common.pytorch.layers.TripletMarginLoss.TripletMarginLoss#

Bases: torch.nn.TripletMarginLoss

common.pytorch.layers.TripletMarginWithDistanceLoss module#

class common.pytorch.layers.TripletMarginWithDistanceLoss.TripletMarginWithDistanceLoss#

Bases: torch.nn.TripletMarginWithDistanceLoss

common.pytorch.layers.utils module#

common.pytorch.layers.utils.apply_loss_reduction(loss, reduction)#
common.pytorch.layers.utils.apply_position_bias(embedding_helper, seq_length, key_length, past_kv=None)#

Module contents#