common.pytorch.layers package#
Submodules#
common.pytorch.layers.AlibiPositionEmbeddingLayer module#
- class common.pytorch.layers.AlibiPositionEmbeddingLayer.AlibiPositionEmbeddingLayer#
Bases:
torch.nn.Module
Alibi Position Embedding Layer, providing the alibi bias as in the paper: https://arxiv.org/abs/2108.12409
- Parameters
num_heads (int) – Number of attention heads.
slopes (Tensor) – Slope values to use for the alibi heads. Shape: [num_heads, 1]. Defaults to None.
alibi_trainable_slopes (bool) – Whether the alibi slopes are trainable parameters.
bidirectional_relative_attention (bool) – Whether alibi attention is bidirectional.
slopes_initializer (str) – Initializer for the alibi slopes if they are trainable. Defaults to xavier_uniform.
alibi_implementation (str) – Variant name for the alibi implementation. Currently accepts embedding and expand. Defaults to embedding.
- Returns
Relative position bias, to be used in attention masking
- Return type
position_bias (Tensor)
- __init__(num_heads, slopes=None, alibi_trainable_slopes=False, bidirectional_relative_attention=False, slopes_initializer='xavier_uniform', alibi_implementation='embedding')#
- forward(seq_length, key_length, past_kv=None)#
- reset_parameters()#
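A minimal usage sketch, assuming the class is importable from the common.pytorch.layers.AlibiPositionEmbeddingLayer module path shown above; the number of heads and sequence lengths are illustrative values, and the call follows the __init__ and forward signatures documented here.

# Assumed import path, taken from the module name above.
from common.pytorch.layers.AlibiPositionEmbeddingLayer import (
    AlibiPositionEmbeddingLayer,
)

# Build the layer with the documented defaults: slopes are derived
# internally when slopes=None and are not trainable.
alibi = AlibiPositionEmbeddingLayer(
    num_heads=8,
    slopes=None,
    alibi_trainable_slopes=False,
    bidirectional_relative_attention=False,
    alibi_implementation="embedding",
)

# forward(seq_length, key_length, past_kv=None) returns the relative
# position bias to be used in attention masking (see Returns above).
position_bias = alibi(seq_length=128, key_length=128)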
common.pytorch.layers.AttentionHelper module#
- common.pytorch.layers.AttentionHelper.get_attention_module(attn_type: str, extra_params)#
Retrieves the attention module corresponding to attn_type and checks that the provided extra_params are valid for that attention module.
common.pytorch.layers.AttentionLayer module#
- class common.pytorch.layers.AttentionLayer.MultiheadAttention#
Bases:
torch.nn.Module
Multi-head attention layer. Adapted from: https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention
- Parameters
embed_dim (int) – Number of input units in each projection output
num_heads (int) – Number of attention heads.
inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.
dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.
batch_first (bool) – If True, the input and output tensors are provided as (batch, seq, feature); otherwise the format is (seq, batch, feature). Default: True (batch, seq, feature).
add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.
add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False.
kdim (int) – Number of output units in key projection.
vdim (int) – Number of output units in value projection.
use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
use_ffn_bias (bool) – Whether to use bias in the output projection.
attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.
attention_q_initializer – Query projection kernel initializer. If not specified, the query is initialized via attention_initializer.
output_layer_initializer (str or initializer) – If not None, use this initializer for the output transform layer. Defaults to None.
bias_initializer (str) – Bias initializer. Defaults to zeros.
attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.
attention_kernel (str | None) – Kernel to use. Defaults to None, in which case the compiler selects the kernel. Accepted values: default (default implementation) and optimized_beta (optimized implementation; beta feature, support is limited).
- __init__(embed_dim, num_heads, inner_dim=None, dropout=0.0, batch_first=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, use_projection_bias=None, use_ffn_bias=False, attention_initializer='xavier_uniform', attention_q_initializer=None, output_layer_initializer=None, bias_initializer='zeros', attention_type='scaled_dot_product', softmax_dtype_fp32=True, attention_kernel: Optional[str] = None, device=None)#
- apply_attention_bias(logits, attention_bias)#
- apply_position_bias(logits, position_bias)#
- apply_rotary_position_embedding(vector, rotary_position_embedding_helper, real_seq_length, offset_length)#
- calculate_attention_logits(q, k)#
- calculate_attention_output(attention_scores, v)#
- calculate_attention_scores(logits)#
- check_extra_params()#
- combine_masks(attn_mask_reshaped, key_padding_mask_reshaped)#
- construct_key_vector(k, attn_mask=None, key_padding_mask=None)#
- construct_present_kv(cache_present_kv, k, v)#
- construct_query_vector(q, attn_mask=None, key_padding_mask=None)#
- construct_value_vector(v, attn_mask=None, key_padding_mask=None)#
- forward(q, k, v, attn_mask=None, key_padding_mask=None, need_weights=False, average_attn_weights=True, past_kv=None, cache_present_kv=False, past_kv_self_attn=True, position_bias=None, rotary_position_embedding_helper=None)#
Applies the attention mechanism to queries q, keys k, and values v.
- Parameters
q (Tensor) – Queries, shape [batch_size, seq_length, embed_dim].
k (Tensor) – Keys, shape [batch_size, seq_length, embed_dim].
v (Tensor) – Values, shape [batch_size, seq_length, embed_dim].
attn_mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch, query_length, seq_length].
.key_padding_mask (Tensor) – If specified, a mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e. treat as “padding”). Defaults to None.
need_weights (bool) – If specified, returns attn_output_weights in addition to attn_outputs. Default: False.
average_attn_weights (bool) – If true, indicates that the returned attn_weights should be averaged across heads. Otherwise, attn_weights are provided separately per head. Note that this flag only has an effect when need_weights=True. Default: True (i.e. average weights across heads)
past_kv (tuple(Tensor, Tensor)) – Past keys and values. Tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. The 0th and 1st tensors contain the past keys and values, respectively. Defaults to None.
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
past_kv_self_attn (bool) – Specifies whether the past keys & values should be used for self-attention (True) or cross-attention (False). Ignored if past_kv is not provided. Default: True.
position_bias (Tensor) – Tensor containing position bias to apply in attention.
rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- Returns
If cache_present_kv is False, no entry for present keys and values is provided.
- get_sequence_length(past_kv, real_seq_length)#
- process_attention_mask(attn_mask, past_kv, q)#
- process_k_before_logits_calc(k)#
- process_key_padding_mask(key_padding_mask, attn_mask, past_kv, q)#
- process_past_kv(past_kv, past_kv_self_attn, k, v)#
- process_q_before_logits_calc(q)#
- process_v_before_logits_calc(v)#
- reset_parameters()#
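A minimal self-attention sketch based on the constructor and forward signatures above, assuming the class is importable from the common.pytorch.layers.AttentionLayer module path; tensor shapes follow the batch_first=True convention, and the dimensions are illustrative.

import torch

# Assumed import path, taken from the module name above.
from common.pytorch.layers.AttentionLayer import MultiheadAttention

batch_size, seq_length, embed_dim, num_heads = 2, 16, 64, 4

attn = MultiheadAttention(
    embed_dim=embed_dim,
    num_heads=num_heads,
    dropout=0.0,
    batch_first=True,                     # inputs are (batch, seq, feature)
    attention_type="scaled_dot_product",  # documented default
)

x = torch.randn(batch_size, seq_length, embed_dim)

# Self-attention: queries, keys, and values are the same tensor.
# A 2D attn_mask of shape [batch_size, seq_length] is also accepted
# (see the forward() parameters above).
output = attn(q=x, k=x, v=x, attn_mask=None)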
common.pytorch.layers.BCELoss module#
- class common.pytorch.layers.BCELoss.BCELoss#
Bases:
torch.nn.BCELoss
common.pytorch.layers.BCEWithLogitsLoss module#
- class common.pytorch.layers.BCEWithLogitsLoss.BCEWithLogitsLoss#
Bases:
torch.nn.BCEWithLogitsLoss
common.pytorch.layers.BiaslessLayerNorm module#
- class common.pytorch.layers.BiaslessLayerNorm.BiaslessLayerNorm#
Bases:
torch.nn.Module
Construct a layernorm module in the T5 style: no bias and no subtraction of the mean.
- __init__(hidden_size, eps=1e-06, device=None)#
Construct a layernorm module in the T5 style: no bias and no subtraction of the mean.
- forward(hidden_states)#
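A short sketch of the computation the description implies (RMS-style normalization with a learned scale, no bias, and no mean subtraction); this is an illustrative re-implementation for reference, not the layer's actual source.

import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    """Illustrative T5-style layer norm: scale only, no bias, no mean subtraction."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states):
        # Normalize by the root mean square over the last dimension;
        # the mean is never subtracted, and there is no bias term.
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
        return self.weight * hidden_states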
common.pytorch.layers.CosineEmbeddingLoss module#
- class common.pytorch.layers.CosineEmbeddingLoss.CosineEmbeddingLoss#
Bases:
torch.nn.CosineEmbeddingLoss
common.pytorch.layers.CrossEntropyLoss module#
- class common.pytorch.layers.CrossEntropyLoss.CrossEntropyLoss#
Bases:
torch.nn.CrossEntropyLoss
common.pytorch.layers.EmbeddingLayer module#
- class common.pytorch.layers.EmbeddingLayer.EmbeddingLayer#
Bases:
torch.nn.Module
Creates token and, optionally, position and segment embeddings.
- Parameters
vocab_size (int) – Size of input vocabulary.
embedding_size (int) – Dimension of the embedding space.
pad_token_id (Optional[int]) – If specified, the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training.
segment_embedding_size (int) – Dimension of the embedding space for segment embeddings. Useful when factorized embeddings are used for tokens and so the size of the embedding space for segments differs from that for tokens. Defaults to the same value as embedding_size.
embeddings_initializer (Optional[str,Callable]) – Token embeddings initializer. Defaults to ‘uniform’.
max_position_embeddings (int) – Maximum sequence length the model is trained with.
position_embedding_type (str) – One of 'learned', 'fixed', or 'rotary'. Defaults to 'learned'. For 'rotary' embeddings, the embeddings are not created in this layer but are computed together with the key and query embeddings by RotaryPositionEmbeddingHelper.
min_timescale (Optional[int]) – The scale of the shortest sinusoid. Defaults to 1.0 (only needs to be specified when position_embedding_type is 'fixed').
max_timescale (Optional[int]) – The scale of the longest sinusoid. Defaults to 1.0e4 (only needs to be specified when position_embedding_type is 'fixed').
position_embeddings_initializer (Optional[str,Callable]) – Position embeddings initializer. Defaults to “uniform”.
num_segments (Optional[int]) – Number of segments for the segment embedding layer. Defaults to None, in which case the segment embedding layer is not created.
segment_embeddings_initializer (Optional[str,Callable]) – Segment embeddings initializer. Defaults to “uniform”.
- __init__(vocab_size, embedding_size, pad_token_id=None, positional_embedding_size=None, segment_embedding_size=None, embeddings_initializer='uniform', max_position_embeddings=None, position_embedding_type='learned', min_timescale=1.0, max_timescale=10000.0, position_embeddings_initializer='uniform', num_segments=None, segment_embeddings_initializer='uniform', device=None)#
- compute_positional_embeddings(input_ids, past_length=0, dtype=None)#
- compute_segment_embeddings(segment_ids)#
- compute_token_embeddings(input_ids)#
- create_fix_pos_embedding(seq_len, embed_len, min_timescale, max_timescale)#
Adapted from: https://github.com/tensorflow/tensor2tensor/blob/1843c72d1d5faf4c085bb198b5dde0908f4081d0/tensor2tensor/layers/common_attention.py#L407
- forward(input_ids, segment_ids=None, past_length=0)#
- get_input_embeddings()#
- position_embedding_helper(num_heads=None, relative_attention_bias=None, num_relative_attention_buckets=32, bidirectional=False, initializer='xavier_uniform', rotary_dim=None)#
- reset_parameters()#
- set_input_embeddings(new_embeddings)#
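A minimal usage sketch combining token, learned position, and segment embeddings, based on the __init__ and forward signatures above; the import path follows the module name, the vocabulary and sequence sizes are illustrative, and the output shape noted in the comment is an assumption.

import torch

# Assumed import path, taken from the module name above.
from common.pytorch.layers.EmbeddingLayer import EmbeddingLayer

embedding = EmbeddingLayer(
    vocab_size=32000,
    embedding_size=768,
    pad_token_id=0,
    max_position_embeddings=512,
    position_embedding_type="learned",
    num_segments=2,                     # enables the segment embedding table
)

input_ids = torch.randint(0, 32000, (2, 128))        # (batch, seq)
segment_ids = torch.zeros(2, 128, dtype=torch.long)

# forward(input_ids, segment_ids=None, past_length=0) returns the combined
# embeddings, presumably of shape (batch, seq, embedding_size).
hidden_states = embedding(input_ids, segment_ids=segment_ids)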
common.pytorch.layers.FeedForwardNetwork module#
- class common.pytorch.layers.FeedForwardNetwork.FeedForwardNetwork#
Bases:
torch.nn.Module
A feed forward network that consists of a stack of fully connected layers arranged as a [LinearLayer -> Activation -> Dropout] block repeated len(layers_units) times.
- Parameters
input_unit (int) – Number of input features (in_features of the first layer).
layers_units (list[int]) – List of units for each layer.
layers_activation (list[str]) – List of activation types (str) for each layer.
layers_dropout_rates (list[float]) – List of dropout rates (float) for each layer.
use_bias (bool) – If True, use bias throughout all layers.
kernel_initializer – Kernel initializer. Defaults to “xavier_uniform”.
bias_initializer – Bias initializer. Defaults to “zeros”.
output_layer_initializer – If not None, initialize the last projection layer with this initializer. Defaults to None.
Initialize the FFN object instance.
- __init__(input_unit, layers_units, layers_activation=None, layers_dropout_rates=None, use_bias=False, kernel_initializer='xavier_uniform', bias_initializer='zeros', output_layer_initializer=None, device=None)#
Initialize the FFN object instance.
- forward(inputs)#
- reset_parameters()#
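A minimal usage sketch for a two-layer FFN based on the constructor signature above; the import path follows the module name, the layer sizes are illustrative, and the activation names (and the use of None for "no activation") are assumptions rather than documented values.

import torch

# Assumed import path, taken from the module name above.
from common.pytorch.layers.FeedForwardNetwork import FeedForwardNetwork

# Two [LinearLayer -> Activation -> Dropout] blocks: 768 -> 3072 -> 768.
ffn = FeedForwardNetwork(
    input_unit=768,
    layers_units=[3072, 768],
    layers_activation=["gelu", None],   # per-layer activations (assumed names)
    layers_dropout_rates=[0.1, 0.1],
    use_bias=True,
    kernel_initializer="xavier_uniform",
)

x = torch.randn(2, 128, 768)   # (batch, seq, features)
y = ffn(x)                     # last dimension becomes layers_units[-1]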
common.pytorch.layers.GPTJDecoderLayer module#
common.pytorch.layers.GaussianNLLLoss module#
common.pytorch.layers.HingeEmbeddingLoss module#
- class common.pytorch.layers.HingeEmbeddingLoss.HingeEmbeddingLoss#
Bases:
torch.nn.HingeEmbeddingLoss
common.pytorch.layers.HuberLoss module#
common.pytorch.layers.KLDivLoss module#
- class common.pytorch.layers.KLDivLoss.KLDivLoss#
Bases:
torch.nn.KLDivLoss
common.pytorch.layers.L1Loss module#
- class common.pytorch.layers.L1Loss.L1Loss#
Bases:
torch.nn.L1Loss
common.pytorch.layers.MSELoss module#
- class common.pytorch.layers.MSELoss.MSELoss#
Bases:
torch.nn.MSELoss
common.pytorch.layers.MarginRankingLoss module#
- class common.pytorch.layers.MarginRankingLoss.MarginRankingLoss#
Bases:
torch.nn.MarginRankingLoss
common.pytorch.layers.MultiLabelSoftMarginLoss module#
- class common.pytorch.layers.MultiLabelSoftMarginLoss.MultiLabelSoftMarginLoss#
Bases:
torch.nn.MultiLabelSoftMarginLoss
common.pytorch.layers.MultiMarginLoss module#
common.pytorch.layers.NLLLoss module#
- class common.pytorch.layers.NLLLoss.NLLLoss#
Bases:
torch.nn.NLLLoss
common.pytorch.layers.PoissonNLLLoss module#
- class common.pytorch.layers.PoissonNLLLoss.PoissonNLLLoss#
Bases:
torch.nn.PoissonNLLLoss
common.pytorch.layers.RelativePositionEmbeddingLayer module#
- class common.pytorch.layers.RelativePositionEmbeddingLayer.RelativePositionEmbeddingLayer#
Bases:
torch.nn.Module
Relative Position Embedding Layer
- Parameters
relative_attention_bias (Tensor) – Tensor with relative attention weights. Shape: [num_relative_attention_buckets, num_heads]. Defaults to None.
num_relative_attention_buckets (int) – Number of buckets used to calculate relative position bias. Default: 32
max_relative_positions (int) – The maximum relative distance used when calculating relative position buckets. See relative_position_bucket docs for more details. Default: 128
bidirectional_relative_attention (bool) – Whether attention is bidirectional.
allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DEBERTA). Default: False.
relative_attn_bias_initializer (str) – Relative attention bias initializer. Defaults to xavier_uniform.
- Returns
Relative position bias, to be used in attention masking
- Return type
position_bias (Tensor)
- __init__(num_heads, relative_attention_bias=None, num_relative_attention_buckets=32, max_relative_positions=128, bidirectional_relative_attention=False, allow_negative_buckets=False, relative_attn_bias_initializer='xavier_uniform')#
- static compute_raw_relative_positions(query_length, key_length, device=None)#
- compute_relative_positions(query_length, key_length)#
- forward(seq_length, key_length, past_kv=None)#
- get_embedding()#
- static relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128, allow_negative_buckets=False)#
Translate relative position to a bucket number for relative attention. The relative position is defined as memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to position. If bidirectional_relative_attention = False, then positive relative positions are invalid. We use smaller buckets for small absolute relative positions and larger buckets for larger absolute relative positions. All relative positions >= max_distance map to the same bucket. All relative positions <= -max_distance map to the same bucket. This should allow for more graceful generalization to longer sequences than the model has been trained on.
- Parameters
relative_position (Tensor) – Tensor with relative positions.
bidirectional (bool) – Whether attention is bidirectional.
num_buckets (int) – Number of buckets for relative positions.
max_distance (int) – Used in order to calculate relative position buckets.
allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DEBERTA). Default: False.
- Returns
A Tensor with the same shape as relative_position, containing int32 values in the range [0, num_relative_attention_buckets).
- reset_parameters()#
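A minimal sketch based on the signatures above: build the layer, produce a T5-style relative position bias for a sequence, and call the static relative_position_bucket helper directly on raw relative positions; the import path follows the module name and the sizes are illustrative.

import torch

# Assumed import path, taken from the module name above.
from common.pytorch.layers.RelativePositionEmbeddingLayer import (
    RelativePositionEmbeddingLayer,
)

rel_pos = RelativePositionEmbeddingLayer(
    num_heads=8,
    num_relative_attention_buckets=32,
    max_relative_positions=128,
    bidirectional_relative_attention=True,
)

# Relative position bias, to be used in attention masking.
position_bias = rel_pos(seq_length=64, key_length=64)

# The static bucketing helper can also be called on raw relative positions:
# memory_position - query_position for a toy 4-token sequence.
relative_position = torch.arange(4)[None, :] - torch.arange(4)[:, None]
buckets = RelativePositionEmbeddingLayer.relative_position_bucket(
    relative_position,
    bidirectional=True,
    num_buckets=32,
    max_distance=128,
)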
common.pytorch.layers.SmoothL1Loss module#
common.pytorch.layers.Transformer module#
common.pytorch.layers.TransformerDecoder module#
common.pytorch.layers.TransformerDecoderLayer module#
common.pytorch.layers.TransformerEncoder module#
common.pytorch.layers.TransformerEncoderLayer module#
common.pytorch.layers.TripletMarginLoss module#
- class common.pytorch.layers.TripletMarginLoss.TripletMarginLoss#
Bases:
torch.nn.TripletMarginLoss
common.pytorch.layers.TripletMarginWithDistanceLoss module#
- class common.pytorch.layers.TripletMarginWithDistanceLoss.TripletMarginWithDistanceLoss#
Bases:
torch.nn.TripletMarginWithDistanceLoss
common.pytorch.layers.utils module#
- common.pytorch.layers.utils.apply_loss_reduction(loss, reduction)#
- common.pytorch.layers.utils.apply_position_bias(embedding_helper, seq_length, key_length, past_kv=None)#
- common.pytorch.layers.utils.autogen_loss(loss_cls)#