cerebras.modelzoo.layers#

class cerebras.modelzoo.layers.AdaLayerNorm[source]#

Bases: torch.nn.Module

__init__(normalized_shape, eps=1e-05, device=None, dtype=None)[source]#
class cerebras.modelzoo.layers.AlibiPositionEmbeddingLayer[source]#

Bases: torch.nn.Module

Alibi Position Embedding Layer. Implements the symmetric case, with bidirectional attention supported.

ALiBi bias as described in the paper: https://arxiv.org/abs/2108.12409

Parameters
  • num_heads (int) – number of attention heads.

  • slopes (Tensor) – slope values to use for alibi heads. Shape: [num_heads, 1]. Defaults to None.

  • alibi_trainable_slopes (bool) – whether the alibi slopes are trainable parameters.

  • slopes_initializer (str) – initializer for the alibi slopes if they are trainable. Defaults to xavier_uniform.

Returns

Relative position bias, to be used in attention masking

Return type

position_bias (Tensor)

__init__(num_heads, slopes=None, alibi_trainable_slopes=False, slopes_initializer='xavier_uniform', scaling_factor=1.0)[source]#
forward(seq_length, key_length, past_kv=None)[source]#

Return the position bias based on the alibi slopes.

Parameters
  • seq_length (int) – the length of query tokens.

  • key_length (int) – the length of key tokens.

Returns

Position bias tensor with shape [num_heads, query_length, key_length]

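A minimal usage sketch (the head count and sequence lengths below are illustrative; the returned shape follows the documented forward contract):

>>> import torch
>>> from cerebras.modelzoo.layers import AlibiPositionEmbeddingLayer
>>> alibi = AlibiPositionEmbeddingLayer(num_heads=8)
>>> # position bias to add to attention scores, shape [num_heads, seq_length, key_length]
>>> position_bias = alibi(seq_length=16, key_length=16)  # shape [8, 16, 16]
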
class cerebras.modelzoo.layers.MultiheadAttention[source]#

Bases: torch.nn.Module

Multi-head attention layer. Adapted from: https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention

Parameters
  • embed_dim (int) – Number of input units in each projection output

  • num_heads (int) – Number of attention heads.

  • inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.

  • dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.

  • batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature), otherwise the format will be (seq, batch, feature). Default: True (batch, seq, feature).

  • add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.

  • add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False

  • kdim (int) – Number of input units in the key projection

  • vdim (int) – Number of input units in the value projection

  • use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.

  • use_ffn_bias (bool) – Whether to use bias in the output projection.

  • attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.

  • attention_q_initializer – Query projection kernel initializer. If not specified, the query will be initialized via attention_initializer

  • output_layer_initializer (str or initializer) – If not None, use this initializer for the output transform layer. Defaults to None.

  • bias_initializer (str) – Bias initializer. Defaults to zeros.

  • attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.

  • scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (the per-head dimension, hidden_size / num_heads) instead of sqrt(d).

  • softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.

  • attention_kernel (str | None) –

    Kernel to use. Uses default if None. See accepted values below.

    None - Default implementation.
    fast_attention - Experimental optimized implementation.

  • device (optional) – Device to create the model parameters on, can be a cuda device or CS device.

__init__(embed_dim, num_heads, inner_dim=None, dropout=0.0, batch_first=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, use_projection_bias=None, use_ffn_bias=False, attention_initializer='xavier_uniform', attention_q_initializer=None, output_layer_initializer=None, bias_initializer='zeros', attention_type='scaled_dot_product', scale_qk_dot_by_d=False, softmax_dtype_fp32=True, attention_kernel=None, scale_qk_dot_by_layer_idx=False, device=None)[source]#
forward(q, k, v, attn_mask=None, key_padding_mask=None, need_weights=False, average_attn_weights=True, past_kv=None, cache_present_kv=False, past_kv_self_attn=True, position_bias=None, rotary_position_embedding_helper=None, layer_idx=None)[source]#

Applies the attention mechanism to queries q, keys k and values v.

Parameters
  • q (Tensor) – Queries, shape [batch_size, seq_length, embed_dim].

  • k (Tensor) – Keys, shape [batch_size, seq_length, embed_dim].

  • v (Tensor) – Values, shape [batch_size, seq_length, embed_dim].

  • attn_mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch_size, query_length, seq_length].

  • key_padding_mask (Tensor) – If specified, a mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e. treat as “padding”). Defaults to None.

  • need_weights (bool) – If specified, returns attn_output_weights in addition to attn_outputs. Default: False.

  • average_attn_weights (bool) – If true, indicates that the returned attn_weights should be averaged across heads. Otherwise, attn_weights are provided separately per head. Note that this flag only has an effect when need_weights=True. Default: True (i.e. average weights across heads)

  • past_kv (tuple(tensor, tensor)) – Past keys and values. Tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. The 0th and 1st tensor contain the past keys and values, respectively. Defaults to None.

  • cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.

  • past_kv_self_attn (bool) – Specifies whether the past keys & values should be used for self-attention (true) or cross-attention (false). Ignored if past_kv is not provided. Default: True

  • position_bias (Tensor) – Tensor containing position bias to apply in attention with shape [num_heads, query_length, key_length].

  • rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.

Returns

Attention output tensor with shape [batch_size, seq_length, embed_dim].

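A minimal self-attention sketch (the embedding size, head count, and batch/sequence dimensions are illustrative; with the default batch_first=True the inputs are [batch_size, seq_length, embed_dim]):

>>> import torch
>>> from cerebras.modelzoo.layers import MultiheadAttention
>>> mha = MultiheadAttention(embed_dim=64, num_heads=4)
>>> x = torch.rand(2, 10, 64)
>>> # self-attention: use the same tensor for queries, keys, and values
>>> out = mha(x, x, x)  # shape [2, 10, 64]
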
class cerebras.modelzoo.layers.BatchChannelNorm2D[source]#

Bases: torch.nn.Module

Implements Batch-Channel Normalization as proposed in Micro-Batch Training with Batch-Channel Normalization and Weight Standardization (https://arxiv.org/abs/1903.10520).

Parameters
  • num_groups (int) – number of groups to separate the channels into.

  • num_channels (int) – number of channels. C from an expected input of size (N, C, H, W).

  • eps (float) – a value added to the denominator for numerical stability. Default: 1e-5.

  • momentum (float) – The update rate used for the running_mean and running_var computation. Default: 0.1.

  • device (torch.device) – Device to place the learnable parameters.

  • dtype (torch.dtype) – Data type of learnable parameters.

Shape:
  • Input: (N, C, H, W)

  • Output: (N, C, H, W) (same shape as input)

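Example (illustrative sketch; num_groups and the input size are arbitrary choices, with num_channels divisible by num_groups):

>>> import torch
>>> from cerebras.modelzoo.layers import BatchChannelNorm2D
>>> bcn = BatchChannelNorm2D(num_groups=2, num_channels=8)
>>> x = torch.rand(4, 8, 16, 16)   # (N, C, H, W)
>>> y = bcn(x)                     # same shape as the input
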
__init__(num_groups, num_channels, eps=1e-05, momentum=0.1, device=None, dtype=None)[source]#
class cerebras.modelzoo.layers.BiaslessLayerNorm[source]#

Bases: torch.nn.Module

Applies Layer Normalization without a bias (beta), as in PaLM. Note that this is not the same as RMSNorm, which additionally does not center the distribution by subtracting the mean.

\[y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma\]
Parameters
  • normalized_shape (int or list or torch.Size) –

    input shape from an expected input of size

    \[[* \times \text{normalized\_shape}[0] \times \text{normalized\_shape}[1] \times \ldots \times \text{normalized\_shape}[-1]]\]

    If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension which is expected to be of that specific size.

  • eps – a value added to the denominator for numerical stability. Default: 1e-5

weight#

the learnable weights of the module of shape \(\text{normalized\_shape}\). The values are initialized to 1.

Shape:
  • Input: \((N, *)\)

  • Output: \((N, *)\) (same shape as input)

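Example (illustrative sketch; the normalized shape matches the last dimension of the input):

>>> import torch
>>> from cerebras.modelzoo.layers import BiaslessLayerNorm
>>> norm = BiaslessLayerNorm(64)
>>> x = torch.rand(2, 10, 64)
>>> y = norm(x)   # same shape as the input
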
__init__(normalized_shape: Tuple[int, ...], eps: float = 1e-05, device=None, dtype=None) None[source]#
class cerebras.modelzoo.layers.EmbeddingLayer[source]#

Bases: torch.nn.Module

Creates token and, optionally, position and segment embeddings.

Parameters
  • vocab_size (int) – Size of input vocabulary.

  • embedding_size (int) – Dimension of the embedding space.

  • pad_token_id (Optional[int]) – If specified, the entries at pad_token_id do not contribute to the gradient; therefore, the embedding vector at pad_token_id is not updated during training.

  • segment_embedding_size (int) – Dimension of the embedding space for segment embeddings. Useful when factorized embeddings are used for tokens and so the size of the embedding space for segments differs from that for tokens. Defaults to the same value as embedding_size.

  • embeddings_initializer (Optional[str,Callable]) – Token embeddings initializer. Defaults to ‘uniform’.

  • max_position_embeddings (int) – Maximum sequence length that the model is trained with.

  • position_embedding_type (str) – One of 'learned', 'fixed' or 'rotary'. Defaults to 'learned'. For 'rotary' embeddings, position embeddings are not created at the input; they are instead applied to the key and query embeddings by RotaryPositionEmbeddingHelper.

  • position_embedding_offset (int) – Offset for position embeddings. Defaults to 0.

  • min_timescale (Optional[int]) – The scale of the shortest sinusoid. Defaults to 1.0 (only needs to be specified when position_embedding_type is fixed).

  • max_timescale (Optional[int]) – The scale of the longest sinusoid. Defaults to 1.0e4 (only needs to be specified when position_embedding_type is fixed).

  • position_embeddings_initializer (Optional[str,Callable]) – Position embeddings initializer. Defaults to “uniform”.

  • num_segments (Optional[int]) – Number of segments for the segment embedding layer. Defaults to None, in which case the segment embedding layer is not created.

  • segment_embeddings_initializer (Optional[str,Callable]) – Segment embeddings initializer. Defaults to “uniform”.

  • device (optional) – Device to create the model parameters on, can be a cuda device or CS device.

__init__(vocab_size, embedding_size, pad_token_id=None, initializer='xavier_uniform', embeddings_initializer='uniform', device=None, position_embedding_type='learned', max_position_embeddings=None, positional_embedding_size=None, position_embedding_offset=0, position_embeddings_initializer='uniform', min_timescale=1.0, max_timescale=10000.0, mask_padding_in_positional_embed=False, num_heads=None, relative_attention_bias=None, num_relative_attention_buckets=32, bidirectional=False, rotary_dim=None, rope_theta=10000, pad_rope=False, alibi_slopes=None, alibi_trainable_slopes=False, pos_scaling_factor=1.0, num_segments=None, segment_embedding_size=None, segment_embeddings_initializer='uniform')[source]#
forward(input_ids, position_ids=None, segment_ids=None, past_length=0)[source]#
Convert input_ids to token embeddings according to the embedding type.

Word embeddings (required), segment embeddings (optional) and position embeddings (optional).

Parameters
  • input_ids (Tensor) – input token ids with shape [batch_size, seq_length].

  • position_ids (Tensor) – position ids with shape [batch_size, seq_length].

  • segment_ids (Tensor) – input segment ids with shape [batch_size, seq_length].

Returns

Token embedding output with shape [batch_size, seq_length, embedding_size].

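A minimal usage sketch (the vocabulary size, embedding size, and max_position_embeddings are illustrative; max_position_embeddings is passed explicitly because the default position_embedding_type is 'learned'):

>>> import torch
>>> from cerebras.modelzoo.layers import EmbeddingLayer
>>> embed = EmbeddingLayer(vocab_size=1000, embedding_size=64, max_position_embeddings=128)
>>> input_ids = torch.randint(0, 1000, (2, 10))
>>> token_embeddings = embed(input_ids)   # shape [2, 10, 64]
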
class cerebras.modelzoo.layers.FeedForwardNetwork[source]#

Bases: torch.nn.Module

A feed forward network that consists of a stack of fully connected layers arranged as [LinearLayer -> Activation -> Dropout] block repeated len(layers_units) times.

Parameters
  • input_unit (int) – Number of input features (in_features) of the first linear layer.

  • layers_units (list[int]) – List of units for each layer.

  • layers_activation (list[str]) – List of activation types (str) for each layer.

  • layers_dropout_rates (list[float]) – List of dropout rates (float) for each layer.

  • use_bias (bool) – If True, use bias throughout all layers.

  • kernel_initializer – Kernel initializer. Defaults to “xavier_uniform”.

  • bias_initializer – Bias initializer. Defaults to “zeros”.

  • output_layer_initializer – If not None, initialize the last projection layer with this initializer. Defaults to None.

  • device (optional) – Device to create the model parameters on, can be a cuda device or CS device.

__init__(input_unit: int, layers_units: List[int], layers_activation: Optional[List[Union[str, Callable[[torch.Tensor], torch.Tensor]]]] = None, layers_dropout_rates: Optional[List[float]] = None, use_bias: bool = False, kernel_initializer: str = 'xavier_uniform', bias_initializer: str = 'zeros', output_layer_initializer: Optional[str] = None, device=None)[source]#

Initialize the FFN object instance.

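A minimal usage sketch (layer sizes, activations, and dropout rates are illustrative; each entry in the lists configures one [LinearLayer -> Activation -> Dropout] block):

>>> import torch
>>> from cerebras.modelzoo.layers import FeedForwardNetwork
>>> ffn = FeedForwardNetwork(
...     input_unit=64,
...     layers_units=[256, 64],
...     layers_activation=["gelu", "gelu"],
...     layers_dropout_rates=[0.1, 0.1],
... )
>>> x = torch.rand(2, 10, 64)
>>> y = ffn(x)   # shape [2, 10, 64]
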
class cerebras.modelzoo.layers.GPTJDecoderLayer[source]#

Bases: cerebras.modelzoo.layers.TransformerDecoderLayer.TransformerDecoderLayer

GPTJDecoderLayer inherits from TransformerDecoderLayer and makes two modifications:

  1. It uses a parallel decoder architecture instead of the sequential one.

  2. It supports both GPT-J and GPT-NeoX; the latter uses untied layer norm.

Reference: https://www.cerebras.net/blog/how-to-harness-the-predictive-power-of-gpt-j

Parameters
  • d_model – the number of expected features in the input (required).

  • nhead – the number of heads in the multihead-attention models (required).

  • use_untied_layer_norm (bool) – whether to use untied layer norm. Should be False for GPT-J and True for GPT-NeoX.

  • kwargs – the rest of the arguments the same as TransformerDecoderLayer

__init__(d_model: int, nhead: int, use_untied_layer_norm: bool = False, **kwargs)[source]#
forward(tgt: torch.Tensor, memory: Optional[torch.Tensor] = None, tgt_mask: Optional[torch.Tensor] = None, memory_mask: Optional[torch.Tensor] = None, tgt_key_padding_mask: Optional[torch.Tensor] = None, memory_key_padding_mask: Optional[torch.Tensor] = None, rotary_position_embedding_helper: Optional[cerebras.modelzoo.layers.RotaryPositionEmbeddingHelper.RotaryPositionEmbeddingHelper] = None, past_kv: Optional[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]] = None, cache_present_kv: bool = False, self_attn_position_bias: Optional[torch.Tensor] = None, cross_attn_position_bias: Optional[torch.Tensor] = None, layer_idx: Optional[int] = None) torch.Tensor[source]#

GPTJ layer with rotary position embeddings and parallel decoder architecture

Parameters
  • tgt – the sequence to the decoder layer (required).

  • memory – the sequence from the last layer of the encoder (required).

  • tgt_mask – the mask for the tgt sequence (optional).

  • memory_mask – the mask for the memory sequence (optional).

  • tgt_key_padding_mask – the mask for the tgt keys per batch (optional).

  • memory_key_padding_mask – the mask for the memory keys per batch (optional).

  • rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.

  • past_kv – Past keys and values for self attention and (if applicable) cross attention modules. Key/value tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. (optional).

  • cache_present_kv – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. (optional).

  • self_attn_position_bias – the tensor containing position bias to apply in self-attention, can be obtained from relative or alibi position embeddings.

Shape:

Output tensor with shape

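A minimal usage sketch (dimensions are illustrative; add_cross_attention=False is forwarded to TransformerDecoderLayer via kwargs, since GPT-style models use self-attention only):

>>> import torch
>>> from cerebras.modelzoo.layers import GPTJDecoderLayer
>>> layer = GPTJDecoderLayer(d_model=768, nhead=12, add_cross_attention=False)
>>> tgt = torch.rand(2, 10, 768)   # (batch, seq, feature) with the default batch_first=True
>>> out = layer(tgt)               # shape [2, 10, 768]
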
class cerebras.modelzoo.layers.GroupInstanceNorm[source]#

Bases: torch.nn.Module

Uses torch.nn.GroupNorm to emulate InstanceNorm by setting the number of groups equal to the number of channels.

Parameters

num_channels (int) – number of channels. C from an expected input of size (N, C, H, W).

__init__(num_channels, eps=1e-05, affine=True, device=None, dtype=None)[source]#
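
Example (illustrative sketch; the channel count and spatial size are arbitrary choices):

>>> import torch
>>> from cerebras.modelzoo.layers import GroupInstanceNorm
>>> norm = GroupInstanceNorm(num_channels=8)
>>> x = torch.rand(4, 8, 16, 16)   # (N, C, H, W)
>>> y = norm(x)                    # same shape as the input
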
class cerebras.modelzoo.layers.MultiQueryAttention[source]#

Bases: cerebras.modelzoo.layers.AttentionLayer.MultiheadAttention

Implements the Multi-Query Attention Layer from

Fast Transformer Decoding: One Write-Head is All You Need (https://arxiv.org/abs/1911.02150)

Parameters
  • embed_dim (int) – Number of input units in each projection output

  • num_heads (int) – Number of attention heads.

  • inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.

  • dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.

  • batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature), otherwise the format will be (seq, batch, feature). Default: True (batch, seq, feature).

  • add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.

  • add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False

  • kdim (int) – Number of output units in key projection

  • vdim (int) – Number of output units in value projection

  • use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.

  • use_ffn_bias (bool) – Whether to use bias in the output projection.

  • attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.

  • attention_q_initializer – Query projection kernel initializer. If not specified, the query will be initialized via attention_initializer

  • output_layer_initializer (str or initializer) – If not None, use this initializer for the output transform layer. Defaults to None.

  • bias_initializer (str) – Bias initializer. Defaults to zeros.

  • attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.

  • softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.

  • attention_kernel (str | None) –

    Kernel to use. Uses default if None. See accepted values below.

    None - Default implementation.
    fast_attention - Experimental optimized implementation.

  • device (optional) – Device to create the model parameters on, can be a cuda device or CS device.

__init__(embed_dim, num_heads, inner_dim=None, dropout=0.0, batch_first=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, use_projection_bias=None, use_ffn_bias=False, attention_initializer='xavier_uniform', attention_q_initializer=None, output_layer_initializer=None, bias_initializer='zeros', attention_type='scaled_dot_product', scale_qk_dot_by_d=False, softmax_dtype_fp32=True, attention_kernel=None, scale_qk_dot_by_layer_idx=False, device=None, num_kv_groups=1)[source]#
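
A minimal usage sketch (dimensions are illustrative; num_kv_groups=1 is the documented default and corresponds to a single shared key/value head):

>>> import torch
>>> from cerebras.modelzoo.layers import MultiQueryAttention
>>> mqa = MultiQueryAttention(embed_dim=64, num_heads=8, num_kv_groups=1)
>>> x = torch.rand(2, 10, 64)
>>> out = mqa(x, x, x)   # shape [2, 10, 64]
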
class cerebras.modelzoo.layers.RelativePositionEmbeddingLayer[source]#

Bases: torch.nn.Module

Relative Position Embedding Layer

Parameters
  • num_heads (int) – number of attention heads.

  • relative_attention_bias (Tensor) – Tensor with relative attention weights. Shape: [num_relative_attention_buckets, num_heads]. Defaults to None.

  • num_relative_attention_buckets (int) – Number of buckets used to calculate relative position bias. Default: 32

  • max_relative_positions (int) – The maximum relative distance used when calculating relative position buckets. See relative_position_bucket docs for more details. Default: 128

  • bidirectional_relative_attention (bool) – Whether attention is bidirectional.

  • allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DEBERTA). Default: False.

  • relative_attn_bias_initializer (str) – Relative attention bias initializer. Defaults to xavier_uniform.

Returns

Relative position bias, to be used in attention masking

Return type

position_bias (Tensor)

__init__(num_heads, relative_attention_bias=None, num_relative_attention_buckets=32, max_relative_positions=128, bidirectional_relative_attention=False, allow_negative_buckets=False, relative_attn_bias_initializer='xavier_uniform')[source]#
forward(seq_length, key_length, past_kv=None)[source]#

Return the position bias.

Parameters
  • seq_length (int) – the length of query tokens.

  • key_length (int) – the length of key tokens.

Returns

Position bias tensor with shape [num_heads, query_length, key_length]

static relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128, allow_negative_buckets=False)[source]#

Translate relative position to a bucket number for relative attention. The relative position is defined as memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to position. If bidirectional_relative_attention = False, then positive relative positions are invalid. We use smaller buckets for small absolute relative positions and larger buckets for larger absolute relative positions. All relative positions >= max_distance map to the same bucket. All relative positions <= -max_distance map to the same bucket. This should allow for more graceful generalization to longer sequences than the model has been trained on.

Parameters
  • relative_position (Tensor) – Tensor with relative positions.

  • bidirectional (bool) – Whether attention is bidirectional.

  • num_buckets (int) – Number of buckets for relative positions.

  • max_distance (int) – Used in order to calculate relative position buckets.

  • allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DEBERTA). Default: False.

Returns

a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_relative_attention_buckets).

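A minimal usage sketch (head count and lengths are illustrative); the static bucketing helper can also be called directly on a tensor of relative positions:

>>> import torch
>>> from cerebras.modelzoo.layers import RelativePositionEmbeddingLayer
>>> rpe = RelativePositionEmbeddingLayer(num_heads=8)
>>> position_bias = rpe(seq_length=16, key_length=16)   # shape [8, 16, 16]
>>> relative_position = torch.arange(-4, 5).unsqueeze(0)
>>> buckets = RelativePositionEmbeddingLayer.relative_position_bucket(
...     relative_position, bidirectional=True, num_buckets=32, max_distance=128
... )
>>> # buckets has the same shape as relative_position, with values in [0, 32)
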
class cerebras.modelzoo.layers.RMSNorm[source]#

Bases: torch.nn.Module

Construct a layernorm module in the T5 style: no bias and no subtraction of the mean.

__init__(hidden_size, eps=1e-06, device=None)[source]#

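Example (illustrative sketch; hidden_size matches the last dimension of the input):

>>> import torch
>>> from cerebras.modelzoo.layers import RMSNorm
>>> norm = RMSNorm(hidden_size=64)
>>> x = torch.rand(2, 10, 64)
>>> y = norm(x)   # same shape as the input
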
class cerebras.modelzoo.layers.Transformer[source]#

Bases: torch.nn.Module

A transformer model. Users are able to modify the attributes as needed. The architecture is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users can build the BERT model (https://arxiv.org/abs/1810.04805) with the corresponding parameters.

Parameters
  • d_model – the number of expected features in the encoder/decoder inputs (default=512).

  • nhead – the number of heads in the multihead attention models (default=8).

  • num_encoder_layers – the number of sub-encoder-layers in the encoder (default=6).

  • num_decoder_layers – the number of sub-decoder-layers in the decoder (default=6).

  • dim_feedforward – the dimension of the feedforward network model (default=2048).

  • dropout – the dropout value (default=0.1).

  • activation – the activation function of encoder/decoder intermediate layer, can be a string (“relu” or “gelu”) or a unary callable. Default: gelu

  • custom_encoder – custom encoder (default=None).

  • custom_decoder – custom decoder (default=None).

  • layer_norm_eps – the eps value in layer normalization components (default=1e-5).

  • batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: True (batch, seq, feature).

  • norm_first – if True, encoder and decoder layers will perform LayerNorms before other attention and feedforward operations, otherwise after. Default: False (after).

  • attention_type – Should be in [“scaled_dot_product”, “dot_product”].

  • use_projection_bias_in_attention – Add bias to Q,K,V projections in the Attention layer. Defaults to False.

  • use_ffn_bias_in_attention – Add bias in the concluding FFN in the Attention layer. Defaults to False.

  • use_ffn_bias – Add bias in all dense layers of the decoder’s ffn sublayer.

  • attention_initializer – Attention layer initializer. Defaults to “xavier_uniform”.

  • ffn_initializer – FFN layer initializer. Defaults to “xavier_uniform”.

  • device (optional) – Device to create the model parameters on, can be a cuda device or CS device.

Examples::
>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand((10, 32, 512))
>>> tgt = torch.rand((20, 32, 512))
>>> out = transformer_model(src, tgt)

Note: A full example to apply nn.Transformer module for the word language model is available in https://github.com/pytorch/examples/tree/master/word_language_model

__init__(d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1, activation: Union[str, Callable[[torch.Tensor], torch.Tensor]] = 'gelu', custom_encoder: Optional[Any] = None, custom_decoder: Optional[Any] = None, layer_norm_eps: float = 1e-05, batch_first: bool = True, norm_first: bool = False, attention_type='scaled_dot_product', use_projection_bias_in_attention=False, use_ffn_bias_in_attention=False, use_ffn_bias=False, attention_initializer='xavier_uniform', ffn_initializer='xavier_uniform', device=None) None[source]#
forward(src: torch.Tensor, tgt: torch.Tensor, src_mask: Optional[torch.Tensor] = None, tgt_mask: Optional[torch.Tensor] = None, memory_mask: Optional[torch.Tensor] = None, src_key_padding_mask: Optional[torch.Tensor] = None, tgt_key_padding_mask: Optional[torch.Tensor] = None, memory_key_padding_mask: Optional[torch.Tensor] = None) torch.Tensor[source]#

Take in and process masked source/target sequences.

Parameters
  • src – the sequence to the encoder (required).

  • tgt – the sequence to the decoder (required).

  • src_mask – the additive mask for the src sequence (optional).

  • tgt_mask – the additive mask for the tgt sequence (optional).

  • memory_mask – the additive mask for the encoder output (optional).

  • src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).

  • tgt_key_padding_mask – the ByteTensor mask for tgt keys per batch (optional).

  • memory_key_padding_mask – the ByteTensor mask for memory keys per batch (optional).

Shape:
  • src: \((S, E)\) for unbatched input, \((S, N, E)\) if batch_first=False or (N, S, E) if batch_first=True.

  • tgt: \((T, E)\) for unbatched input, \((T, N, E)\) if batch_first=False or (N, T, E) if batch_first=True.

  • src_mask: \((S, S)\) or \((N\cdot\text{num\_heads}, S, S)\).

  • tgt_mask: \((T, T)\) or \((N\cdot\text{num\_heads}, T, T)\).

  • memory_mask: \((T, S)\).

  • src_key_padding_mask: \((S)\) for unbatched input otherwise \((N, S)\).

  • tgt_key_padding_mask: \((T)\) for unbatched input otherwise \((N, T)\).

  • memory_key_padding_mask: \((S)\) for unbatched input otherwise \((N, S)\).

Note: [src/tgt/memory]_mask ensures that position i is allowed to attend the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight. [src/tgt/memory]_key_padding_mask provides specified elements in the key to be ignored by the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, the positions with the value of True will be ignored while the position with the value of False will be unchanged.

  • output: \((T, E)\) for unbatched input, \((T, N, E)\) if batch_first=False or (N, T, E) if batch_first=True.

Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.

where S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number

Examples

>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
class cerebras.modelzoo.layers.TransformerDecoder[source]#

Bases: torch.nn.Module

TransformerDecoder is a stack of N decoder layers

Parameters
  • decoder_layer – an instance of the TransformerDecoderLayer() class (required).

  • num_layers – the number of sub-decoder-layers in the decoder (required).

  • norm – the layer normalization component (optional).

Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = transformer_decoder(tgt, memory)
__init__(decoder_layer, num_layers, norm=None)[source]#
forward(tgt: torch.Tensor, memory: Optional[torch.Tensor] = None, tgt_mask: Optional[torch.Tensor] = None, sparse_mask: Optional[torch.Tensor] = None, memory_mask: Optional[torch.Tensor] = None, tgt_key_padding_mask: Optional[torch.Tensor] = None, memory_key_padding_mask: Optional[torch.Tensor] = None, self_attn_position_bias: Optional[torch.Tensor] = None, cross_attn_position_bias: Optional[torch.Tensor] = None, rotary_position_embedding_helper: Optional[cerebras.modelzoo.layers.RotaryPositionEmbeddingHelper.RotaryPositionEmbeddingHelper] = None, past_kv: Optional[List[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]] = None, cache_present_kv: bool = False, extract_layer_idx: Optional[int] = None, **extra_args) Union[torch.Tensor, Tuple[torch.Tensor, List[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]]][source]#

Pass the inputs (and mask) through the decoder layer in turn.

Parameters
  • tgt – the sequence to the decoder (required).

  • memory – the sequence from the last layer of the encoder (optional).

  • tgt_mask – the mask for the tgt sequence (optional).

  • memory_mask – the mask for the memory sequence (optional).

  • tgt_key_padding_mask – the mask for the tgt keys per batch (optional).

  • memory_key_padding_mask – the mask for the memory keys per batch (optional).

  • self_attn_position_bias – the tensor containing position bias to apply in self-attention, can be obtained from relative or alibi position embeddings.

  • cross_attn_position_bias – similar to self_attn_position_bias, this is the tensor containing position bias to apply in cross-attention.

  • rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.

  • past_kv – Past keys and values for each of the decoder layers (optional).

  • cache_present_kv – Specifies if the present keys and values must be cached and returned. (optional).

  • extract_layer_idx – (inclusive) layer index in the range [0, self.num_layers) (zero-indexed). Applies decoder layers up to (and including) extract_layer_idx instead of all decoder layers. For example, extract_layer_idx=3 runs the forward pass from decoder_block_0 through decoder_block_3 and returns the outputs of decoder_block_3. If extract_layer_idx = None and norm != None, then the output returned is decoder_block_{self.num_layers-1} -> norm -> output.

Shape:

see the docs in Transformer class.

class cerebras.modelzoo.layers.TransformerDecoderLayer[source]#

Bases: torch.nn.Module

TransformerDecoderLayer is made up of self-attn, multihead-attn and feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.

Parameters
  • d_model – the number of expected features in the input (required).

  • nhead – the number of heads in the multihead-attention models (required).

  • dim_feedforward – the dimension of the feedforward network model (default=2048).

  • dropout – the dropout value (default=0.1).

  • activation – the activation function of the intermediate layer, can be a string (“relu” or “gelu”) or a unary callable. Default: gelu

  • layer_norm_eps – the eps value in layer normalization components (default=1e-5).

  • batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: True (batch, seq, feature).

  • norm_layer – the normalization class that will be used before/after FF layers (default=nn.LayerNorm)

  • norm_first – if True, layer norm is done prior to self attention, multihead attention and feedforward operations, respectively. Otherwise it’s done after. Default: False (after).

  • attention_dropout_rate – Attention dropout rate. If None, defaults to dropout.

  • attention_softmax_fp32 – Use FP32 softmax in attention block.

  • use_projection_bias_in_attention – Add bias to Q,K,V projections in the Attention layer. Defaults to False.

  • attention_type – Should be in [“scaled_dot_product”, “dot_product”]

  • scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (the per-head dimension, d_model / nhead) instead of sqrt(d).

  • attention_inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to d_model

  • add_cross_attention – If True, adds cross-attention layer between encoder/decoder, otherwise, only self-attention is used in the decoder (GPT-style models should set to False)

  • use_ffn_bias_in_attention – Add bias in the concluding FFN in the Attention layer. Defaults to False.

  • use_ffn_bias – Add bias in all dense layers of the decoder’s ffn sublayer

  • attention_initializer – Attention layer initializer. Defaults to “xavier_uniform”.

  • attention_q_initializer – Query projection kernel initializer. If not specified, the query will be initialized via attention_initializer

  • attention_output_layer_initializer – attention output layer projection initializer. If not specified, the output will be initialized via attention_initializer

  • ffn_initializer – FFN layer initializer. Defaults to “xavier_uniform”.

  • ffn_output_layer_initializer – If not None, initialize the last FFN layer with this initializer. Defaults to None.

  • use_ff_layer1_dropout – If True, dropout will be enabled after the first feed forward layer. Default: True

  • use_ff_layer2_dropout – If True, dropout will be enabled after the second feed forward layer. Default: True

  • ffn_dropout_rate – Controls dropout rate of FF’s first layer. If None, defaults to dropout.

Examples

>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
>>> memory = torch.rand(32, 10, 512)
>>> tgt = torch.rand(32, 20, 512)
>>> out = decoder_layer(tgt, memory)
__init__(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: Union[str, Callable[[torch.Tensor], torch.Tensor]] = 'gelu', layer_norm_eps: float = 1e-05, batch_first: bool = True, norm_layer: Type[torch.nn.Module] = torch.nn.LayerNorm, norm_first: bool = False, attention_module: Union[str, torch.nn.Module] = 'aiayn_attention', extra_attention_params={}, device=None, add_cross_attention: bool = True, attention_dropout_rate: Optional[float] = None, attention_softmax_fp32: Optional[bool] = True, attention_type='scaled_dot_product', scale_qk_dot_by_d=False, scale_qk_dot_by_layer_idx=False, attention_inner_dim=None, cross_attention_kv_dim=None, use_projection_bias_in_attention=False, use_ffn_bias_in_attention=False, use_ffn_bias=False, attention_initializer='xavier_uniform', attention_q_initializer=None, attention_output_layer_initializer=None, ffn_initializer='xavier_uniform', ffn_output_layer_initializer=None, use_ff_layer1_dropout: bool = True, use_ff_layer2_dropout: bool = True, ffn_dropout_rate: Optional[float] = None) None[source]#
forward(tgt: torch.Tensor, memory: Optional[torch.Tensor] = None, tgt_mask: Optional[torch.Tensor] = None, memory_mask: Optional[torch.Tensor] = None, tgt_key_padding_mask: Optional[torch.Tensor] = None, memory_key_padding_mask: Optional[torch.Tensor] = None, rotary_position_embedding_helper: Optional[cerebras.modelzoo.layers.RotaryPositionEmbeddingHelper.RotaryPositionEmbeddingHelper] = None, past_kv: Optional[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]] = None, cache_present_kv: bool = False, self_attn_position_bias: Optional[torch.Tensor] = None, cross_attn_position_bias: Optional[torch.Tensor] = None, layer_idx: Optional[int] = None, **extra_args) Union[torch.Tensor, Tuple[torch.Tensor, Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]][source]#

Pass the inputs (and mask) through the decoder layer.

Parameters
  • tgt – the sequence to the decoder layer (required).

  • memory – the sequence from the last layer of the encoder (required).

  • tgt_mask – the mask for the tgt sequence (optional).

  • memory_mask – the mask for the memory sequence (optional).

  • tgt_key_padding_mask – the mask for the tgt keys per batch (optional).

  • memory_key_padding_mask – the mask for the memory keys per batch (optional).

  • rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.

  • past_kv – Past keys and values for self attention and (if applicable) cross attention modules. Key/value tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. (optional).

  • cache_present_kv – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. (optional).

  • self_attn_position_bias – the tensor containing position bias to apply in self-attention, can be obtained from relative or alibi position embeddings.

Shape:

see the docs in Transformer class.

class cerebras.modelzoo.layers.TransformerEncoder[source]#

Bases: torch.nn.Module

TransformerEncoder is a stack of N encoder layers

Parameters
  • encoder_layer – an instance of the TransformerEncoderLayer() class (required).

  • num_layers – the number of sub-encoder-layers in the encoder (required).

  • norm – the layer normalization component (optional).

  • enable_nested_tensor – if True, input will automatically convert to nested tensor (and convert back on output). This will improve the overall performance of TransformerEncoder when padding rate is high. Default: False (disabled).

Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
>>> src = torch.rand(10, 32, 512)
>>> out = transformer_encoder(src)
__init__(encoder_layer, num_layers, norm=None, enable_nested_tensor=False)[source]#
forward(src: torch.Tensor, mask: Optional[torch.Tensor] = None, src_key_padding_mask: Optional[torch.Tensor] = None, rotary_position_embedding_helper: Optional[cerebras.modelzoo.layers.RotaryPositionEmbeddingHelper.RotaryPositionEmbeddingHelper] = None, self_attn_position_bias: Optional[torch.Tensor] = None, extract_layer_idx: Optional[int] = None, **extra_args) torch.Tensor[source]#

Pass the input through the encoder layers in turn.

Parameters
  • src – the sequence to the encoder (required).

  • mask – the mask for the src sequence (optional).

  • src_key_padding_mask – the mask for the src keys per batch (optional).

  • rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.

  • self_attn_position_bias – the tensor containing position bias to apply in self-attention, can be obtained from relative or alibi position embeddings.

  • extract_layer_idx – (inclusive) layer index in the range [0, self.num_layers) (zero-indexed). Applies encoder layers up to (and including) extract_layer_idx instead of all encoder layers. For example, extract_layer_idx=3 runs the forward pass from encoder_block_0 through encoder_block_3 and returns the outputs of encoder_block_3. If extract_layer_idx = None and norm != None, then the output returned is encoder_block_{self.num_layers-1} -> norm -> output.

Shape:

see the docs in Transformer class.

class cerebras.modelzoo.layers.TransformerEncoderLayer[source]#

Bases: torch.nn.Module

TransformerEncoderLayer is made up of self-attn and feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.

Parameters
  • d_model – the number of expected features in the input (required).

  • nhead – the number of heads in the multihead attention models (required).

  • dim_feedforward – the dimension of the feedforward network model (default=2048).

  • dropout – the dropout value (default=0.1).

  • activation – the activation function of the intermediate layer, can be a string (“relu” or “gelu”) or a unary callable. Default: gelu

  • layer_norm_eps – the eps value in layer normalization components (default=1e-5).

  • batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: True (batch, seq, feature).

  • norm_layer – the normalization class that will be used before/after FF layers (default=nn.LayerNorm)

  • norm_first – if True, layer norm is done prior to attention and feedforward operations, respectively. Otherwise it’s done after. Default: False (after).

  • attention_dropout_rate – Attention dropout rate. If None, defaults to dropout.

  • use_projection_bias_in_attention – Add bias to Q,K,V projections in the Attention layer. Defaults to False.

  • attention_type – Should be in [“scaled_dot_product”, “dot_product”]

  • attention_softmax_fp32 – Use FP32 softmax in attention block.

  • attention_inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to d_model

  • add_cross_attention – If True, adds cross-attention layer between encoder/decoder, otherwise, only self-attention is used in the decoder (GPT-style models should set to False)

  • use_ffn_bias_in_attention – Add bias in the concluding FFN in the Attention layer. Defaults to False.

  • use_ffn_bias – Add bias in all dense layers of the decoder’s ffn sublayer

  • attention_initializer – Attention layer initializer. Defaults to “xavier_uniform”.

  • attention_q_initializer – Query projection kernel initializer. If not specified, the query will be initialized via attention_initializer

  • ffn_initializer – FFN layer initializer. Defaults to “xavier_uniform”.

  • ffn_output_layer_initializer – If not None, initialize the last FFN layer with this initializer. Defaults to None.

  • use_ff_layer1_dropout – If True, dropout will be enabled after the first feed forward layer. Default: True

  • use_ff_layer2_dropout – If True, dropout will be enabled after the second feed forward layer. Default: True

  • ffn_dropout_rate – Controls dropout rate of FF’s first layer. If None, defaults to dropout.

Example

When batch_first is True:

>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
>>> src = torch.rand(32, 10, 512)
>>> out = encoder_layer(src)

__init__(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: Union[str, Callable[[torch.Tensor], torch.Tensor]] = 'gelu', layer_norm_eps: float = 1e-05, batch_first: bool = True, norm_layer: Type[torch.nn.Module] = torch.nn.LayerNorm, norm_first: bool = False, attention_module: Union[str, torch.nn.Module] = 'aiayn_attention', extra_attention_params={}, device=None, attention_dropout_rate: Optional[float] = None, attention_type='scaled_dot_product', attention_softmax_fp32: Optional[bool] = True, attention_inner_dim=None, use_projection_bias_in_attention=False, use_ffn_bias_in_attention=False, use_ffn_bias=False, attention_initializer='xavier_uniform', attention_q_initializer=None, ffn_initializer='xavier_uniform', ffn_output_layer_initializer=None, use_ff_layer1_dropout: bool = True, use_ff_layer2_dropout: bool = True, ffn_dropout_rate: Optional[float] = None) None[source]#
forward(src: torch.Tensor, src_mask: Optional[torch.Tensor] = None, src_key_padding_mask: Optional[torch.Tensor] = None, rotary_position_embedding_helper: Optional[cerebras.modelzoo.layers.RotaryPositionEmbeddingHelper.RotaryPositionEmbeddingHelper] = None, self_attn_position_bias: Optional[torch.Tensor] = None, **extra_args) torch.Tensor[source]#

Pass the input through the encoder layer.

Parameters
  • src – the sequence to the encoder layer (required).

  • src_mask – the mask for the src sequence (optional).

  • src_key_padding_mask – the mask for the src keys per batch (optional).

  • rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.

  • self_attn_position_bias – the tensor containing position bias to apply in self-attention, can be obtained from relative or alibi position embeddings.

Shape:

see the docs in Transformer class.

class cerebras.modelzoo.layers.ViTEmbeddingLayer[source]#

Bases: torch.nn.Module

__init__(image_size=[224, 224], num_channels=3, patch_size=[16, 16], hidden_size=768, initializer_range=0.02, embedding_dropout_rate=0.0, projection_initializer=None, position_embedding_initializer=None, position_embedding_type='learned', use_conv_patchified_embedding=False, prepend_cls_token=False, init_conv_like_linear=False, use_post_embed_layer_norm=False, layer_norm_epsilon=1e-05, use_embed_proj_bias=True)[source]#
select_patches(patches, patch_indices=None)[source]#

Select a subset of patches based on patch_indices.

Parameters
  • patches (Tensor) – shape [batch_size, full_sequence_length, hidden_size]

  • patch_indices (Tensor) – shape [batch_size, subset_sequence_length]

Returns

shape [batch_size, subset_sequence_length, hidden_size]

Return type

patches (Tensor)

forward(input_images, patch_indices=None)[source]#

Applies patching and linear projection to the input images.

Parameters
  • input_images (Tensor) – shape [batch_size, num_channels, height, width] if use_conv_patchified_embedding, else [batch_size, sequence_len, embedding_size].

  • patch_indices (Tensor) – shape [batch_size, subset_seq_length]. If specified, the embedding layer will select a subset of all image patches based on the indices. This is used for applications like MAE. Defaults to None.

Returns

shape [batch_size, sequence_length, hidden_size].

Return type

image_embeddings (Tensor)
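
A minimal usage sketch (the constructor defaults above give 224x224 images, 16x16 patches, and hidden_size=768; use_conv_patchified_embedding=True is chosen so the input is image-shaped, and no cls token is prepended by default):

>>> import torch
>>> from cerebras.modelzoo.layers import ViTEmbeddingLayer
>>> vit_embed = ViTEmbeddingLayer(use_conv_patchified_embedding=True)
>>> images = torch.rand(2, 3, 224, 224)   # [batch_size, num_channels, height, width]
>>> embeddings = vit_embed(images)        # shape [2, 196, 768], where 196 = (224/16)**2 patches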