modelzoo.transformers.pytorch.bert.bert_model.BertModel#

class modelzoo.transformers.pytorch.bert.bert_model.BertModel[source]#

Bases: torch.nn.Module

The model behaves as a bidirectional encoder (with only self-attention), following the architecture described in Attention Is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.

Parameters
  • vocab_size (int, optional, defaults to 50257) – Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the input_ids passed when calling BertModel.

  • max_position_embeddings (int, optional, defaults to 1024) – The maximum sequence length that this model might ever be used with. Typically set this to something large (e.g., 512, 1024, or 2048) just in case.

  • position_embedding_type (str, optional, defaults to ‘learned’) – The type of position embeddings; should be either ‘learned’ or ‘fixed’.

  • hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.

  • embedding_dropout_rate (float, optional, defaults to 0.1) – The dropout ratio for the word embeddings.

  • embedding_pad_token_id (int, optional, defaults to 0) – The embedding vector at embedding_pad_token_id is not updated during training.

  • num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • layer_norm_epsilon (float, optional, defaults to 1e-5) – The epsilon used by the layer normalization layers.

  • num_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • attention_type (str, optional, defaults to ‘scaled_dot_product’) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product.

  • attention_softmax_fp32 (bool, optional, defaults to True) – If True, the attention softmax is computed in fp32 precision; otherwise, it uses fp16/bf16 precision.

  • dropout_rate (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • nonlinearity (str, optional, defaults to ‘gelu’) – The non-linear activation function (function or string) in the encoder and pooler.

  • attention_dropout_rate (float, optional, defaults to 0.1) – The dropout ratio for the attention probabilities.

  • use_projection_bias_in_attention (bool, optional, defaults to True) – If True, bias is used on the projection layers in attention.

  • use_ffn_bias_in_attention (bool, optional, defaults to True) – If True, bias is used in the dense layer in the attention.

  • filter_size (int, optional, defaults to 3072) – Dimensionality of the feed-forward layer in the Transformer encoder.

  • use_ffn_bias (bool, optional, defaults to True) – If True, bias is used in the dense layer in the encoder.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated normal initializer used as the default initializer.

  • num_segments (int, optional, defaults to 2) – The vocabulary size of the segments (sentence types).

  • embeddings_initializer (dict, optional, defaults to None) – Initializer for the word embeddings.

  • position_embeddings_initializer (dict, optional, defaults to None) – Initializer for the position embeddings (used only with learned position embeddings).

  • segment_embeddings_initializer (dict, optional, defaults to None) – Initializer for the segment embeddings.

  • add_pooling_layer (bool, optional, defaults to True) – Whether to add the pooling layer for sequence classification.
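
Example (a non-authoritative sketch): the snippet below constructs the model with a BERT-Base-style configuration. It assumes the modelzoo package is importable and uses only keyword arguments that appear in the __init__ signature documented below; the specific values are illustrative, not taken from any modelzoo config.

    from modelzoo.transformers.pytorch.bert.bert_model import BertModel

    # Illustrative BERT-Base-style configuration (assumed values, not an official config).
    model = BertModel(
        vocab_size=30522,
        max_position_embeddings=512,
        position_embedding_type="learned",
        hidden_size=768,
        num_hidden_layers=12,
        num_heads=12,
        filter_size=3072,
        nonlinearity="gelu",
        dropout_rate=0.1,
        attention_dropout_rate=0.1,
        add_pooling_layer=True,
    )
    model.eval()  # torch.nn.Module method; disables dropout for inference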

Methods

compute_input_embeddings

forward

reset_parameters

__call__(*args: Any, **kwargs: Any) → Any#

Call self as a function.

__init__(vocab_size=50257, max_position_embeddings=1024, position_embedding_type='learned', hidden_size=768, embedding_dropout_rate=0.1, embedding_pad_token_id=0, mask_padding_in_positional_embed=False, rotary_dim=None, rope_theta=10000, num_relative_attention_buckets=32, alibi_trainable_slopes=False, pos_scaling_factor=1.0, num_hidden_layers=12, layer_norm_epsilon=1e-05, norm_first=False, embedding_layer_norm=True, num_heads=12, attention_module='aiayn_attention', extra_attention_params={}, attention_type='scaled_dot_product', attention_softmax_fp32=True, dropout_rate=0.1, nonlinearity='gelu', pooler_nonlinearity=None, attention_dropout_rate=0.1, use_projection_bias_in_attention=True, use_ffn_bias_in_attention=True, filter_size=3072, use_ffn_bias=True, use_final_layer_norm=False, initializer_range=0.02, num_segments=2, default_initializer=None, embeddings_initializer=None, position_embeddings_initializer=None, segment_embeddings_initializer=None, add_pooling_layer=True, **extra_args)[source]#
static __new__(cls, *args: Any, **kwargs: Any) → Any#
forward(input_ids=None, attention_mask=None, position_ids=None, segment_ids=None)[source]#
Parameters
  • input_ids (Tensor) – The id of input tokens. Can be of shape [batch_size, seq_length].

  • position_ids (Tensor) – The position id of input tokens. Can be of shape [batch_size, seq_length].

  • segment_ids (Tensor) – The segment id of input tokens, indicating which sequence each token belongs to. Can be of shape [batch_size, seq_length].

  • attention_mask (Tensor) – Can be 2D of shape [batch_size, seq_length], 3D of shape [batch_size, query_length, seq_length], or 4D of shape [batch_size, num_heads, query_length, seq_length].
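
Example (a hedged forward-pass sketch): reuses the model instance from the construction example above; tensor shapes follow the parameter descriptions, and the structure of the returned value is intentionally not asserted here.

    import torch

    batch_size, seq_length = 2, 128

    input_ids = torch.randint(0, 30522, (batch_size, seq_length))          # token ids
    attention_mask = torch.ones(batch_size, seq_length, dtype=torch.long)  # 2D mask: attend to all tokens
    segment_ids = torch.zeros(batch_size, seq_length, dtype=torch.long)    # single-segment input

    # position_ids is optional in the signature above and is omitted here.
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        segment_ids=segment_ids,
    )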