modelzoo.transformers.pytorch.bert.bert_model.BertModel#

class modelzoo.transformers.pytorch.bert.bert_model.BertModel[source]#

Bases: torch.nn.Module

The model behaves as a bidirectional encoder (with only self-attention), following the architecture described in Attention Is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.

Parameters
  • vocab_size (int, optional, defaults to 50257) – Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the input_ids passed when calling BertModel.

  • max_position_embeddings (int, optional, defaults to 1024) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512, 1024, or 2048).

  • position_embedding_type (str, optional, defaults to 'learned') – The type of position embeddings; should be either 'learned' or 'fixed'.

  • hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.

  • embedding_dropout_rate (float, optional, defaults to 0.1) – The dropout ratio for the word embeddings.

  • embedding_pad_token_id (int, optional, defaults to 0) – The embedding vector at embedding_pad_token_id is not updated during training.

  • num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • layer_norm_epsilon (float, optional, defaults to 1e-5) – The epsilon used by the layer normalization layers.

  • num_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • attention_type (str, optional, defaults to ‘scaled_dot_product’) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product.

  • attention_softmax_fp32 (bool, optional, defaults to True) – If True, the attention softmax is computed in fp32 precision; otherwise, fp16/bf16 precision is used.

  • dropout_rate (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • nonlinearity (str, optional, defaults to gelu) – The non-linear activation function (function or string) in the encoder and pooler.

  • attention_dropout_rate (float, optional, defaults to 0.1) – The dropout ratio for the attention probabilities.

  • use_projection_bias_in_attention (bool, optional, defaults to True) – If True, bias is used on the projection layers in attention.

  • use_ffn_bias_in_attention (bool, optional, defaults to True) – If True, bias is used in the dense layer in the attention.

  • filter_size (int, optional, defaults to 3072) – Dimensionality of the feed-forward layer in the Transformer encoder.

  • use_ffn_bias (bool, optional, defaults to True) – If True, bias is used in the dense layer in the encoder.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated normal initializer used as the default initializer.

  • num_segments (int, optional, defaults to 2) – The vocabulary size of the segments (sentence types).

  • embeddings_initializer (dict, optional, defaults to None) – Initializer for the word embeddings.

  • position_embeddings_initializer (dict, optional, defaults to None) – Initializer for the position embeddings (if using learned position embeddings).

  • segment_embeddings_initializer (dict, optional, defaults to None) – Initializer for the segment embeddings.

  • add_pooling_layer (bool, optional, defaults to True) – Whether to add the pooling layer for sequence classification.

Methods

  • forward – input_ids: The id of input tokens

  • reset_parameters

__call__(*args: Any, **kwargs: Any) → Any#

Call self as a function.

__init__(vocab_size=50257, max_position_embeddings=1024, position_embedding_type='learned', hidden_size=768, embedding_dropout_rate=0.1, embedding_pad_token_id=0, mask_padding_in_positional_embed=False, num_hidden_layers=12, layer_norm_epsilon=1e-05, num_heads=12, attention_module='aiayn_attention', extra_attention_params={}, attention_type='scaled_dot_product', attention_softmax_fp32=True, dropout_rate=0.1, nonlinearity='gelu', pooler_nonlinearity=None, attention_dropout_rate=0.1, use_projection_bias_in_attention=True, use_ffn_bias_in_attention=True, filter_size=3072, use_ffn_bias=True, use_final_layer_norm=False, initializer_range=0.02, num_segments=2, default_initializer=None, embeddings_initializer=None, position_embeddings_initializer=None, segment_embeddings_initializer=None, add_pooling_layer=True, attention_kernel=None, **extra_args)[source]#
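
A minimal construction sketch, assuming the import path matches the module path in the heading; the keyword arguments simply restate defaults from the __init__ signature above and are not tuned settings.

```python
from modelzoo.transformers.pytorch.bert.bert_model import BertModel

# Build an encoder with the documented defaults spelled out explicitly.
model = BertModel(
    vocab_size=50257,
    max_position_embeddings=1024,
    hidden_size=768,
    num_hidden_layers=12,
    num_heads=12,
    filter_size=3072,
    nonlinearity="gelu",
    add_pooling_layer=True,
)
```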
static __new__(cls, *args: Any, **kwargs: Any) → Any#
forward(input_ids=None, position_ids=None, segment_ids=None, attention_mask=None)[source]#
Parameters
  • input_ids (Tensor) – The id of input tokens. Can be of shape [batch_size, seq_length].

  • position_ids (Tensor) – The position id of input tokens. Can be of shape [batch_size, seq_length].

  • segment_ids (Tensor) – The segment id of input tokens, indicating which sequence the token belongs to. Can be of shape [batch_size, seq_length].

  • attention_mask (Tensor) – Can be 2D of shape [batch_size, seq_length], 3D of shape [batch_size, query_length, seq_length], or 4D of shape [batch_size, num_heads, query_length, seq_length].
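
A hedged usage sketch of forward, assuming a model built with its default arguments. The dummy tensors follow the shapes listed above; the 1 = attend / 0 = pad convention for the 2D attention_mask and the structure of the return value are assumptions not documented on this page.

```python
import torch
from modelzoo.transformers.pytorch.bert.bert_model import BertModel

batch_size, seq_length = 2, 128
model = BertModel()  # defaults: vocab_size=50257, max_position_embeddings=1024

input_ids = torch.randint(0, 50257, (batch_size, seq_length))               # [batch_size, seq_length]
position_ids = torch.arange(seq_length).unsqueeze(0).expand(batch_size, -1)  # [batch_size, seq_length]
segment_ids = torch.zeros(batch_size, seq_length, dtype=torch.long)          # num_segments defaults to 2
attention_mask = torch.ones(batch_size, seq_length)                          # assumed: 1 = attend, 0 = pad

# __call__ dispatches to forward(); the return structure is not documented on this page.
outputs = model(
    input_ids=input_ids,
    position_ids=position_ids,
    segment_ids=segment_ids,
    attention_mask=attention_mask,
)
```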