modelzoo.transformers.pytorch.bert.bert_model.BertModel
- class modelzoo.transformers.pytorch.bert.bert_model.BertModel
Bases: torch.nn.Module
The model behaves as a bidirectional encoder (with only self-attention), following the architecture described in Attention Is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
- Parameters
  - vocab_size (int, optional, defaults to 30522) – Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the input_ids passed when calling BertModel or TFBertModel.
  - max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512, 1024, or 2048).
  - position_embedding_type (str, optional, defaults to 'learned') – The type of position embeddings; should be either 'learned' or 'fixed'.
  - hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.
  - embedding_dropout_rate (float, optional, defaults to 0.1) – The dropout ratio for the word embeddings.
  - embedding_pad_token_id (int, optional, defaults to 0) – The embedding vector at embedding_pad_token_id is not updated during training.
  - num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.
  - layer_norm_epsilon (float, optional, defaults to 1e-5) – The epsilon used by the layer normalization layers.
  - num_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.
  - attention_type (str, optional, defaults to 'scaled_dot_product') – The attention variant to execute. Currently accepts dot_product and scaled_dot_product.
  - attention_softmax_fp32 (bool, optional, defaults to True) – If True, the attention softmax is computed in fp32 precision; otherwise it uses fp16/bf16 precision.
  - dropout_rate (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  - nonlinearity (str, optional, defaults to 'gelu') – The non-linear activation function (function or string) in the encoder and pooler.
  - attention_dropout_rate (float, optional, defaults to 0.1) – The dropout ratio for the attention probabilities.
  - use_projection_bias_in_attention (bool, optional, defaults to True) – If True, bias is used on the projection layers in attention.
  - use_ffn_bias_in_attention (bool, optional, defaults to True) – If True, bias is used in the dense layer in the attention block.
  - filter_size (int, optional, defaults to 3072) – Dimensionality of the feed-forward layer in the Transformer encoder.
  - use_ffn_bias (bool, optional, defaults to True) – If True, bias is used in the dense layers of the encoder.
  - initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer used as the default initializer.
  - num_segments (int, optional, defaults to 2) – The vocabulary size of the segments (sentence types).
  - embeddings_initializer (dict, optional, defaults to None) – Initializer for the word embeddings.
  - position_embeddings_initializer (dict, optional, defaults to None) – Initializer for the position embeddings (when using learned position embeddings).
  - segment_embeddings_initializer (dict, optional, defaults to None) – Initializer for the segment embeddings.
  - add_pooling_layer (bool, optional, defaults to True) – Whether to add the pooling layer for sequence classification.
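A minimal construction sketch follows, assuming the Cerebras ModelZoo package is importable under the module path shown on this page; the keyword arguments are an illustrative BERT-base-like selection of the constructor parameters documented above, not a prescribed configuration.

```python
from modelzoo.transformers.pytorch.bert.bert_model import BertModel

# Illustrative BERT-base-like configuration; every keyword below is a
# documented constructor parameter, and anything not listed keeps its default.
model = BertModel(
    vocab_size=30522,
    max_position_embeddings=512,
    position_embedding_type="learned",
    hidden_size=768,
    num_hidden_layers=12,
    num_heads=12,
    filter_size=3072,
    nonlinearity="gelu",
    dropout_rate=0.1,
    attention_dropout_rate=0.1,
    num_segments=2,
    add_pooling_layer=True,
)
```

Because the class derives from torch.nn.Module, the resulting instance can be inspected, saved, and trained with standard PyTorch tooling.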
Methods
- compute_input_embeddings
  - param input_ids – The id of input tokens.
- reset_parameters
- __call__(*args: Any, **kwargs: Any) → Any
Call self as a function.
- __init__(vocab_size=50257, max_position_embeddings=1024, position_embedding_type='learned', hidden_size=768, embedding_dropout_rate=0.1, embedding_pad_token_id=0, mask_padding_in_positional_embed=False, rotary_dim=None, rope_theta=10000, num_relative_attention_buckets=32, alibi_trainable_slopes=False, pos_scaling_factor=1.0, num_hidden_layers=12, layer_norm_epsilon=1e-05, norm_first=False, embedding_layer_norm=True, num_heads=12, attention_module='aiayn_attention', extra_attention_params={}, attention_type='scaled_dot_product', attention_softmax_fp32=True, dropout_rate=0.1, nonlinearity='gelu', pooler_nonlinearity=None, attention_dropout_rate=0.1, use_projection_bias_in_attention=True, use_ffn_bias_in_attention=True, filter_size=3072, use_ffn_bias=True, use_final_layer_norm=False, initializer_range=0.02, num_segments=2, default_initializer=None, embeddings_initializer=None, position_embeddings_initializer=None, segment_embeddings_initializer=None, add_pooling_layer=True, **extra_args)
- static __new__(cls, *args: Any, **kwargs: Any) → Any
- forward(input_ids=None, attention_mask=None, position_ids=None, segment_ids=None)
- Parameters
  - input_ids (Tensor) – The IDs of input tokens. Can be of shape [batch_size, seq_length].
  - position_ids (Tensor) – The position IDs of input tokens. Can be of shape [batch_size, seq_length].
  - segment_ids (Tensor) – The segment IDs of input tokens, indicating which sequence each token belongs to. Can be of shape [batch_size, seq_length].
  - attention_mask (Tensor) – Can be 2D of shape [batch_size, seq_length], 3D of shape [batch, query_length, seq_length], or 4D of shape [batch, num_heads, query_length, seq_length].
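A minimal forward-pass sketch follows. The tensor shapes match the documented [batch_size, seq_length] layout; the dtypes, the all-ones attention mask, the single-segment IDs, and the structure of the returned value (for instance, encoder hidden states plus a pooled output when add_pooling_layer=True) are assumptions not specified on this page.

```python
import torch

from modelzoo.transformers.pytorch.bert.bert_model import BertModel

# Small encoder for illustration (see the construction sketch above).
model = BertModel(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_heads=12)
model.eval()

# Toy batch; shapes follow the documented [batch_size, seq_length] layout.
batch_size, seq_length = 2, 128
input_ids = torch.randint(0, 30522, (batch_size, seq_length))
attention_mask = torch.ones(batch_size, seq_length, dtype=torch.long)  # dtype assumed
segment_ids = torch.zeros(batch_size, seq_length, dtype=torch.long)    # single-segment input

with torch.no_grad():
    # position_ids is left as None (its default); the mask and segment IDs are
    # passed explicitly, mirroring the forward() parameters documented above.
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        segment_ids=segment_ids,
    )
```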