modelzoo.transformers.pytorch.bert.bert_pretrain_models.BertPretrainModel#
- class modelzoo.transformers.pytorch.bert.bert_pretrain_models.BertPretrainModel[source]#
Bases:
torch.nn.Module
Bert Model with two heads on top as done during pretraining: a masked language modeling head and a next sentence prediction (classification) head. Follows the paper: https://arxiv.org/abs/1810.04805.
- Parameters
  - disable_nsp (bool, optional, defaults to False) – Whether to disable next-sentence-prediction and only use masked-language-model.
  - mlm_loss_weight (float, optional, defaults to 1.0) – The scaling factor for masked-language-model loss.
  - label_smoothing (float, optional, defaults to 0.0) – The label smoothing factor used during training.
Methods
  - forward – param input_ids: The id of input tokens. Can be of shape [batch_size, seq_length].
  - get_input_embeddings
  - get_output_embeddings
  - reset_parameters
  - tie_weights
- __call__(*args: Any, **kwargs: Any) → Any#
Call self as a function.
- __init__(disable_nsp=False, mlm_loss_weight=1.0, label_smoothing=0.0, num_classes=2, mlm_nonlinearity=None, vocab_size=50257, max_position_embeddings=1024, position_embedding_type='learned', embedding_pad_token_id=0, mask_padding_in_positional_embed=False, rotary_dim=None, rope_theta=10000, num_relative_attention_buckets=32, alibi_trainable_slopes=False, pos_scaling_factor=1.0, hidden_size=768, share_embedding_weights=True, num_hidden_layers=12, layer_norm_epsilon=1e-05, num_heads=12, attention_module='aiayn_attention', extra_attention_params={}, attention_type='scaled_dot_product', attention_softmax_fp32=True, dropout_rate=0.1, nonlinearity='gelu', pooler_nonlinearity=None, attention_dropout_rate=0.1, use_projection_bias_in_attention=True, use_ffn_bias_in_attention=True, filter_size=3072, use_ffn_bias=True, use_ffn_bias_in_mlm=True, use_output_bias_in_mlm=True, initializer_range=0.02, num_segments=2)[source]#
- Parameters
  - disable_nsp (bool, optional, defaults to False) – Whether to disable next-sentence-prediction and only use masked-language-model.
  - mlm_loss_weight (float, optional, defaults to 1.0) – The scaling factor for masked-language-model loss.
  - label_smoothing (float, optional, defaults to 0.0) – The label smoothing factor used during training.
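A minimal construction sketch based on the `__init__` signature above. The import path is taken from the documented module name; the non-default values (e.g. `vocab_size=30522`, `max_position_embeddings=512`) are illustrative choices, not values prescribed by this page:

```python
import torch
# Import path assumed from the documented module name above.
from modelzoo.transformers.pytorch.bert.bert_pretrain_models import BertPretrainModel

# Illustrative configuration; all keyword arguments come from the __init__
# signature documented above, and values that differ from the defaults are
# only examples (roughly BERT-base sized).
model = BertPretrainModel(
    disable_nsp=False,          # keep the next-sentence-prediction head
    mlm_loss_weight=1.0,        # scaling factor for the MLM loss
    label_smoothing=0.0,        # no label smoothing during training
    vocab_size=30522,           # illustrative vocabulary size
    hidden_size=768,
    num_hidden_layers=12,
    num_heads=12,
    filter_size=3072,
    max_position_embeddings=512,
)
```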
- static __new__(cls, *args: Any, **kwargs: Any) → Any#
- forward(input_ids=None, attention_mask=None, position_ids=None, token_type_ids=None, masked_lm_positions=None, should_gather_mlm_labels=False)[source]#
- Parameters
  - input_ids (Tensor) – The id of input tokens. Can be of shape [batch_size, seq_length].
  - attention_mask (Tensor) – Can be 2D of shape [batch_size, seq_length], 3D of shape [batch, query_length, seq_length], or 4D of shape [batch, num_heads, query_length, seq_length].
  - position_ids (Tensor) – The position id of input tokens. Can be of shape [batch_size, seq_length].
  - token_type_ids (Tensor) – The segment id of input tokens, indicating which sequence the token belongs to. Can be of shape [batch_size, seq_length].
  - masked_lm_positions (Tensor) – Position ids of mlm tokens. Shape [batch_size, max_predictions_per_seq].
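A minimal sketch of a forward call using the model instance from the construction sketch above and the tensor shapes listed in the parameter descriptions. The tensor contents are random placeholders, and the exact return structure of `forward` is not documented on this page, so the `outputs` comment is an assumption:

```python
import torch

batch_size, seq_length, max_predictions_per_seq = 2, 128, 20

# Shapes follow the forward() parameter documentation above.
input_ids = torch.randint(0, 30522, (batch_size, seq_length))
attention_mask = torch.ones(batch_size, seq_length, dtype=torch.long)   # 2D mask
token_type_ids = torch.zeros(batch_size, seq_length, dtype=torch.long)  # single segment
masked_lm_positions = torch.zeros(
    batch_size, max_predictions_per_seq, dtype=torch.long
)  # positions of masked tokens; zeros here are placeholders only

# Assumption: the call returns the model outputs (e.g. MLM/NSP logits);
# the return structure is not specified on this page.
outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    masked_lm_positions=masked_lm_positions,
    should_gather_mlm_labels=True,
)
```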