modelzoo.transformers.pytorch.t5.t5_model.T5ForConditionalGeneration#

class modelzoo.transformers.pytorch.t5.t5_model.T5ForConditionalGeneration[source]#

Bases: torch.nn.Module

T5 Model with a language modeling head on top.

Parameters
  • src_vocab_size (int, optional, defaults to 32128) – Source vocabulary size of the T5 model. Defines the number of different tokens that can be represented by the input_ids passed when calling the model.

  • tgt_vocab_size (int, optional, defaults to 32128) – Target vocabulary size of the T5 model. Only useful if set for Transformer variant where source and target vocabularies can be different.

  • d_model (int, optional, defaults to 512) – Size of the encoder layers and the pooler layer.

  • d_kv (int, optional, defaults to 64) – Size of the key, query, value projections per attention head. d_kv does not have to be equal to d_model // num_heads.

  • d_ff (int, optional, defaults to 2048) – Size of the intermediate feed forward layer in each T5Block.

  • encoder_num_hidden_layers (int, optional, defaults to 6) – Number of hidden layers in the Transformer encoder.

  • decoder_num_hidden_layers (int, optional) – Number of hidden layers in the Transformer decoder. Defaults to the value of encoder_num_hidden_layers if not set.

  • num_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer encoder and decoder.

  • relative_attention_num_buckets (int, optional, defaults to 32) – The number of buckets to use for each attention layer.

  • norm_type (str, optional, defaults to “rmsnorm”) – Determines which type of layer norm to use. RMSNorm is the same as T5-style layer norm (no mean subtraction and no bias correction).

  • dropout_rate (float, optional, defaults to 0.1) – The ratio for all dropout layers.

  • layer_norm_epsilon (float, optional, defaults to 1e-6) – The epsilon used by the layer normalization layers.

  • initializer_factor (float, optional, defaults to 1) – A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

  • encoder_nonlinearity (string, optional, defaults to "relu") – Type of feed forward layer to be used in the encoder. Should be one of "relu", "geglu", or "gelu". T5v1.1 uses the "geglu" feed forward projection; the original T5 uses "relu".

  • decoder_nonlinearity (string, optional, defaults to "relu") – Type of feed forward layer to be used in the decoder. Should be one of "relu", "geglu", or "gelu". T5v1.1 uses the "geglu" feed forward projection; the original T5 uses "relu".

  • use_cache (bool, optional, defaults to True) – Whether or not the model should return the last key/value attention states (not used by all models).

  • position_embedding_type (string, optional, defaults to "relative") – The type of position embedding to use. Should be one of "fixed", "learned_absolute", "relative", or None. "fixed" uses a concatenation of sin curves to express relative position as used in the original Transformer paper. "learned_absolute" uses a learned vector for each position in the sequence. "relative" uses learned relative position embeddings as introduced in https://arxiv.org/abs/1803.02155, configured as done in the original T5 publication. None turns off position embedding altogether.

  • src_max_position_embeddings (int, optional, defaults to 512) – Maximum source sequence length used to train the model.

  • tgt_max_position_embeddings (int, optional, defaults to 512) – Maximum target sequence length used to train the model.

  • use_dropout_outside_residual_path (bool, optional, defaults to True) – Whether to apply dropout outside of the residual path. Set to True for T5, but False for Transformer.

  • share_encoder_decoder_embedding (bool, optional, defaults to True) – Whether to share the encoder/decoder embedding layer. Set to True for both T5 and Transformer models.

  • tie_word_embeddings (bool, optional, defaults to True) – Whether to share embedding weights between the decoder and the language model head.

  • relu_dropout_rate (float, optional, defaults to 0.1) – Dropout rate used in the FFN layer after applying the relu activation function. This parameter is set to 0 for the Transformer model, and set to dropout_rate for the default T5 configuration. Transformer reference: https://github.com/tensorflow/tensor2tensor/blob/5623deb79cfcd28f8f8c5463b58b5bd76a81fd0d/tensor2tensor/models/transformer.py#L1811 T5 reference: https://github.com/huggingface/transformers/blob/v4.15.0/src/transformers/models/t5/modeling_t5.py#L261

  • use_pre_encoder_decoder_dropout (bool, optional, defaults to False) – Whether to use a dropout layer after the positional embedding layer and encoder/decoder. Set to False for T5 and True for Transformer.

  • use_pre_encoder_decoder_layer_norm (bool, optional, defaults to True) – Whether to use layer norm before passing input tensors into the encoder/decoder. Set to True for T5 and False for Transformer.

  • use_ffn_bias (bool, optional, defaults to False) – Whether to use bias in the hidden layer with relu activation. Set to False for T5, and True for Transformer.

  • lm_loss_weight (float, optional, defaults to 1.0) – Value that scales the loss by the mean number of predictions per sequence in the dataset.

  • use_transformer_initialization (bool, optional, defaults to False) – The Transformer model tends to converge best with a scaled variant on Xavier uniform initialization used for linear layers. This contrasts the initialization used for the original T5 paper, which uses He normal initialization for linear layers. Setting this flag to True switches the initialization to the Transformer specific scaled Xavier initialization.
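
A minimal construction sketch based on the __init__ signature documented below. The reduced sizes (d_model=128, two layers per stack) are illustrative choices rather than recommended settings, and the sketch assumes the modelzoo package is importable under the path shown in the class name above:

>>> from modelzoo.transformers.pytorch.t5.t5_model import T5ForConditionalGeneration
>>> # Keyword names match __init__ below; values are deliberately small for illustration.
>>> model = T5ForConditionalGeneration(
...     src_vocab_size=32128,
...     tgt_vocab_size=32128,
...     d_model=128,
...     d_kv=32,
...     d_ff=512,
...     encoder_num_hidden_layers=2,
...     decoder_num_hidden_layers=2,
...     num_heads=4,
...     encoder_nonlinearity="geglu",   # T5v1.1-style FFN; original T5 uses "relu"
...     decoder_nonlinearity="geglu",
...     position_embedding_type="relative",
... )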

Methods

compute_sequence_output

forward

get_input_embeddings

get_output_embeddings

reset_parameters

set_input_embeddings

set_output_embeddings

tie_weights

Tie the weights between the input embeddings and the output embeddings and (if enabled) tie encoder/decoder weights.

__call__(*args: Any, **kwargs: Any) Any#

Call self as a function.

__init__(src_vocab_size=32128, tgt_vocab_size=32128, d_model=512, d_kv=64, d_ff=2048, encoder_num_hidden_layers=6, decoder_num_hidden_layers=None, num_heads=8, relative_attention_num_buckets=32, norm_type='rmsnorm', dropout_rate=0.1, relu_dropout_rate=None, layer_norm_epsilon=1e-06, initializer_factor=1.0, encoder_nonlinearity='relu', decoder_nonlinearity='relu', use_projection_bias_in_attention=False, attention_softmax_fp32=True, attention_kernel=None, use_cache=False, decoder_start_token_id=None, pad_token_id=0, position_embedding_type='relative', src_max_position_embeddings=512, tgt_max_position_embeddings=512, use_dropout_outside_residual_path=True, share_encoder_decoder_embedding=True, tie_word_embeddings=True, tie_encoder_decoder=False, use_pre_encoder_decoder_dropout=False, use_pre_encoder_decoder_layer_norm=True, use_ffn_bias=False, label_smoothing=0.0, use_transformer_initialization=False, **kwargs)[source]#
static __new__(cls, *args: Any, **kwargs: Any) Any#
forward(input_ids=None, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, encoder_outputs=None, past_key_values=None, labels=None, use_cache=None)[source]#
labels (torch.LongTensor of shape (batch_size, sequence_length), optional):

Labels for computing the language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size - 1].
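
In practice, labels are usually the target token ids with padding positions replaced by -100 so that padding does not contribute to the loss. A small illustrative sketch of that convention (the pad id of 0 mirrors the pad_token_id default in __init__ above; the token values are arbitrary):

>>> import torch
>>> decoder_target = torch.tensor([[37, 1712, 1782, 1, 0, 0]])  # 0 = pad_token_id
>>> labels = decoder_target.masked_fill(decoder_target == 0, -100)
>>> labels.tolist()
[[37, 1712, 1782, 1, -100, -100]]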

Examples:

>>> from transformers import T5Tokenizer, T5ForConditionalGeneration

>>> tokenizer = T5Tokenizer.from_pretrained('t5-small')
>>> model = T5ForConditionalGeneration.from_pretrained('t5-small')

>>> # training
>>> input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
>>> labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
>>> outputs = model(input_ids=input_ids, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits

>>> # inference
>>> input_ids = tokenizer("summarize: studies have shown that owning a dog is good for you", return_tensors="pt").input_ids  # Batch size 1
>>> outputs = model.generate(input_ids)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
>>> # studies have shown that owning a dog is good for you.
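
The example above uses the Hugging Face transformers classes. Below is a hedged sketch of calling this module's forward directly, using only the arguments documented in the signature above; the tensor contents are arbitrary, the class is aliased to avoid clashing with the Hugging Face import above, and the structure of the returned value is not documented here:

>>> import torch
>>> from modelzoo.transformers.pytorch.t5.t5_model import T5ForConditionalGeneration as CsT5
>>> model = CsT5()  # this class, constructed with its documented defaults
>>> input_ids = torch.randint(0, 32128, (1, 16))
>>> decoder_input_ids = torch.randint(0, 32128, (1, 8))
>>> outputs = model(
...     input_ids=input_ids,
...     attention_mask=torch.ones_like(input_ids),
...     decoder_input_ids=decoder_input_ids,
...     decoder_attention_mask=torch.ones_like(decoder_input_ids),
...     labels=torch.randint(0, 32128, (1, 8)),
... )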
tie_weights()[source]#

Tie the weights between the input embeddings and the output embeddings and (if enabled) tie encoder/decoder weights.
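
Weight tying of this kind is typically implemented in PyTorch by pointing the output projection at the embedding's parameter, so both modules update the same tensor. A generic sketch of the pattern, not this class's internal code:

>>> import torch.nn as nn
>>> embedding = nn.Embedding(32128, 512)          # input embedding table
>>> lm_head = nn.Linear(512, 32128, bias=False)   # language model head
>>> lm_head.weight = embedding.weight             # both now share one parameter tensor
>>> lm_head.weight is embedding.weight
True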